Michael

Posted on Jun 11 • Originally published at gbase.cn

GBase 8c Distributed Cluster Operations: Pitfalls, Performance Tuning, and Quick Fixes

#gbase #database #数据库 #operations

Running a gbase database distributed cluster in production isn't just about keeping it up — it's about keeping it fast, stable, and easy to troubleshoot. This guide covers deployment pitfalls, performance tuning, daily checks, and common failure recovery with ready‑to‑run commands.

1. Deployment Pitfalls: Eliminate Problems at the Source

1.1 Hardware Sizing: Fit the Workload, Not the Budget

Cluster Size	Data / Concurrency	CN Spec	DN Spec	Minimum DNs
Small	≤100 GB, ≤500 concurrent	2 vCPU / 4 GB RAM, 100 GB disk	4 vCPU / 8 GB RAM, 500 GB disk	2
Medium	100 GB–1 TB, 500–2000 concurrent	4 vCPU / 8 GB RAM, 200 GB disk	8 vCPU / 16 GB RAM, 1 TB disk	4
Large	≥1 TB, ≥2000 concurrent	8 vCPU / 16 GB RAM, 500 GB disk	16 vCPU / 32 GB RAM, 2 TB+ disk	8+

Always prefer SSDs. Never mix nodes with different hardware specs — it causes load imbalance.

1.2 Network: Keep Latency Under 10ms

All nodes must be on the same LAN with latency ≤ 10 ms. Cross‑site deployments need dedicated lines with ≤ 50 ms latency. Disable firewalls or open ports 5432 and 26000. Bandwidth: ≥ 1 Gbps for medium clusters, ≥ 10 Gbps for large ones.

1.3 Parameters: Never Use Defaults

-- Connections
ALTER SYSTEM SET max_connections = 2000;
ALTER SYSTEM SET max_prepared_transactions = 1000;

-- Memory (50–60% of node RAM)
ALTER SYSTEM SET shared_buffers = 8GB;        -- adjust to your node
ALTER SYSTEM SET work_mem = 64MB;
ALTER SYSTEM SET maintenance_work_mem = 1GB;

-- SSD I/O tuning
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET random_page_cost = 1.1;

-- WAL sync
ALTER SYSTEM SET wal_buffers = 64MB;
ALTER SYSTEM SET synchronous_commit = on;

SELECT pg_reload_conf();

2. Performance Tuning: Get 50% Faster Response

2.1 SQL Tuning

-- Add indexes on frequent filter columns
CREATE INDEX idx_user_userid ON "user"(user_id);
CREATE INDEX idx_order_orderno_createtime ON "order"(order_no, create_time);

-- Avoid SELECT *
SELECT user_id, user_name, phone FROM "user" WHERE user_id = 123;

-- Use COPY for bulk loads (10x faster than INSERT)
COPY "user"(user_id, user_name, phone) FROM '/data/user_data.csv' WITH CSV;

2.2 Memory Tuning

-- shared_buffers: 50–60% of RAM
ALTER SYSTEM SET shared_buffers = 8GB;   -- for 16GB node

-- work_mem: smaller for high concurrency, larger for complex queries
ALTER SYSTEM SET work_mem = 32MB;        -- high concurrency
ALTER SYSTEM SET work_mem = 128MB;       -- low concurrency, complex queries

-- Background writer
ALTER SYSTEM SET bgwriter_delay = 200ms;
ALTER SYSTEM SET bgwriter_lru_maxpages = 100;
SELECT pg_reload_conf();

2.3 I/O Tuning

ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();

Place data and WAL logs on separate SSDs. Schedule backups during off‑peak hours, favour incremental backups, and regularly archive old data.

2.4 Shard Optimisation

Hash‑shard critical tables on high‑frequency join/query columns (e.g., user_id). Shard count = number of DNs × 2~4.

-- Check shard distribution
SELECT * FROM pg_dist_shard;
SELECT node_name, count(*) AS shard_count FROM pg_dist_shard GROUP BY node_name;

3. Daily Health Checks (10 Minutes a Day)

3.1 Cluster Status (Every Day)

gs_om -t status --detail   # All nodes must be Normal

top                         # CPU ≤70%, Memory ≤80%
iostat -x 1                 # %util ≤80%

SELECT count(*) FROM pg_stat_activity;
SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
SELECT * FROM pg_stat_replication;

3.2 Data Consistency (Weekly)

gs_sync_check
gs_basebackup -C -D /backup/gbase8c/full/... -h ... -p 5432 -U backup_user -v
grep -E "error|warn|fail" /GBase_HOME/log/gbase-xxxx.log

4. Common Failures and Quick Fixes

4.1 DN Node Offline

gbase_ctl status -D /data/gbase8c/dn1
grep -E "error|fail" /GBase_HOME/log/gbase-xxxx.log

# Restart the node
gbase_ctl start -D /data/gbase8c/dn1
# Or restart network
systemctl restart network
# Or recover from backup

4.2 Slow SQL Queries

SELECT pid, query, duration FROM pg_stat_activity WHERE duration > 10000;
EXPLAIN ANALYZE SELECT ...;

Fix with indexes, SQL rewrites, or increasing work_mem.

4.3 Backup Failures

Permission denied: GRANT BACKUP, REPLICATION ON DATABASE gbase TO backup_user;
Disk full: purge expired backups and WAL logs.

4.4 Lock Wait Timeout

SELECT pid, query, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event_type = 'Lock';
SELECT pg_terminate_backend(12345);
ALTER SYSTEM SET lock_timeout = 30000;
SELECT pg_reload_conf();

4.5 Cluster Won't Start Due to Invalid Configuration

Check postgresql.conf for wrong parameters. If you can't find the error, restore from the sample file and re‑tune.

5. Key Takeaways

Deploy correctly: match hardware and network to your scale.
Tune in order: SQL and sharding first, then memory and I/O.
Inspect daily: cluster status, resource usage, log anomalies.
Diagnose before acting: always check logs first, then apply the matching fix from this guide.

The goal of operations isn't to fight fires — it's to prevent them. Regular checks and a systematic tuning approach keep your gbase database cluster fast, stable, and predictable.

DEV Community