DEV Community

Michael
Michael

Posted on • Originally published at gbase.cn

GBase 8c Distributed Cluster Operations: Pitfalls, Performance Tuning, and Quick Fixes

Running a gbase database distributed cluster in production isn't just about keeping it up — it's about keeping it fast, stable, and easy to troubleshoot. This guide covers deployment pitfalls, performance tuning, daily checks, and common failure recovery with ready‑to‑run commands.

1. Deployment Pitfalls: Eliminate Problems at the Source

1.1 Hardware Sizing: Fit the Workload, Not the Budget

Cluster Size Data / Concurrency CN Spec DN Spec Minimum DNs
Small ≤100 GB, ≤500 concurrent 2 vCPU / 4 GB RAM, 100 GB disk 4 vCPU / 8 GB RAM, 500 GB disk 2
Medium 100 GB–1 TB, 500–2000 concurrent 4 vCPU / 8 GB RAM, 200 GB disk 8 vCPU / 16 GB RAM, 1 TB disk 4
Large ≥1 TB, ≥2000 concurrent 8 vCPU / 16 GB RAM, 500 GB disk 16 vCPU / 32 GB RAM, 2 TB+ disk 8+

Always prefer SSDs. Never mix nodes with different hardware specs — it causes load imbalance.

1.2 Network: Keep Latency Under 10ms

All nodes must be on the same LAN with latency ≤ 10 ms. Cross‑site deployments need dedicated lines with ≤ 50 ms latency. Disable firewalls or open ports 5432 and 26000. Bandwidth: ≥ 1 Gbps for medium clusters, ≥ 10 Gbps for large ones.

1.3 Parameters: Never Use Defaults

-- Connections
ALTER SYSTEM SET max_connections = 2000;
ALTER SYSTEM SET max_prepared_transactions = 1000;

-- Memory (50–60% of node RAM)
ALTER SYSTEM SET shared_buffers = 8GB;        -- adjust to your node
ALTER SYSTEM SET work_mem = 64MB;
ALTER SYSTEM SET maintenance_work_mem = 1GB;

-- SSD I/O tuning
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET random_page_cost = 1.1;

-- WAL sync
ALTER SYSTEM SET wal_buffers = 64MB;
ALTER SYSTEM SET synchronous_commit = on;

SELECT pg_reload_conf();
Enter fullscreen mode Exit fullscreen mode

2. Performance Tuning: Get 50% Faster Response

2.1 SQL Tuning

-- Add indexes on frequent filter columns
CREATE INDEX idx_user_userid ON "user"(user_id);
CREATE INDEX idx_order_orderno_createtime ON "order"(order_no, create_time);

-- Avoid SELECT *
SELECT user_id, user_name, phone FROM "user" WHERE user_id = 123;

-- Use COPY for bulk loads (10x faster than INSERT)
COPY "user"(user_id, user_name, phone) FROM '/data/user_data.csv' WITH CSV;
Enter fullscreen mode Exit fullscreen mode

2.2 Memory Tuning

-- shared_buffers: 50–60% of RAM
ALTER SYSTEM SET shared_buffers = 8GB;   -- for 16GB node

-- work_mem: smaller for high concurrency, larger for complex queries
ALTER SYSTEM SET work_mem = 32MB;        -- high concurrency
ALTER SYSTEM SET work_mem = 128MB;       -- low concurrency, complex queries

-- Background writer
ALTER SYSTEM SET bgwriter_delay = 200ms;
ALTER SYSTEM SET bgwriter_lru_maxpages = 100;
SELECT pg_reload_conf();
Enter fullscreen mode Exit fullscreen mode

2.3 I/O Tuning

ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
Enter fullscreen mode Exit fullscreen mode

Place data and WAL logs on separate SSDs. Schedule backups during off‑peak hours, favour incremental backups, and regularly archive old data.

2.4 Shard Optimisation

Hash‑shard critical tables on high‑frequency join/query columns (e.g., user_id). Shard count = number of DNs × 2~4.

-- Check shard distribution
SELECT * FROM pg_dist_shard;
SELECT node_name, count(*) AS shard_count FROM pg_dist_shard GROUP BY node_name;
Enter fullscreen mode Exit fullscreen mode

3. Daily Health Checks (10 Minutes a Day)

3.1 Cluster Status (Every Day)

gs_om -t status --detail   # All nodes must be Normal

top                         # CPU ≤70%, Memory ≤80%
iostat -x 1                 # %util ≤80%
Enter fullscreen mode Exit fullscreen mode
SELECT count(*) FROM pg_stat_activity;
SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
SELECT * FROM pg_stat_replication;
Enter fullscreen mode Exit fullscreen mode

3.2 Data Consistency (Weekly)

gs_sync_check
gs_basebackup -C -D /backup/gbase8c/full/... -h ... -p 5432 -U backup_user -v
grep -E "error|warn|fail" /GBase_HOME/log/gbase-xxxx.log
Enter fullscreen mode Exit fullscreen mode

4. Common Failures and Quick Fixes

4.1 DN Node Offline

gbase_ctl status -D /data/gbase8c/dn1
grep -E "error|fail" /GBase_HOME/log/gbase-xxxx.log

# Restart the node
gbase_ctl start -D /data/gbase8c/dn1
# Or restart network
systemctl restart network
# Or recover from backup
Enter fullscreen mode Exit fullscreen mode

4.2 Slow SQL Queries

SELECT pid, query, duration FROM pg_stat_activity WHERE duration > 10000;
EXPLAIN ANALYZE SELECT ...;
Enter fullscreen mode Exit fullscreen mode

Fix with indexes, SQL rewrites, or increasing work_mem.

4.3 Backup Failures

  • Permission denied: GRANT BACKUP, REPLICATION ON DATABASE gbase TO backup_user;
  • Disk full: purge expired backups and WAL logs.

4.4 Lock Wait Timeout

SELECT pid, query, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event_type = 'Lock';
SELECT pg_terminate_backend(12345);
ALTER SYSTEM SET lock_timeout = 30000;
SELECT pg_reload_conf();
Enter fullscreen mode Exit fullscreen mode

4.5 Cluster Won't Start Due to Invalid Configuration

Check postgresql.conf for wrong parameters. If you can't find the error, restore from the sample file and re‑tune.

5. Key Takeaways

  • Deploy correctly: match hardware and network to your scale.
  • Tune in order: SQL and sharding first, then memory and I/O.
  • Inspect daily: cluster status, resource usage, log anomalies.
  • Diagnose before acting: always check logs first, then apply the matching fix from this guide.

The goal of operations isn't to fight fires — it's to prevent them. Regular checks and a systematic tuning approach keep your gbase database cluster fast, stable, and predictable.

Top comments (0)