Running a gbase database distributed cluster in production isn't just about keeping it up — it's about keeping it fast, stable, and easy to troubleshoot. This guide covers deployment pitfalls, performance tuning, daily checks, and common failure recovery with ready‑to‑run commands.
1. Deployment Pitfalls: Eliminate Problems at the Source
1.1 Hardware Sizing: Fit the Workload, Not the Budget
| Cluster Size | Data / Concurrency | CN Spec | DN Spec | Minimum DNs |
|---|---|---|---|---|
| Small | ≤100 GB, ≤500 concurrent | 2 vCPU / 4 GB RAM, 100 GB disk | 4 vCPU / 8 GB RAM, 500 GB disk | 2 |
| Medium | 100 GB–1 TB, 500–2000 concurrent | 4 vCPU / 8 GB RAM, 200 GB disk | 8 vCPU / 16 GB RAM, 1 TB disk | 4 |
| Large | ≥1 TB, ≥2000 concurrent | 8 vCPU / 16 GB RAM, 500 GB disk | 16 vCPU / 32 GB RAM, 2 TB+ disk | 8+ |
Always prefer SSDs. Never mix nodes with different hardware specs — it causes load imbalance.
1.2 Network: Keep Latency Under 10ms
All nodes must be on the same LAN with latency ≤ 10 ms. Cross‑site deployments need dedicated lines with ≤ 50 ms latency. Disable firewalls or open ports 5432 and 26000. Bandwidth: ≥ 1 Gbps for medium clusters, ≥ 10 Gbps for large ones.
1.3 Parameters: Never Use Defaults
-- Connections
ALTER SYSTEM SET max_connections = 2000;
ALTER SYSTEM SET max_prepared_transactions = 1000;
-- Memory (50–60% of node RAM)
ALTER SYSTEM SET shared_buffers = 8GB; -- adjust to your node
ALTER SYSTEM SET work_mem = 64MB;
ALTER SYSTEM SET maintenance_work_mem = 1GB;
-- SSD I/O tuning
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET random_page_cost = 1.1;
-- WAL sync
ALTER SYSTEM SET wal_buffers = 64MB;
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();
2. Performance Tuning: Get 50% Faster Response
2.1 SQL Tuning
-- Add indexes on frequent filter columns
CREATE INDEX idx_user_userid ON "user"(user_id);
CREATE INDEX idx_order_orderno_createtime ON "order"(order_no, create_time);
-- Avoid SELECT *
SELECT user_id, user_name, phone FROM "user" WHERE user_id = 123;
-- Use COPY for bulk loads (10x faster than INSERT)
COPY "user"(user_id, user_name, phone) FROM '/data/user_data.csv' WITH CSV;
2.2 Memory Tuning
-- shared_buffers: 50–60% of RAM
ALTER SYSTEM SET shared_buffers = 8GB; -- for 16GB node
-- work_mem: smaller for high concurrency, larger for complex queries
ALTER SYSTEM SET work_mem = 32MB; -- high concurrency
ALTER SYSTEM SET work_mem = 128MB; -- low concurrency, complex queries
-- Background writer
ALTER SYSTEM SET bgwriter_delay = 200ms;
ALTER SYSTEM SET bgwriter_lru_maxpages = 100;
SELECT pg_reload_conf();
2.3 I/O Tuning
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
Place data and WAL logs on separate SSDs. Schedule backups during off‑peak hours, favour incremental backups, and regularly archive old data.
2.4 Shard Optimisation
Hash‑shard critical tables on high‑frequency join/query columns (e.g., user_id). Shard count = number of DNs × 2~4.
-- Check shard distribution
SELECT * FROM pg_dist_shard;
SELECT node_name, count(*) AS shard_count FROM pg_dist_shard GROUP BY node_name;
3. Daily Health Checks (10 Minutes a Day)
3.1 Cluster Status (Every Day)
gs_om -t status --detail # All nodes must be Normal
top # CPU ≤70%, Memory ≤80%
iostat -x 1 # %util ≤80%
SELECT count(*) FROM pg_stat_activity;
SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
SELECT * FROM pg_stat_replication;
3.2 Data Consistency (Weekly)
gs_sync_check
gs_basebackup -C -D /backup/gbase8c/full/... -h ... -p 5432 -U backup_user -v
grep -E "error|warn|fail" /GBase_HOME/log/gbase-xxxx.log
4. Common Failures and Quick Fixes
4.1 DN Node Offline
gbase_ctl status -D /data/gbase8c/dn1
grep -E "error|fail" /GBase_HOME/log/gbase-xxxx.log
# Restart the node
gbase_ctl start -D /data/gbase8c/dn1
# Or restart network
systemctl restart network
# Or recover from backup
4.2 Slow SQL Queries
SELECT pid, query, duration FROM pg_stat_activity WHERE duration > 10000;
EXPLAIN ANALYZE SELECT ...;
Fix with indexes, SQL rewrites, or increasing work_mem.
4.3 Backup Failures
- Permission denied:
GRANT BACKUP, REPLICATION ON DATABASE gbase TO backup_user; - Disk full: purge expired backups and WAL logs.
4.4 Lock Wait Timeout
SELECT pid, query, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event_type = 'Lock';
SELECT pg_terminate_backend(12345);
ALTER SYSTEM SET lock_timeout = 30000;
SELECT pg_reload_conf();
4.5 Cluster Won't Start Due to Invalid Configuration
Check postgresql.conf for wrong parameters. If you can't find the error, restore from the sample file and re‑tune.
5. Key Takeaways
- Deploy correctly: match hardware and network to your scale.
- Tune in order: SQL and sharding first, then memory and I/O.
- Inspect daily: cluster status, resource usage, log anomalies.
- Diagnose before acting: always check logs first, then apply the matching fix from this guide.
The goal of operations isn't to fight fires — it's to prevent them. Regular checks and a systematic tuning approach keep your gbase database cluster fast, stable, and predictable.
Top comments (0)