DEV Community

Michael

Originally published at gbase.cn

GBase 8c Backup, Recovery, and Data Security: A Production‑Ready Guide

Backup and recovery are the last line of defense for your GBase database. This guide covers backup types, strategy design, step‑by‑step recovery procedures, and security measures.

1. Backup Types and Use Cases

Type | Characteristics | Backup Speed | Recovery Speed | Use Case
Full | All data, metadata, config | Slow | Fast | Weekly, before major changes
Incremental | Changes since last backup | Fast | Medium (requires full + all incrementals) | Daily
Differential | Changes since last full | Moderate | Medium | Moderate change, shorter chain
WAL Log | Real‑time write‑ahead log | Very fast | Point‑in‑time | Always used to minimise RPO

Backups are coordinated by the CN; each DN backs up its own data. Recovery must follow the “CN first, then DNs” order.

2. Backup Strategy Design

Core principles:

  • Meet RTO/RPO targets (e.g., RTO ≤ 30 min, RPO ≤ 5 min).
  • Run during off‑peak hours (0:00–3:00).
  • Combine full + incremental + WAL archiving.
  • Validate every backup immediately; alert on failure.
  • Store backups on a physically separate device, plus an offsite copy.
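
One way to watch the RPO target from the list above is to check how old the newest archived WAL file is. The sketch below assumes GNU find and date; `wal_archive_lag_seconds` and `rpo_ok` are hypothetical helper names, not GBase tools, and the archive path in the usage note is taken from the examples later in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the age in seconds of the newest file in a WAL archive directory.
# Returns 1 (and prints nothing) if the directory holds no files.
wal_archive_lag_seconds() {
  local dir=$1 newest now
  # GNU find prints each file's mtime as epoch seconds; keep the largest value.
  newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)
  [ -n "$newest" ] || return 1
  now=$(date +%s)
  # Strip the fractional part before doing integer arithmetic.
  echo $(( now - ${newest%.*} ))
}

# Hypothetical helper: succeed only if the archive lag is within the RPO budget (seconds).
rpo_ok() {
  local dir=$1 budget=$2 lag
  lag=$(wal_archive_lag_seconds "$dir") || return 1
  [ "$lag" -le "$budget" ]
}
```

For a 5-minute RPO, a monitoring job could call `rpo_ok /backup/gbase8c/wal 300` and raise an alert on a non-zero exit status.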

Example strategy (1 TB cluster):

  • Full: Sunday 0:00, retained 30 days.
  • Incremental: Mon–Sat 0:00, retained 7 days.
  • WAL: real‑time archiving, 5‑minute rotation, retained 15 days.
  • Validation: automatic after each backup.
  • Storage: dedicated NAS + offsite backup.
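
With a Sunday full backup and Monday to Saturday incrementals, restoring to a given day needs that week's full backup plus every incremental up to the target day. The sketch below lists that chain. It assumes GNU date, backup directories named by date (`full/YYYYMMDD`, `incremental/YYYYMMDD`), and that the target day's backup has already run; `restore_chain` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the backups needed to restore to a target day (YYYYMMDD),
# assuming a full backup every Sunday and an incremental on every other day.
restore_chain() {
  local target=$1 ts dow i
  ts=$(date -u -d "$target" +%s) || return 1
  dow=$(date -u -d "@$ts" +%w)   # 0 = Sunday ... 6 = Saturday
  # The most recent Sunday on or before the target day holds the full backup.
  echo "full/$(date -u -d "@$(( ts - dow * 86400 ))" +%Y%m%d)"
  # Every day after that Sunday, up to and including the target, holds an incremental.
  for (( i = dow - 1; i >= 0; i-- )); do
    echo "incremental/$(date -u -d "@$(( ts - i * 86400 ))" +%Y%m%d)"
  done
}
```

Running `restore_chain` before a restore makes it easy to confirm that no link in the chain is missing from storage.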

3. Essential Backup Commands with Steps

3.1 Full Backup

# Run on the CN node to back up all data, metadata, and configuration
#   -h / -p     CN IP and port
#   -U          Dedicated backup user with necessary privileges
#   -F c        Custom format for easy restore
#   -X stream   Stream backup to reduce I/O
#   -P / -v     Show progress, verbose logging
gs_basebackup -D /backup/gbase8c/full/$(date +%Y%m%d) \
  -h 192.168.1.100 \
  -p 5432 \
  -U backup_user \
  -F c \
  -X stream \
  -P \
  -v

3.2 Incremental Backup

# Specify the previous full backup directory with --base-backup-dir
gs_backup incremental \
  --backup-dir /backup/gbase8c/incremental/$(date +%Y%m%d) \
  --base-backup-dir /backup/gbase8c/full/20260330 \
  --host 192.168.1.100 \
  --port 5432 \
  --user backup_user \
  --verbose

3.3 Enable WAL Archiving

-- Configure automatic archiving to prevent log loss
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'cp %p /backup/gbase8c/wal/%f';
ALTER SYSTEM SET archive_timeout = 300;   -- Rotate every 5 minutes
SELECT pg_reload_conf();                  -- Reload the configuration; in PostgreSQL-derived engines a wal_level change only takes effect after a restart
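
A plain `cp` in `archive_command` silently overwrites a segment that already exists in the archive. The sketch below is a hypothetical wrapper (`archive_wal` is not a GBase tool) that refuses to overwrite and copies through a temporary file so a half-written segment never appears under its final name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy one WAL segment into the archive directory.
# Returns non-zero if the segment is already archived or the copy fails.
archive_wal() {
  local src=$1 dest_dir=$2 name tmp
  name=$(basename "$src")
  # Never overwrite an existing segment; a non-zero status tells the server to retry later.
  [ ! -e "$dest_dir/$name" ] || return 1
  tmp="$dest_dir/.$name.tmp"
  # Copy under a temporary name, then rename so the final file appears atomically.
  cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
  mv "$tmp" "$dest_dir/$name"
}
```

Saved as a script that calls `archive_wal "$1" "$2"`, it could be referenced from `archive_command` with `%p` and the archive directory as arguments; the script location is left to your deployment.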

3.4 Backup Validation

# Verify full backup integrity
gs_basebackup -C -D /backup/gbase8c/full/20260330 \
  -h 192.168.1.100 -p 5432 -U backup_user -v

# Verify WAL log file
pg_waldump /backup/gbase8c/wal/20260330/000000010000000000000001 --verify
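
Independently of the database tooling, a checksum manifest catches files that were damaged or truncated after the backup was written, for example during the copy to offsite storage. This is a minimal sketch using GNU `sha256sum`; the function names and the `SHA256SUMS` file name are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record a SHA-256 checksum for every file in a backup directory.
write_manifest() {
  local dir=$1
  (
    cd "$dir" || exit 1
    # Sort the file list so the manifest is stable between runs; skip the manifest itself.
    find . -type f ! -name SHA256SUMS -print0 | sort -z | xargs -0 -r sha256sum > SHA256SUMS
  )
}

# Hypothetical helper: re-check every file against the manifest; non-zero on any mismatch.
verify_manifest() {
  local dir=$1
  ( cd "$dir" && sha256sum --check --quiet SHA256SUMS )
}
```

Run `write_manifest` right after the backup completes, and `verify_manifest` on the offsite copy and again before any restore.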

4. Recovery Procedures with Step‑by‑Step Instructions

Scenario 1: Single DN Node Data Loss

Symptom: DN2 node disk failure, all data lost, node marked offline, related shards inaccessible.

Prerequisites: Hardware replaced; recent full/incremental backups and complete WAL logs available.

Steps:

# 1. Stop the faulty node service (if still running)
gbase_ctl stop -D /data/gbase8c/dn2

# 2. Remove corrupted data directory
rm -rf /data/gbase8c/dn2/*

# 3. Restore full backup to DN2 (-h is the CN IP)
gs_basebackup -R -D /data/gbase8c/dn2 \
  -h 192.168.1.100 \
  -p 5432 \
  -U backup_user \
  -F c \
  -f /backup/gbase8c/full/20260330/full_backup.tar

# 4. Restore incremental backup if available
gs_backup restore incremental \
  --backup-dir /backup/gbase8c/incremental/20260331 \
  --target-dir /data/gbase8c/dn2 \
  --user backup_user \
  --verbose

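# 5. Apply WAL logs to synchronise to the moment before failure
#    Specify the failure time (adjust as needed).
#    In upstream PostgreSQL, pg_waldump only prints WAL records and pg_basebackup only
#    takes a new base backup; replay itself is driven by restore_command and a recovery
#    target set in the node's recovery configuration. Check the GBase 8c equivalents.
pg_waldump /backup/gbase8c/wal/20260331/ --start-time '2026-03-31 08:00:00'
pg_basebackup -X fetch -D /data/gbase8c/dn2 --wal-method=stream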

# 6. Start the node and verify status
gbase_ctl start -D /data/gbase8c/dn2
gs_om -t status --detail   # Confirm node Normal, shards synced

# 7. Test affected shards to ensure business continuity (run in a SQL session;
#    "order" is a reserved word, so it must be double-quoted)
SELECT * FROM "order" WHERE shard_id IN (xxx, xxx);

Important: Ensure hardware is healthy before restore; prohibit writes during recovery; log recovery must target the exact failure moment for consistency.
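
The `rm -rf` step is the riskiest line in this procedure: a typo or an empty variable can point it at the wrong directory, and the `/*` glob leaves dotfiles behind. The sketch below is a hypothetical guard (`purge_data_dir` is not a GBase tool) that only empties a directory located under an expected root, which defaults to the `/data/gbase8c` path used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: empty a node's data directory, but only if it sits under the expected root.
purge_data_dir() {
  local dir=$1 root=${2:-/data/gbase8c}
  case $dir in
    # Reject any path containing "..", which could escape the root.
    *..*) echo "refusing to purge '$dir'" >&2; return 1 ;;
    # Accept only a non-empty path component below the root.
    "$root"/?*) ;;
    # Reject everything else: empty string, "/", the root itself, other trees.
    *) echo "refusing to purge '$dir'" >&2; return 1 ;;
  esac
  [ -d "$dir" ] || return 1
  # Remove the contents, including dotfiles, while keeping the directory itself.
  find "$dir" -mindepth 1 -delete
}
```

For this scenario the call would be `purge_data_dir /data/gbase8c/dn2`, run only after the node has been stopped.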

Scenario 2: Accidental Table Drop

Symptom: DROP TABLE user executed by mistake; need to recover the table with RPO < 5 minutes.

Steps:

# 1. Identify the exact drop time from logs; the recovery end time should be 1-2 seconds earlier
grep "DROP TABLE user" /GBase_HOME/log/gbase-xxxx.log

# 2. Create a temporary restore directory to avoid overwriting live data
mkdir -p /data/gbase8c/temp_restore

# 3. Restore full backup to the temporary directory
gs_basebackup -R -D /data/gbase8c/temp_restore \
  -h 192.168.1.100 \
  -p 5432 \
  -U backup_user \
  -F c \
  -f /backup/gbase8c/full/20260330/full_backup.tar

# 4. Apply WAL logs up to the moment just before the drop (e.g., 10:29:59)
pg_waldump /backup/gbase8c/wal/20260331/ \
  --start-time '2026-03-31 00:00:00' \
  --end-time '2026-03-31 10:29:59' \
  -f /data/gbase8c/temp_restore/wal_restore.sql

# 5. Export the mistakenly deleted table data from the instance restored in the temporary directory
#    (connect to that instance, not the live cluster; "user" is a reserved word and must be quoted)
gbase -U backup_user -d gbase -c "COPY (SELECT * FROM \"user\") TO '/data/gbase8c/temp_restore/user_data.csv' WITH CSV;"

# 6. Recreate the table on the live cluster if it no longer exists, then import the data
gbase -U backup_user -d gbase -c "COPY \"user\" FROM '/data/gbase8c/temp_restore/user_data.csv' WITH CSV;"

# 7. Verify data integrity (run in a SQL session)
SELECT COUNT(*) FROM "user";
SELECT * FROM "user" LIMIT 10;

Important: Always use a temporary directory to avoid overwriting current data; the recovery timestamp must be slightly before the mistake; validate row counts and content after import.
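
Picking a recovery timestamp "slightly before the mistake" is easy to get wrong by hand under pressure. A minimal sketch, assuming GNU date: `recovery_target_before` (a hypothetical helper name) subtracts a safety margin in seconds from the logged time of the bad statement, using epoch arithmetic to avoid ambiguous relative-date parsing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a timestamp that is `margin` seconds (default 2)
# before the given event time, in 'YYYY-MM-DD HH:MM:SS' format.
recovery_target_before() {
  local event_time=$1 margin=${2:-2} epoch
  # Convert to epoch seconds first so the subtraction is plain integer arithmetic.
  epoch=$(date -d "$event_time" +%s) || return 1
  date -d "@$(( epoch - margin ))" '+%Y-%m-%d %H:%M:%S'
}
```

The result can be used as the end time for WAL replay. Both conversions use the shell's current time zone, so run it with the same `TZ` the database log uses.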

Scenario 3: Full Cluster Crash (e.g., power outage corrupting all nodes)

Symptom: Entire cluster down, all node data damaged; must recover from backups with RTO ≤ 30 minutes.

Steps:

# 1. Ensure all hardware is healthy and network is operational
systemctl start network
systemctl start firewalld   # If rules are configured

# 2. Recover CN node first (coordinator must be ready before DNs)
gbase_ctl stop -D /data/gbase8c/cn   # Stop if already started
rm -rf /data/gbase8c/cn/*            # Purge corrupted data
gs_basebackup -R -D /data/gbase8c/cn \
  -h 192.168.1.100 \
  -p 5432 \
  -U backup_user \
  -F c \
  -f /backup/gbase8c/full/20260330/full_backup.tar
gbase_ctl start -D /data/gbase8c/cn

# 3. Recover all DN nodes one by one (example: dn1~dn4)
for dn in dn1 dn2 dn3 dn4; do
  gbase_ctl stop -D /data/gbase8c/$dn
  rm -rf /data/gbase8c/$dn/*
  gs_basebackup -R -D /data/gbase8c/$dn \
    -h 192.168.1.100 \
    -p 5432 \
    -U backup_user \
    -F c \
    -f /backup/gbase8c/full/20260330/full_backup.tar
  gbase_ctl start -D /data/gbase8c/$dn
done

# 4. Stop the cluster, apply WAL logs up to the failure time
gs_om -t stop
pg_waldump /backup/gbase8c/wal/20260331/ \
  --start-time '2026-03-31 00:00:00' \
  --end-time '2026-03-31 14:00:00'   # Failure time
gs_om -t start

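# 5. Verify cluster status and data consistency
gs_om -t status --detail   # All nodes should be Normal
gs_sync_check              # Shard synchronisation check
# Then, in a SQL session, compare the row count with the pre‑failure count
# ("order" is a reserved word, so it must be double-quoted):
SELECT COUNT(*) FROM "order";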

Important: Strictly follow “CN first, then DNs”; all nodes must have network connectivity; perform full business testing after recovery.
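
To know whether a recovery actually met the 30-minute RTO, it helps to time the procedure rather than estimate afterwards. The sketch below is one possible approach: `within_rto` and `run_step` are hypothetical helpers that wrap each recovery command and fail as soon as the cumulative elapsed time exceeds the budget.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if (end - start) seconds fits within budget_min minutes.
within_rto() {
  local start=$1 end=$2 budget_min=$3
  [ $(( end - start )) -le $(( budget_min * 60 )) ]
}

# Hypothetical helper: run one recovery step, report elapsed time on stderr,
# and fail if the step fails or the RTO budget has been exceeded.
run_step() {
  local start=$1 budget_min=$2 now
  shift 2
  "$@" || return 1
  now=$(date +%s)
  echo "elapsed $(( now - start ))s after: $*" >&2
  within_rto "$start" "$now" "$budget_min"
}
```

A drill script might set `start=$(date +%s)` once and then prefix each command, for example `run_step "$start" 30 gs_om -t status --detail`.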

5. Data Security Measures

  • Least privilege: Dedicated backup user; revoke DROP/TRUNCATE from business accounts.
  • Storage: Local + offsite + encryption; regularly purge expired backups.
  • Audit & monitoring: Enable audit logs; monitor backup failures and dangerous operations in real time.
  • Regular drills: Quarterly restore exercises covering single node, full cluster, and accidental drop scenarios.
  • Hardware redundancy: Use RAID, monitor environment, replace ageing hardware proactively.
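
Purging expired backups can be scripted with `find`. The sketch below assumes GNU find and one entry per backup directly under each backup root, as in the paths used earlier; `purge_expired` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete top-level entries in a backup root that are older than `days` days.
purge_expired() {
  local dir=$1 days=$2
  # -mtime +N matches entries whose age, rounded down to whole days, is greater than N.
  # -maxdepth 1 keeps find from descending into backups it is about to delete.
  find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -print -exec rm -rf {} +
}
```

With the example retention periods, a nightly job could call `purge_expired /backup/gbase8c/full 30`, `purge_expired /backup/gbase8c/incremental 7`, and `purge_expired /backup/gbase8c/wal 15`. Take care not to delete a full backup that retained incrementals still depend on.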

6. Common Pitfalls and Correct Practices

Pitfall | Correct Practice
Only full backups | Combine full + incremental + WAL
Backups stored on the cluster itself | Independent storage with offsite copy
Skipping backup validation | Validate every backup immediately
Restoring without consistency checks | Restore CN first, then DNs; verify shards
Over‑privileged backup user | Least privilege, regular audits
Never practicing restores | Quarterly drills to optimise the process

All commands, strategies, and recovery procedures presented here are battle‑tested in production GBase database environments. Apply them directly to safeguard your GBase deployment.
