
Michael

Posted on • Originally published at gbase.cn

GBase 8a Sync Internals and GVR Enhancements

Data synchronization is essential for consistency and high availability in distributed databases. GBase 8a, a domestically developed Chinese MPP database from GBASE, builds its sync mechanism on columnar storage units and dedicated shard‑level tools, then layers GBase Visio Rsynctool (GVR) on top for additional performance and manageability.

1. How GBase 8a Synchronization Works

Storage Fundamentals: DCs and Metadata

GBase 8a uses columnar storage with block management as the foundation for sync:

  • DataCell (DC): Each column is vertically split into DCs, each holding 64K rows of data. A DC is the basic unit of I/O operations.
  • DC File (seg file): Multiple DCs are packed into a single seg file, which auto‑splits at 2 GB to avoid performance degradation from very large files.
  • Metadata File: Acts as the "navigation map" for sync, recording each DC's SCN (System Change Number for data versioning), seg file location, offset, and row count. This is the core data used to detect differences and perform precise synchronization.
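The metadata record described above can be pictured as a small per‑DC struct. This is an illustrative sketch, not GBase 8a's actual on‑disk format; all field names here are assumptions based on the description:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DCMeta:
    """Hypothetical per-DC metadata record (field names are illustrative)."""
    dc_id: int      # index of the DC within the column
    scn: int        # System Change Number: version of this DC's data
    seg_file: str   # which seg file holds the DC
    offset: int     # byte offset of the DC inside the seg file
    row_count: int  # rows stored in the DC (up to 64K)
```

Comparing two such records by `scn` is all the sync tools need to decide whether a DC must be re‑transferred.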

Shard Sync Tools: sync_client and sync_server

  • sync_server: Runs continuously, listens on port 5288 by default, accepts connections from sync_clients, and can handle multiple clients simultaneously.
  • sync_client: Launched by the node that needs synchronization; actively connects to the sync_server on a node holding valid data.

Both components can parse, read, and write the metadata and data files, forming the foundation for file‑level comparison and transfer.

Sync Process: Metadata Comparison to Data Overwrite

  1. sync_client connects to sync_server on port 5288 and sends the shard table information to synchronize.
  2. sync_server locates the table's metadata and returns a portion to the client.
  3. sync_client compares the metadata locally, identifies differing DCs, and sends the "diff DC list" back.
  4. sync_server returns the corresponding difference DC data and the full metadata file.
  5. sync_client overwrites the differing DCs and updates metadata locally, completing synchronization.
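The diff step at the heart of this protocol (steps 3–4) reduces to comparing SCNs per DC. A minimal sketch, assuming each side's metadata is summarized as a mapping from DC id to SCN (the function name and data shapes are illustrative, not GBase internals):

```python
def diff_dc_list(local_scns: dict, remote_scns: dict) -> list:
    """Return IDs of DCs the client must fetch: every DC that exists on the
    server whose SCN is missing locally or differs from the server's copy."""
    return sorted(dc_id for dc_id, scn in remote_scns.items()
                  if local_scns.get(dc_id) != scn)

# Example: DC 2 was updated remotely, DC 3 is new; only those are transferred.
needed = diff_dc_list({1: 10, 2: 20}, {1: 10, 2: 21, 3: 5})
```

Because only the DCs in `needed` cross the network, a mostly up‑to‑date replica syncs in a fraction of the time a full copy would take.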

2. GBase Visio Rsynctool (GVR)

GVR is a purpose‑built tool for GBase 8a MPP Cluster inter‑cluster synchronization, offering visual management, task scheduling, and performance optimizations.

Architecture

  • Frontend: Visual web console and API.
  • Backend: Authentication, authorization, sync configuration, data source management, job scheduling, logging, and configuration storage.
  • Underlying scripts: Connect source and target clusters via configuration files and sync scripts for efficient data transfer.

Pre-Sync Engine: Narrow the Sync Scope

Based on SCN (System Change Number) comparison, GVR only synchronizes tables that differ between the primary and standby clusters:

  • A getscn service is installed on nodes in both clusters to collect table SCNs.
  • GVR loads the collected data into a staging table, compares it, and identifies the inconsistent tables.
  • Pre‑sync can be enabled or disabled; it is optional for deployments with small data volumes or few nodes. getscn supports visual management.
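The staging‑table comparison can be sketched as merging the two clusters' SCN snapshots and keeping only the tables whose versions disagree. This is a simplified model (names and data shapes are assumptions), with each snapshot as a mapping from table name to SCN:

```python
def presync_inconsistent_tables(primary_scns: dict, standby_scns: dict) -> list:
    """Merge both SCN snapshots into a staging dict keyed by table name,
    then return tables whose SCNs differ or that exist on only one side."""
    staging = {}
    for table, scn in primary_scns.items():
        staging[table] = [scn, None]          # [primary SCN, standby SCN]
    for table, scn in standby_scns.items():
        staging.setdefault(table, [None, None])[1] = scn
    return sorted(t for t, (p, s) in staging.items() if p != s)
```

Only the tables this returns are handed to the sync engine, which is why pre‑sync pays off most when the clusters are already mostly consistent.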

Split Scheduling Engine: Break Through Performance Bottlenecks

Addressing the performance degradation and memory bloat of a single long‑running rsynctool process, GVR introduces split scheduling:

  • Each sync job is divided into multiple rsynctool sub‑jobs, each handling 10 tables before terminating and releasing resources.
  • This delivers a 2–3x performance gain, with larger sync workloads seeing bigger improvements.
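The splitting itself is simple batching: carve the table list into fixed‑size sub‑jobs so each rsynctool process exits and frees its memory after a bounded amount of work. A minimal sketch (the batch size of 10 comes from the text; the function name is illustrative):

```python
def split_sync_jobs(tables: list, batch_size: int = 10) -> list:
    """Split one sync job into sub-jobs of at most `batch_size` tables each.
    Each sub-job would be handled by a fresh, short-lived rsynctool process."""
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]
```

For 25 tables this yields three sub‑jobs of 10, 10, and 5 tables; the scheduler launches a new worker per sub‑job instead of one long‑running process.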

By combining optimized locking with pre‑sync and split scheduling, GVR significantly boosts the efficiency and availability of GBase 8a in active‑active and large‑scale data synchronization scenarios, keeping a GBase database consistent without sacrificing read‑write performance.
