Maya S.
Migrating Legacy Greenplum to Apache Cloudberry with cbcopy

In the field of data warehousing and big data analytics, Greenplum Database has long been recognized as a leading open-source Massively Parallel Processing (MPP) database. However, since Greenplum transitioned to a closed-source model, users have increasingly encountered limitations in areas such as version upgrades, bug fixes, and feature extensions. Against this backdrop, Apache Cloudberry emerged.

As an open-source derivative of Greenplum, Apache Cloudberry remains highly compatible with Greenplum’s architecture and SQL syntax while providing comprehensive enhancements in functionality, performance, and security. It has quickly become the most promising open-source alternative to Greenplum.

Beyond supporting more efficient parallel query execution and advanced resource management, the Apache Cloudberry community also introduces a dedicated data loading and migration tool — cbcopy — which enables seamless and highly efficient migration from Greenplum to Cloudberry.

This article provides an in-depth overview of cbcopy — its features, internal mechanisms, and practical usage — followed by a complete case study demonstrating how to perform a fast and seamless migration from Greenplum to Apache Cloudberry.

## Introduction to cbcopy

cbcopy is a data migration utility designed to transfer data across different database clusters. It can quickly replicate both metadata and actual data from a Greenplum cluster to an Apache Cloudberry cluster. The tool supports full migration of database objects, including schemas, tables, indexes, views, roles, user-defined functions, resource queues, and resource groups.

### Supported Levels of Migration

cbcopy supports four levels of database object migration:

  • Cluster-level migration – migrates the entire source cluster to the target cluster.
  • Database-level migration – migrates a specific database from the source cluster to the target cluster.
  • Schema-level migration – migrates a specified schema within a database from the source cluster to a target database.
  • Table-level migration – migrates specific tables from the source cluster to the target cluster.

### Support for Different Cluster Scales

cbcopy can handle migrations between clusters with different numbers of compute nodes (segments):
  • The source and target clusters have the same number of segments.
  • The source cluster has fewer segments than the target cluster.
  • The source cluster has more segments than the target cluster.

## cbcopy Mechanism

The cbcopy utility is implemented using the COPY ON SEGMENT TO PROGRAM mechanism and the external table feature. It employs data compression during transmission to reduce network resource usage and uses checksum verification to ensure data consistency between clusters.
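The compress-then-verify idea can be illustrated with a tiny standalone sketch. This uses plain `gzip` and `sha256sum` purely for illustration; cbcopy's actual wire format and checksum algorithm are internal to the tool:

```shell
# Illustrative only: stream data through compression, then verify that what
# was "received" matches what was sent. cbcopy applies the same principle
# (compressed transfer plus checksum verification) between clusters.
printf 'row1\nrow2\nrow3\n' > /tmp/rows.txt

gzip -c /tmp/rows.txt > /tmp/rows.gz             # compressed "transfer"
gunzip -c /tmp/rows.gz > /tmp/rows_received.txt  # decompress on the "target"

src_sum=$(sha256sum /tmp/rows.txt | cut -d' ' -f1)
dst_sum=$(sha256sum /tmp/rows_received.txt | cut -d' ' -f1)
[ "$src_sum" = "$dst_sum" ] && echo "checksum OK"
```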

To maximize migration performance in distributed environments, cbcopy automatically applies one of two migration strategies depending on the size of the source table:

  • Small tables (default threshold: fewer than 1,000,000 rows): data is transferred through a direct connection between the source cluster’s master node and the target cluster’s coordinator node.
  • Large tables (default threshold: more than 1,000,000 rows): cbcopy launches helper processes on the segments of both the source and target clusters. These helpers establish direct connections and transfer data in parallel between segments to achieve higher throughput.
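The cutover between these two paths is governed by the `--on-segment-threshold` option. As a sketch, the following lowers the threshold so that medium-sized tables also take the segment-to-segment path; the hosts, ports, and credentials mirror the test environment used later in this article and are placeholders for your own clusters:

```shell
# Lower the segment-parallel threshold from the default 1,000,000 rows to
# 10,000 rows. Host/user values are placeholders for your own clusters.
export PGPASSWORD=gpadmin
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --dbname="warehouse" --truncate --compression \
       --on-segment-threshold=10000
```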

## cbcopy in Practice

In this section, we demonstrate how to use cbcopy to migrate data from a Greenplum 6 database cluster to an Apache Cloudberry cluster in a test environment.

## 1. Test Environment

**Source: Greenplum Cluster**

| IP Address | Configuration | Version | Role |
| :--- | :--- | :--- | :--- |
| 192.168.194.55 | 4C / 16GB | Greenplum 6.27.1 | Master |
| 192.168.197.120 | 4C / 16GB | Greenplum 6.27.1 | Segment |
| 192.168.192.215 | 4C / 16GB | Greenplum 6.27.1 | Segment |

**Target: Apache Cloudberry Cluster**

| IP Address | Configuration | Version | Role |
| :--- | :--- | :--- | :--- |
| 192.168.194.137 | 4C / 16GB | Apache Cloudberry 2.0.0 | Coordinator |
| 192.168.192.93 | 4C / 16GB | Apache Cloudberry 2.0.0 | Segment |
| 192.168.196.69 | 4C / 16GB | Apache Cloudberry 2.0.0 | Segment |

### Test Data

The warehouse database in the Greenplum 6 cluster contains simulated “banking data warehouse” test data, including both transactional and historical tables.

## 2. Migration Preparation

Before running cbcopy, verify that the file `/usr/local/greenplum-db/bin/cbcopy_helper` exists on each Greenplum 6 node.

If it does not exist, copy the cbcopy_helper binary from the Cloudberry cluster to the Greenplum nodes as follows:

**On the Cloudberry coordinator node (192.168.194.137):**
```bash
su - root
scp /usr/local/cloudberry-db/bin/cbcopy_helper 192.168.194.55:/usr/local/greenplum-db/bin/
```

**On the Greenplum 6 master node (192.168.194.55):**
```bash
su - root
source /usr/local/greenplum-db/greenplum_path.sh
cd /usr/local/greenplum-db/bin
gpscp -f /home/gpadmin/hostfile_all cbcopy_helper =:$PWD/
```


---

## 3. Data Migration

Execute the cbcopy command from the Apache Cloudberry coordinator node to migrate data. cbcopy supports full cluster, database-level, schema-level, or table-level migration.

Migration logs are saved under `/home/gpadmin/gpAdminLogs` on the execution node.

### 3.1 Full Cluster Migration

Migrate all databases (in this case, dw and warehouse) from the Greenplum 6 cluster to the Cloudberry cluster:

```bash
export PGPASSWORD=gpadmin
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --full --truncate --compression
```


### 3.2 Database-Level Migration

Migrate only the warehouse database:

```bash
export PGPASSWORD=gpadmin
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --dbname="warehouse" --truncate --compression
```


### 3.3 Schema-Level Migration

Migrate the sh1 schema in the warehouse database:

```bash
export PGPASSWORD=gpadmin
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --truncate --compression --schema=warehouse.sh1
```


### 3.4 Table-Level Migration

Migrate the warehouse.public.cancel_accounts table:

```bash
export PGPASSWORD=gpadmin
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --truncate --compression --include-table="warehouse.public.cancel_accounts"
```


### 3.5 Common Parameters

| Parameter | Description |
| :--- | :--- |
| `--source-host` | Hostname or IP address of the source database master |
| `--source-port` | Port number of the source master |
| `--source-user` | User ID for connecting to the source database |
| `--dest-host` | Hostname or IP address of the target coordinator |
| `--dest-port` | Port number of the target coordinator |
| `--dest-user` | User ID for connecting to the target database |
| `--full` | Migrate the entire cluster; cannot be used with `--dbname`, `--include-table`, etc. |
| `--dbname` | Comma-separated list of source databases to copy |
| `--schema` | Comma-separated list of schemas to copy (format: `database.schema`) |
| `--include-table` | Comma-separated list of tables to copy (format: `database.schema.table`) |
| `--metadata-only` | Copy only metadata (DDL) without data |
| `--data-only` | Copy only data, excluding metadata |
| `--on-segment-threshold` | Row-count threshold for enabling segment-level parallel copy (default: 1,000,000) |
| `--truncate` | Truncate existing target tables before copy |
| `--append` | Append data to existing target tables |
| `--copy-jobs` | Number of parallel copy processes (default: 4) |
| `--compression` | Enable compression during data transfer |
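As a usage sketch combining several of these options, a migration can be split into a metadata pass and a data pass, raising `--copy-jobs` above the default of 4 only for the data-heavy step. The hosts and credentials below follow the test environment above and are placeholders for your own clusters:

```shell
export PGPASSWORD=gpadmin

# Pass 1: DDL only, so target objects can be reviewed before any data moves.
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --dbname="warehouse" --metadata-only

# Pass 2: data only, with more parallel copy processes for throughput.
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \
       --dbname="warehouse" --data-only --truncate --compression --copy-jobs=8
```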


## 4. Post-Migration Validation

After migration, compare the warehouse database objects (tables, indexes, views, sequences, and functions) between the Greenplum 6 source and the Apache Cloudberry target to verify data and metadata integrity.

### Object Validation (Tables, Indexes, Views, Sequences)

**1. Source: Greenplum 6**
```sql
SELECT n.nspname as "Schema",
       c.relname as "Name",
       CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
       pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
       CASE c.relstorage WHEN 'h' THEN 'heap' WHEN 'x' THEN 'external' WHEN 'a' THEN 'append only' WHEN 'v' THEN 'none' WHEN 'c' THEN 'append only columnar' WHEN 'p' THEN 'Apache Parquet' WHEN 'f' THEN 'foreign' END as "Storage"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','v','m','S','f','')
AND c.relstorage IN ('h','a','c','x','f','v','')
AND n.nspname <> 'pg_catalog'
AND n.nspname <> 'information_schema'
AND n.nspname !~ '^pg_toast'
AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;
```


**2. Target: Apache Cloudberry**
```sql
SELECT n.nspname as "Schema",
       c.relname as "Name",
       CASE c.relkind WHEN 'r' THEN 'table' WHEN 'd' THEN 'directory table' WHEN 'v' THEN 'view' WHEN 'm' THEN CASE c.relisdynamic WHEN true THEN 'dynamic table' ELSE 'materialized view' END WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",
       pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
       CASE a.amname WHEN 'ao_column' THEN 'append only columnar' WHEN 'ao_row' THEN 'append only' ELSE a.amname END as "Storage"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_catalog.pg_am a ON a.oid = c.relam
WHERE c.relkind IN ('r','p','v','m','S','f','')
AND n.nspname <> 'pg_catalog'
AND n.nspname <> 'information_schema'
AND n.nspname !~ '^pg_toast'
AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;
```
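Rather than eyeballing the two result sets, the comparison can be scripted: run each cluster's query from a file, write unaligned tuple-only output, and diff the listings. The file names `objects_gp6.sql` and `objects_cbdb.sql` below are assumptions (save the two queries above under those names); note that the "Storage" column can legitimately differ in wording between the two versions:

```shell
export PGPASSWORD=gpadmin

# -A -t: unaligned, tuples-only output, so the files diff cleanly.
psql -h 192.168.194.55  -p 5432 -U gpadmin -d warehouse -A -t \
     -f objects_gp6.sql  -o /tmp/objects_source.txt
psql -h 192.168.194.137 -p 5432 -U gpadmin -d warehouse -A -t \
     -f objects_cbdb.sql -o /tmp/objects_target.txt

# An empty diff means every schema/name/type/owner row matches.
diff /tmp/objects_source.txt /tmp/objects_target.txt && echo "object lists match"
```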


### User-Defined Function Validation

**1. Source: Greenplum 6**
```sql
SELECT n.nspname as "Schema",
       p.proname as "Name",
       pg_catalog.pg_get_function_result(p.oid) as "Result data type",
       pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",
       CASE
         WHEN p.proisagg THEN 'agg'
         WHEN p.proiswindow THEN 'window'
         WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'
         ELSE 'func'
       END as "Type"
FROM pg_catalog.pg_proc p
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
AND n.nspname <> 'pg_catalog'
AND n.nspname <> 'information_schema'
ORDER BY 1, 2, 4;
```


**2. Target: Apache Cloudberry**
```sql
SELECT n.nspname as "Schema",
       p.proname as "Name",
       pg_catalog.pg_get_function_result(p.oid) as "Result data type",
       pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",
       CASE p.prokind
         WHEN 'a' THEN 'agg'
         WHEN 'w' THEN 'window'
         WHEN 'p' THEN 'proc'
         ELSE 'func'
       END as "Type"
FROM pg_catalog.pg_proc p
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
AND n.nspname <> 'pg_catalog'
AND n.nspname <> 'information_schema'
ORDER BY 1, 2, 4;
```







## Welcome to Apache Cloudberry:

- **Visit the website:** https://cloudberry.apache.org
- **Follow us on GitHub:** https://github.com/apache/cloudberry
- **Join Slack workspace:** https://apache-cloudberry.slack.com
- **Dev mailing list:**
  - To subscribe to dev mailing list: Send an email to `dev-subscribe@cloudberry.apache.org`
  - To browse past dev mailing list discussions: https://lists.apache.org/list.html?dev@cloudberry.apache.org
