<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maya S.</title>
    <description>The latest articles on DEV Community by Maya S. (@maya_sun_29e7bf629e5dd7b3).</description>
    <link>https://dev.to/maya_sun_29e7bf629e5dd7b3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3610700%2Fc605f627-c79a-4346-b78e-7c6d5c27603b.jpg</url>
      <title>DEV Community: Maya S.</title>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maya_sun_29e7bf629e5dd7b3"/>
    <language>en</language>
    <item>
      <title>Apache Cloudberry 2.0: Rebuilding Storage for the Cloud-Native Era with PAX</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:37:44 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/apache-cloudberry-20-rebuilding-storage-for-the-cloud-native-era-with-pax-5a9b</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/apache-cloudberry-20-rebuilding-storage-for-the-cloud-native-era-with-pax-5a9b</guid>
      <description>&lt;h2&gt;
  
  
  Rethinking AOCS: When Architecture Meets a New Infrastructure Reality
&lt;/h2&gt;

&lt;h2&gt;
  
  
  From a Solid Design to a Structural Mismatch
&lt;/h2&gt;

&lt;p&gt;The AO/AOCS storage engine, inherited from Greenplum, was originally built for on-premises environments. Its design—column-per-file with append-only writes—worked well on block storage and traditional file systems, delivering stable performance for OLAP workloads.&lt;br&gt;
But the infrastructure landscape has changed.&lt;br&gt;
As storage shifts toward cloud-native object storage, the assumptions behind AOCS no longer hold. Object storage favors large, sequential I/O and request aggregation, while AOCS relies on independent column files and frequent small appends. The result is not just inefficiency—it is a structural mismatch.&lt;br&gt;
In real-world workloads, this manifests as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exploding request counts when scanning wide tables (one request per column per file)&lt;/li&gt;
&lt;li&gt;Severe request amplification due to unmerged small writes&lt;/li&gt;
&lt;li&gt;Degraded sequential read performance caused by fragmented column layouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the same time, tight kernel coupling and limited thread safety make it difficult to fully leverage multi-threading and vectorized execution. What used to be a reasonable design has become a constraint, not just on performance but on the system’s ability to evolve.&lt;/p&gt;

&lt;h2&gt;Why Incremental Fixes Were Not Enough&lt;/h2&gt;

&lt;p&gt;Extensive stress testing revealed a clear pattern: the bottleneck was not localized; it was systemic. Tuning parameters, improving caches, and adding execution-layer optimizations helped, but only marginally. The core issue remained: the storage model itself was not aligned with the cloud environment. Continuing to patch AOCS would only add more layers of complexity and technical debt.&lt;/p&gt;

&lt;p&gt;The conclusion was straightforward: instead of adapting a legacy design, Cloudberry needed a storage engine built for object storage from the ground up. This led to the introduction of PAX.&lt;/p&gt;

&lt;h2&gt;
  
  
  PAX: A Storage Model Designed for the Cloud
&lt;/h2&gt;

&lt;p&gt;PAX is not just a replacement for AOCS. It is a redefinition of how storage should work in a cloud-native data warehouse, balancing analytical performance, transactional needs, and long-term evolvability.&lt;/p&gt;

&lt;h3&gt;A New Paradigm: Row–Column Co-existence&lt;/h3&gt;

&lt;p&gt;Traditional database systems force a trade-off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row storage → optimized for transactions&lt;/li&gt;
&lt;li&gt;Column storage → optimized for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PAX removes this dichotomy. Within the same physical file and logical block, PAX organizes data in a columnar layout while preserving row-level access semantics. This hybrid design enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient analytical scans that read only the required columns&lt;/li&gt;
&lt;li&gt;Merged multi-column writes that reduce small-file pressure on object storage&lt;/li&gt;
&lt;li&gt;Shared file structures across columns, significantly reducing request overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a storage model that performs consistently across mixed OLTP + OLAP workloads, which are increasingly common on modern data platforms.&lt;/p&gt;
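&lt;p&gt;To make the hybrid layout concrete, here is a minimal, illustrative Python sketch of a PAX-style block: rows are appended together but stored column-major inside one physical unit, so both row reconstruction and single-column scans stay cheap. The class and method names are invented for illustration; they are not Cloudberry internals.&lt;/p&gt;

```python
# Illustrative sketch only: a toy PAX-style block that stores rows
# column-major inside a single physical unit. Names are invented for
# illustration, not Cloudberry internals.
class PaxBlock:
    def __init__(self, columns):
        self.columns = columns                 # column order
        self.data = {c: [] for c in columns}   # one array per column

    def append_row(self, row):
        # Rows arrive together (one write) but land in per-column arrays,
        # so multi-column writes are merged into one physical unit.
        for c in self.columns:
            self.data[c].append(row[c])

    def scan_column(self, name):
        # Analytical path: touch only the required column.
        return self.data[name]

    def get_row(self, i):
        # Row-level path: reassemble row i from every column array.
        return {c: self.data[c][i] for c in self.columns}

block = PaxBlock(["id", "age", "city"])
block.append_row({"id": 1, "age": 34, "city": "Oslo"})
block.append_row({"id": 2, "age": 17, "city": "Lima"})

assert block.scan_column("age") == [34, 17]
assert block.get_row(1) == {"id": 2, "age": 17, "city": "Lima"}
```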

&lt;h2&gt;
  
  
  A Layered Architecture Built for Evolution
&lt;/h2&gt;

&lt;p&gt;PAX adopts a strictly layered design to ensure modularity and long-term extensibility:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc9biw35e9oxnz3lx8cw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc9biw35e9oxnz3lx8cw.png" alt=" " width="460" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Handler Layer:&lt;/strong&gt; integrates with Cloudberry’s Access Method (AM), handling transactions and lifecycle management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Layer:&lt;/strong&gt; bridges execution engines and storage, supporting both row-based and vectorized execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MicroPartition Layer:&lt;/strong&gt; manages physical data organization (files and stripes), including statistics and pruning logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column Layer:&lt;/strong&gt; defines in-memory column structures, handling encoding, decoding, and alignment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Layer:&lt;/strong&gt; encapsulates storage interactions, including data files, metadata, and visibility maps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation of concerns allows PAX to evolve independently at each layer, paving the way for features like multi-threaded execution and distributed transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Management: A Lightweight Control Plane for Storage
&lt;/h2&gt;

&lt;p&gt;PAX adopts a lightweight yet effective metadata management strategy based on auxiliary tables built on the Heap Access Method (Heap AM).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jb1nvuagnqwl5abqr2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jb1nvuagnqwl5abqr2s.png" alt=" " width="593" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each physical data file corresponds to a single record in the auxiliary table. This mapping provides a consistent control plane for storage, enabling the engine to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quickly locate data files&lt;/li&gt;
&lt;li&gt;Track file lifecycle changes&lt;/li&gt;
&lt;li&gt;Evaluate transactional visibility efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The auxiliary table maintains essential metadata such as file identifiers, states, and visibility-related attributes, ensuring that storage operations remain both predictable and low-overhead. In addition, PAX maintains a global fast sequence table to generate unique BLOCKNAMEs, guaranteeing globally unique file naming across nodes and transactions. More importantly, this mechanism serves as the foundation for associating Visimap files with their corresponding data files, ensuring correctness and consistency in distributed visibility control.&lt;/p&gt;
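&lt;p&gt;The bookkeeping described above can be sketched as a toy model: one metadata record per data file, plus a global counter standing in for the fast sequence table. All field and class names here are hypothetical, chosen only to illustrate the control-plane idea.&lt;/p&gt;

```python
# Toy sketch of the metadata bookkeeping described above: one record
# per data file, plus a global counter standing in for the fast
# sequence table. Field and class names are hypothetical.
import itertools

class FastSequence:
    """Hands out unique BLOCKNAME-style identifiers."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_blockname(self, table_oid):
        # Globally unique within this toy: table oid + monotonic id.
        return f"{table_oid}_{next(self._counter)}"

class AuxTable:
    """One metadata record per physical data file."""
    def __init__(self):
        self.records = {}

    def register_file(self, blockname, tuple_count):
        self.records[blockname] = {
            "state": "live",
            "tuples": tuple_count,
            "visimap": None,    # attached only if rows get deleted
        }

    def attach_visimap(self, blockname, visimap_file):
        self.records[blockname]["visimap"] = visimap_file

seq = FastSequence()
aux = AuxTable()
name = seq.next_blockname(table_oid=16384)
aux.register_file(name, tuple_count=10_000)
aux.attach_visimap(name, name + ".visimap")

assert name == "16384_1"
assert aux.records[name]["visimap"] == "16384_1.visimap"
```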




&lt;h3&gt;Rethinking MVCC for Object Storage&lt;/h3&gt;

&lt;p&gt;Traditional MVCC in PostgreSQL relies on row-level versioning. On object storage, that approach becomes prohibitively expensive due to excessive I/O and metadata operations. PAX instead introduces a file-level visibility model: rather than tracking visibility per row, it uses Visimap (.visimap) files to represent visibility at the file level. This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock-free concurrent reads&lt;/li&gt;
&lt;li&gt;Minimal metadata overhead&lt;/li&gt;
&lt;li&gt;Efficient visibility checks at read time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a fundamental shift that aligns concurrency control with the realities of object storage.&lt;/p&gt;
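&lt;p&gt;A toy stand-in for the file-level visibility check might look like the following. The real Visimap is a compact on-disk bitmap with its own format; this sketch only shows the access pattern: readers consult a per-file map instead of per-row version metadata.&lt;/p&gt;

```python
# Toy stand-in for PAX's file-level visibility (not the real on-disk
# format): each data file may carry one Visimap that hides deleted
# rows; readers consult it without locking the data file itself.
class Visimap:
    def __init__(self):
        # Row offsets hidden by this map. A real visimap is a compact
        # bitmap; a set keeps the sketch short.
        self.deleted = set()

    def mark_deleted(self, row):
        self.deleted.add(row)

    def is_visible(self, row):
        return row not in self.deleted

vm = Visimap()
vm.mark_deleted(42)
assert not vm.is_visible(42)
assert vm.is_visible(41)
# A file with no deletions needs no visimap at all, so a single
# metadata lookup replaces per-row version checks.
```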

&lt;h2&gt;
  
  
  PORC_VEC: When Storage Becomes Execution
&lt;/h2&gt;

&lt;p&gt;One of the most impactful innovations in PAX is PORC_VEC (PostgreSQL ORC Vectorized).&lt;br&gt;
In traditional systems, data must be transformed into a vectorized format before execution—incurring CPU and memory overhead. PORC_VEC eliminates this step entirely.&lt;br&gt;
Key characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-copy reads: data is consumed directly by the execution engine&lt;/li&gt;
&lt;li&gt;Cache-aligned layout: optimized for modern CPU architectures&lt;/li&gt;
&lt;li&gt;Unified metadata model: aligned with in-memory column structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to a powerful principle: the storage format is the execution format. In internal tests, PORC_VEC reduced CPU usage by roughly 20% and improved query throughput by 15–25%.&lt;/p&gt;
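&lt;p&gt;The zero-copy idea can be illustrated in a few lines of Python, with the caveat that the actual PORC_VEC format is far more elaborate: when the on-disk column layout already matches the in-memory vector layout, the executor can view the bytes directly instead of re-materializing them into a separate vectorized structure.&lt;/p&gt;

```python
# A sketch of the zero-copy idea behind PORC_VEC, not the real format:
# if the on-disk column layout already matches the in-memory vector
# layout, the executor can view the bytes directly instead of
# re-materializing them.
import array

# Pretend this buffer came straight out of a PAX file: a contiguous,
# native-endian run of 64-bit integers for one column.
on_disk = array.array("q", [10, 20, 30, 40]).tobytes()

# "Load" the column as a typed view over the same bytes: no copy made.
col = memoryview(on_disk).cast("q")

assert list(col) == [10, 20, 30, 40]
assert col.obj is on_disk   # still backed by the original buffer
```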

&lt;h2&gt;
  
  
  Column Layer: Bridging Storage and Execution
&lt;/h2&gt;

&lt;p&gt;The Column layer serves as the core in-memory abstraction for columnar data in PAX, bridging persistent storage and the execution engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lhv8557sqlnn3qr46tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lhv8557sqlnn3qr46tw.png" alt=" " width="441" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is responsible for both data representation and transformation, with a design centered on efficiency, flexibility, and alignment with vectorized execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk-to-Memory Mapping
Column loads column data from disk into memory and flushes in-memory data back to storage during write operations.&lt;/li&gt;
&lt;li&gt;Format Transformation
It performs efficient format conversion along read and write paths, ensuring consistency between on-disk and in-memory layouts while minimizing overhead.&lt;/li&gt;
&lt;li&gt;Encoding and Compression
Multiple techniques—such as RLEv2, dictionary encoding, and ZSTD—are integrated to reduce storage footprint without sacrificing query performance.&lt;/li&gt;
&lt;li&gt;Flexible Access Interfaces

&lt;ul&gt;
&lt;li&gt;Row-level interfaces for transactional workloads&lt;/li&gt;
&lt;li&gt;Batch-oriented interfaces for analytical and vectorized execution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Memory Alignment and Complex Type Optimization

&lt;ul&gt;
&lt;li&gt;Memory layout follows CPU cache alignment principles to improve access efficiency&lt;/li&gt;
&lt;li&gt;Complex types (e.g., arrays and range types) adopt independent alignment and offset control to reduce parsing overhead
With these design choices, the Column layer balances performance, memory efficiency, and concurrency scalability, while providing a solid foundation for vectorized execution and parallel scanning.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Foundations: Four Key Mechanisms
&lt;/h2&gt;

&lt;p&gt;PAX’s performance gains are not accidental—they are the result of deliberate architectural choices.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse Filtering.&lt;/strong&gt; By maintaining min/max statistics and Bloom filters at the file and stripe levels, PAX can aggressively prune irrelevant data. For example, a query with WHERE age &amp;lt; 18 skips entire data blocks where min(age) &amp;gt;= 18. This reduces I/O requests by over 60% on average, bringing object-storage performance closer to that of in-memory systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intelligent Physical Layout (Cluster).&lt;/strong&gt; PAX aligns the physical data layout with query patterns through automatic clustering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Z-Order → optimized for multi-dimensional range queries&lt;/li&gt;
&lt;li&gt;Lexical Order → optimized for multi-column filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This improves data locality and significantly reduces random I/O.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Modern Memory Management.&lt;/strong&gt; PAX evolved through three stages of memory management, ultimately adopting smart pointers (unique_ptr, shared_ptr) and thread-aware resource management. This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No memory leaks under high concurrency&lt;/li&gt;
&lt;li&gt;Safe cleanup during early exits or failures&lt;/li&gt;
&lt;li&gt;Stable behavior in multi-threaded vectorized execution&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
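&lt;p&gt;The sparse-filtering mechanism can be sketched in a few lines. This is a simplified model of block-level min/max pruning only; real PAX also keeps Bloom filters and stripe-level statistics, and the block layout here is invented for illustration.&lt;/p&gt;

```python
# Minimal sketch of min/max sparse filtering as described above
# (real PAX also keeps Bloom filters and stripe-level statistics;
# this block layout is invented for illustration).
blocks = [
    {"id": 0, "min_age": 19, "max_age": 60, "ages": [19, 35, 60]},
    {"id": 1, "min_age": 5,  "max_age": 17, "ages": [5, 12, 17]},
    {"id": 2, "min_age": 30, "max_age": 45, "ages": [30, 45]},
]

def scan_age_below(blocks, limit):
    """Evaluate 'WHERE age below limit' with block-level pruning."""
    hits, blocks_read = [], 0
    for b in blocks:
        if b["min_age"] >= limit:
            continue               # whole block pruned: no I/O at all
        blocks_read += 1           # block must actually be fetched
        hits.extend(a for a in b["ages"] if limit > a)
    return hits, blocks_read

hits, blocks_read = scan_age_below(blocks, 18)
assert hits == [5, 12, 17]
assert blocks_read == 1            # blocks 0 and 2 were never read
```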

&lt;h2&gt;
  
  
  Benchmark Results: Quantifying the Gains
&lt;/h2&gt;

&lt;p&gt;In 1TB TPC-H and TPC-DS benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average performance improvement: 15%–25%&lt;/li&gt;
&lt;li&gt;Complex queries (joins, aggregations): up to 40% faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These gains come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced I/O amplification&lt;/li&gt;
&lt;li&gt;Lower CPU overhead via zero-copy execution&lt;/li&gt;
&lt;li&gt;More stable latency under complex workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing Thoughts: Engineering for the Real World
&lt;/h2&gt;

&lt;p&gt;PAX reflects a deliberate shift in engineering philosophy:&lt;br&gt;
Not optimizing around constraints—but removing them.&lt;br&gt;
By aligning storage design with object storage characteristics, and tightly integrating execution with data format, PAX establishes a foundation that is both high-performance and future-proof.&lt;br&gt;
Looking ahead, Cloudberry will continue to evolve PAX with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delta storage for incremental updates&lt;/li&gt;
&lt;li&gt;Deeper optimizer integration&lt;/li&gt;
&lt;li&gt;SIMD-accelerated execution&lt;/li&gt;
&lt;li&gt;Adaptive, self-tuning statistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All with a single goal: to make Cloudberry a continuously evolving data platform.&lt;/p&gt;

&lt;p&gt;Welcome to Apache Cloudberry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit the website: &lt;a href="https://cloudberry.apache.org" rel="noopener noreferrer"&gt;https://cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on GitHub: &lt;a href="https://github.com/apache/cloudberry" rel="noopener noreferrer"&gt;https://github.com/apache/cloudberry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join Slack workspace: &lt;a href="https://apache-cloudberry.slack.com" rel="noopener noreferrer"&gt;https://apache-cloudberry.slack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev mailing list:

&lt;ul&gt;
&lt;li&gt;To subscribe to dev mailing list: Send an email to &lt;a href="mailto:dev-subscribe@cloudberry.apache.org"&gt;dev-subscribe@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To browse past dev mailing list discussions: &lt;a href="https://lists.apache.org/list.html?dev@cloudberry.apache.org" rel="noopener noreferrer"&gt;https://lists.apache.org/list.html?dev@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>apachecloudberry</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Rethinking Stream-Batch Unification: Real-Time Processing with Incremental Materialized Views in Apache Cloudberry</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Tue, 16 Dec 2025 09:34:04 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/rethinking-stream-batch-unification-real-time-processing-with-incremental-materialized-views-in-3b4d</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/rethinking-stream-batch-unification-real-time-processing-with-incremental-materialized-views-in-3b4d</guid>
      <description>&lt;p&gt;Apache Cloudberry is an advanced and mature open-source Massively Parallel Processing (MPP) database, derived from the open-source version of the Pivotal Greenplum Database® but built on a more modern PostgreSQL kernel and with more advanced enterprise capabilities. Cloudberry can serve as a data warehouse and can also be used for large-scale analytics and AI/ML workloads.&lt;/p&gt;

&lt;p&gt;In today’s data-driven landscape, “real-time” capabilities have become a business imperative. Every company wants to detect changes instantly and respond to user needs as they happen. Streaming engines such as Apache Flink, with their powerful capabilities and ultra-low latency, have set a compelling vision for what real-time data processing can achieve.&lt;br&gt;
Yet the reality is often far more complicated. For many organizations — especially those without large, specialized engineering teams — building and maintaining a Flink-based, stream-batch unified platform can be both powerful and painful. You gain real-time insights, but only by accepting significant architectural complexity and operational overhead.&lt;br&gt;
Is there a simpler, more elegant path to stream-batch unification?&lt;br&gt;
 Yes — and it has become increasingly practical.&lt;br&gt;
With the rise of modern database technologies, solutions such as Incremental Materialized Views (IVM) in Apache Cloudberry are emerging as a cleaner, lighter alternative: in-database stream processing.&lt;/p&gt;




&lt;h2&gt;The “Heavyweight” Approach: The Power and Pain of Flink&lt;/h2&gt;

&lt;p&gt;A Flink-centered architecture is undoubtedly powerful, but it also comes with several burdens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Complex architecture and high operational costs
A typical pipeline stitches together many components — applications, MySQL, CDC tools, Kafka, Flink, and a data warehouse or data lake. Each component requires specialized expertise, and a failure in any part can break the entire chain.&lt;/li&gt;
&lt;li&gt;High development overhead
In the classic Lambda architecture, teams must maintain two separate codebases — one for streaming (Flink) and one for batch (Spark or Hive). That means double the logic, double the testing, and a persistent risk of inconsistency.&lt;/li&gt;
&lt;li&gt;Steep learning curve
Mastering Flink is non-trivial. State management, time semantics, watermarks, windowing, and performance tuning demand deep expertise and continuous operational effort — something many teams cannot afford.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Simplifying Stream-Batch Processing Inside the Database&lt;/h2&gt;

&lt;p&gt;Cloudberry takes a bold yet simple approach: why not let the database itself handle streaming computation? This is the essence of in-database stream-batch unification, powered by Incremental Materialized Views (IVM). An IVM functions as a “live” materialized result that automatically stays up to date.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch phase:&lt;/strong&gt; when you run a CREATE INCREMENTAL MATERIALIZED VIEW command, Cloudberry performs a full historical computation to build the initial view (the batch layer).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream phase:&lt;/strong&gt; subsequent INSERT, UPDATE, and DELETE operations on the source tables are captured automatically. The engine computes only the incremental changes and updates the view in near real time, typically within milliseconds to seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fundamentally simplifies what used to be a complex and error-prone workflow. Previously, teams had to define Kafka message schemas and Flink-specific data structures, and write large amounts of Flink SQL (covering data sources, windows, aggregations, dimension joins, and output tables) just to complete a single task. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Kafka data structure
{
  "sales_id": 8435,
  "event_type": "+I",
  "event_time": "2025-06-27 07:53:21Z",
  "ticket_number": 8619628,
  "item_sk": 6687,
  "customer_sk": 69684,
  "store_sk": 238,
  "quantity": 6,
  "sales_price": 179.85,
  "ext_sales_price": 1079.1,
  "net_profit": 672,
  "event_source": "CDC-TO-KAFKA-FIXED"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Before Flink can process streaming data, the data must be persisted to ensure correctness and support replay in case of failures, so the CDC → Kafka → Flink path always introduces additional transformation, configuration, and operational complexity. The following Flink SQL illustrates only the streaming-computation portion; the full pipeline requires even more components and code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Create the TPC-DS store performance aggregation result output table (output to console)
CREATE TABLE store_daily_performance (
    window_start TIMESTAMP(3),
    window_end TIMESTAMP(3),
    s_store_sk INT,
    s_store_name STRING,
    s_state STRING,
    s_market_manager STRING,
    sale_date STRING,
    total_sales_amount DECIMAL(10,2),
    total_net_profit DECIMAL(10,2),
    total_items_sold BIGINT,
    transaction_count BIGINT,
    avg_sales_price DECIMAL(7,2),
    process_time TIMESTAMP_LTZ(3)
) WITH (
    'connector' = 'print',
    'print-identifier' = 'TPCDS-STORE-PERFORMANCE'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;-- Core aggregation query
INSERT INTO store_daily_performance
SELECT
    window_start,
    window_end,
    s.ss_store_sk,
    COALESCE(sd.s_store_name, CONCAT('Store #', CAST(s.ss_store_sk AS STRING))) AS s_store_name,
    COALESCE(sd.s_state, 'Unknown') AS s_state,
    COALESCE(sd.s_market_manager, 'Unknown Manager') AS s_market_manager,
    DATE_FORMAT(window_start, 'yyyy-MM-dd') AS sale_date,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_ext_sales_price
             WHEN s.event_type = '-D' THEN -s.ss_ext_sales_price
             ELSE 0 END) AS total_sales_amount,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_net_profit
             WHEN s.event_type = '-D' THEN -s.ss_net_profit
             ELSE 0 END) AS total_net_profit,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_quantity
             WHEN s.event_type = '-D' THEN -s.ss_quantity
             ELSE 0 END) AS total_items_sold,
    COUNT(DISTINCT s.ss_ticket_number) AS transaction_count,
    AVG(s.ss_sales_price) AS avg_sales_price,
    LOCALTIMESTAMP AS process_time
FROM TABLE(
    TUMBLE(TABLE sales_events_source, DESCRIPTOR(event_time), INTERVAL '1' MINUTE)
) s
LEFT JOIN store_dim sd ON s.ss_store_sk = sd.s_store_sk
WHERE s.event_type IN ('+I', '-D', 'U')
GROUP BY
    window_start,
    window_end,
    s.ss_store_sk,
    sd.s_store_name,
    sd.s_state,
    sd.s_market_manager;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;By contrast, Cloudberry IVM can express the same task in a single SQL statement:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE INCREMENTAL MATERIALIZED VIEW tpcds.store_daily_performance_enriched_ivm
AS
SELECT
    ss.ss_store_sk AS store,
    s.s_store_name AS store_name,
    s.s_state AS state,
    s.s_market_manager AS manager,
    d.d_date AS sold_date,
    SUM(ss.ss_net_paid_inc_tax) AS total_sales_amount,
    SUM(ss.ss_net_profit) AS total_net_profit,
    SUM(ss.ss_quantity) AS total_items_sold,
    COUNT(ss.ss_ticket_number) AS transaction_count
FROM
    tpcds.store_sales_heap ss
JOIN
    tpcds.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN
    tpcds.store s ON ss.ss_store_sk = s.s_store_sk
GROUP BY
    ss.ss_store_sk,
    s.s_store_name,
    s.s_state,
    s.s_market_manager,
    d.d_date
DISTRIBUTED BY (ss_store_sk);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All of the complexity (state management, consistency handling, incremental computation, scheduling, and triggering) is handled transparently by the database engine. This eliminates the need to orchestrate numerous intermediate streaming jobs and significantly reduces development and operational costs.&lt;/p&gt;
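&lt;p&gt;What the engine does under the hood can be sketched with a toy delta-maintenance loop, simplified to one SUM and one COUNT grouped by store. A real IVM engine also handles joins, other aggregates, and transactional consistency; the function and field names below are illustrative, not Cloudberry internals.&lt;/p&gt;

```python
# A hedged sketch of incremental view maintenance for a view like the
# one above, simplified to one SUM and one COUNT grouped by store.
# Real IVM also handles joins, other aggregates, and transactional
# consistency; names here are illustrative.
from collections import defaultdict

view = defaultdict(lambda: {"total_sales_amount": 0.0,
                            "transaction_count": 0})

def apply_change(view, op, row):
    # Apply only the delta of one INSERT or DELETE; a DELETE simply
    # reverses the contribution the row made on INSERT.
    sign = 1 if op == "INSERT" else -1
    group = view[row["store_sk"]]
    group["total_sales_amount"] += sign * row["net_paid_inc_tax"]
    group["transaction_count"] += sign

apply_change(view, "INSERT", {"store_sk": 238, "net_paid_inc_tax": 179.85})
apply_change(view, "INSERT", {"store_sk": 238, "net_paid_inc_tax": 20.15})
apply_change(view, "DELETE", {"store_sk": 238, "net_paid_inc_tax": 20.15})

assert view[238]["transaction_count"] == 1
assert round(view[238]["total_sales_amount"], 2) == 179.85
```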

&lt;h2&gt;The Perfect Pair: Incremental Materialized Views and Dynamic Tables&lt;/h2&gt;

&lt;p&gt;Cloudberry also provides another mechanism: Dynamic Tables. While both are types of materialized views, they serve different purposes depending on latency requirements and workload characteristics. In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose IVM when you need low latency and immediate updates.&lt;/li&gt;
&lt;li&gt;Choose Dynamic Tables when you can tolerate some delay and need to handle large datasets efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Practical Considerations: Performance and Limitations&lt;/h2&gt;

&lt;p&gt;No technology is perfect, and IVM is no exception.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance overhead: because IVMs update incrementally on every write, they add some transactional overhead to source tables, especially when multiple IVMs depend on the same table.&lt;/li&gt;
&lt;li&gt;Feature limitations: the current version of Cloudberry IVM does not yet support MIN, MAX, window functions, LEFT/OUTER JOIN, CTEs, or partitioned tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These gaps are actively being addressed by the open-source community.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For top-tier internet companies, investing in large, Flink-based infrastructures makes sense: they can absorb the complexity in pursuit of maximum flexibility and performance. Most organizations, however, do not need heavyweight systems. They need a simple, reliable, and cost-effective way to gain real-time insights.&lt;br&gt;
Cloudberry’s Incremental Materialized Views provide exactly that: a unified stream-batch processing model built directly into the database, powered by plain SQL, with consistency, simplicity, and efficiency in a single system. This may well be the most practical path to bringing real-time data capabilities to every enterprise.&lt;/p&gt;

&lt;p&gt;Welcome to Apache Cloudberry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit the website: &lt;a href="https://cloudberry.apache.org" rel="noopener noreferrer"&gt;https://cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on GitHub: &lt;a href="https://github.com/apache/cloudberry" rel="noopener noreferrer"&gt;https://github.com/apache/cloudberry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join Slack workspace: &lt;a href="https://apache-cloudberry.slack.com" rel="noopener noreferrer"&gt;https://apache-cloudberry.slack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev mailing list:

&lt;ul&gt;
&lt;li&gt;To subscribe to dev mailing list: Send an email to &lt;a href="mailto:dev-subscribe@cloudberry.apache.org"&gt;dev-subscribe@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To browse past dev mailing list discussions: &lt;a href="https://lists.apache.org/list.html?dev@cloudberry.apache.org" rel="noopener noreferrer"&gt;https://lists.apache.org/list.html?dev@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Migrate the legacy Greenplum to Apache Cloudberry with cbcopy</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Tue, 16 Dec 2025 09:33:15 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/migrate-the-legacy-greenplum-to-apache-cloudberry-with-cbcopy-2fhf</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/migrate-the-legacy-greenplum-to-apache-cloudberry-with-cbcopy-2fhf</guid>
      <description>&lt;p&gt;In the field of data warehousing and big data analytics, Greenplum Database has long been recognized as a leading open-source Massively Parallel Processing (MPP) database. However, since Greenplum transitioned to a closed-source model, users have increasingly encountered limitations in areas such as version upgrades, bug fixes, and feature extensions. Against this backdrop, Apache Cloudberry emerged.&lt;/p&gt;

&lt;p&gt;As an open-source derivative of Greenplum, Apache Cloudberry is highly compatible with Greenplum’s architecture and SQL syntax while providing comprehensive enhancements in functionality, performance, and security. Apache Cloudberry has quickly become the most promising open-source alternative to Greenplum.&lt;/p&gt;

&lt;p&gt;Beyond supporting more efficient parallel query execution and advanced resource management, the Apache Cloudberry community also introduces a dedicated data loading and migration tool — cbcopy — which enables seamless and highly efficient migration from Greenplum to Cloudberry.&lt;/p&gt;

&lt;p&gt;This article provides an in-depth overview of cbcopy — its features, internal mechanisms, and practical usage — followed by a complete case study demonstrating how to perform a fast and seamless migration from Greenplum to Apache Cloudberry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to cbcopy
&lt;/h2&gt;

&lt;p&gt;cbcopy is a data migration utility designed to transfer data across different database clusters. It can quickly replicate both metadata and actual data from a Greenplum cluster to an Apache Cloudberry cluster. The tool supports full migration of database objects, including schemas, tables, indexes, views, roles, user-defined functions, resource queues, and resource groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Levels of Migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;cbcopy supports four levels of database object migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster-level migration – migrates the entire source cluster to the target cluster.&lt;/li&gt;
&lt;li&gt;Database-level migration – migrates a specific database from the source cluster to the target cluster.&lt;/li&gt;
&lt;li&gt;Schema-level migration – migrates a specified schema within a database from the source cluster to a target database.&lt;/li&gt;
&lt;li&gt;Table-level migration – migrates specific tables from the source cluster to the target cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Support for Different Cluster Scales&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;cbcopy can handle migrations between clusters with different numbers of compute nodes (segments):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The source and target clusters have the same number of segments.&lt;/li&gt;
&lt;li&gt;The source cluster has fewer segments than the target cluster.&lt;/li&gt;
&lt;li&gt;The source cluster has more segments than the target cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  cbcopy Mechanism
&lt;/h3&gt;

&lt;p&gt;The cbcopy utility is implemented using the COPY ON SEGMENT TO PROGRAM mechanism and the external table feature. It employs data compression during transmission to reduce network resource usage and uses checksum verification to ensure data consistency between clusters.&lt;/p&gt;

&lt;p&gt;To maximize migration performance in distributed environments, cbcopy automatically applies one of two migration strategies depending on the size of the source table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small tables (default threshold: fewer than 1,000,000 rows): data is transferred via a direct connection between the source cluster’s master node and the target cluster’s coordinator node.&lt;/li&gt;
&lt;li&gt;Large tables (1,000,000 rows or more): cbcopy launches helper processes on the segments of both the source and target clusters. These helpers establish direct connections and perform parallel data transfers between segments to achieve higher throughput.&lt;/li&gt;
&lt;/ul&gt;
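&lt;p&gt;The routing and integrity checks described above can be modeled with a short sketch. The 1,000,000-row threshold comes from this article; the function names and wire format are invented for illustration and are not cbcopy internals.&lt;/p&gt;

```python
# Toy model of the behavior described above. The 1,000,000-row
# threshold comes from the article; the function names and wire
# format are invented for illustration, not cbcopy internals.
import zlib

SMALL_TABLE_THRESHOLD = 1_000_000   # rows (cbcopy's default cutoff)

def pick_strategy(row_count):
    # Small tables transfer from the source master to the target
    # coordinator; large ones transfer segment to segment in parallel.
    if row_count >= SMALL_TABLE_THRESHOLD:
        return "segment-to-segment"
    return "master-to-coordinator"

def send_chunk(payload):
    # Compress on the wire and attach a checksum for verification.
    return zlib.compress(payload), zlib.crc32(payload)

def recv_chunk(compressed, checksum):
    payload = zlib.decompress(compressed)
    assert zlib.crc32(payload) == checksum, "cross-cluster data mismatch"
    return payload

assert pick_strategy(10_000) == "master-to-coordinator"
assert pick_strategy(5_000_000) == "segment-to-segment"
wire, crc = send_chunk(b"row data" * 100)
assert recv_chunk(wire, crc) == b"row data" * 100
```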

&lt;h2&gt;
  
  
  cbcopy in Practice
&lt;/h2&gt;

&lt;p&gt;In this section, we demonstrate how to use cbcopy to migrate data from a Greenplum 6 database cluster to an Apache Cloudberry cluster in a test environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Test Environment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source: Greenplum Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IP Address&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;192.168.194.55&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Greenplum 6.27.1&lt;/td&gt;
&lt;td&gt;Master&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.197.120&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Greenplum 6.27.1&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.192.215&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Greenplum 6.27.1&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Target: Apache Cloudberry Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IP Address&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;192.168.194.137&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Coordinator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.192.93&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.196.69&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Test Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The warehouse database in the Greenplum 6 cluster contains simulated “banking data warehouse” test data, including both transactional and historical tables.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IP Address&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;192.168.194.137&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Coordinator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.192.93&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.196.69&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## 2. Migration Preparation&lt;/span&gt;

Before running cbcopy, verify that the file &lt;span class="sb"&gt;`/usr/local/greenplum-db/bin/gpcopy_helper`&lt;/span&gt; exists on each Greenplum 6 node.

If it does not exist, copy the cbcopy_helper binary from the Cloudberry cluster to the Greenplum nodes as follows:

&lt;span class="gs"&gt;**On the Cloudberry coordinator node (192.168.194.137):**&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
su - root&lt;br&gt;
scp /usr/local/cloudberry-db/bin/cbcopy_helper 192.168.194.55:/usr/local/greenplum-db/bin/&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**On the Greenplum 6 master node (192.168.194.55):**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
su - root&lt;br&gt;
source /usr/local/greenplum-db/greenplum_path.sh&lt;br&gt;
cd /usr/local/greenplum-db/bin&lt;br&gt;
gpscp -f /home/gpadmin/hostfile_all cbcopy_helper =:$PWD/&lt;br&gt;
&lt;/p&gt;
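&lt;p&gt;Before moving on, it is worth confirming that the helper binary actually landed on every host. The loop below is illustrative, not part of cbcopy; &lt;code&gt;REMOTE_CMD&lt;/code&gt; is a hypothetical hook that defaults to &lt;code&gt;ssh&lt;/code&gt; so the check can also be dry-run locally.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Illustrative check that cbcopy_helper is present and executable on each
# host listed in a hostfile. REMOTE_CMD is a hypothetical override hook.
REMOTE_CMD=${REMOTE_CMD:-ssh}

check_helper() {
  hostfile=$1
  path=$2
  for host in $(cat "$hostfile"); do
    if "$REMOTE_CMD" "$host" test -x "$path"; then
      echo "$host: OK"
    else
      echo "$host: MISSING"
    fi
  done
}

# Usage on the Greenplum 6 master:
#   check_helper /home/gpadmin/hostfile_all /usr/local/greenplum-db/bin/cbcopy_helper
```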

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
---

## 3. Data Migration

Execute the cbcopy command from the Apache Cloudberry coordinator node to migrate data. cbcopy supports full cluster, database-level, schema-level, or table-level migration.

Migration logs are saved under `/home/gpadmin/gpAdminLogs` on the execution node.

### 3.1 Full Cluster Migration

Migrate all databases (in this case, dw and warehouse) from the Greenplum 6 cluster to the Cloudberry cluster:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --full --truncate --compression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.2 Database-Level Migration

Migrate only the warehouse database:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --dbname="warehouse" --truncate --compression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.3 Schema-Level Migration

Migrate the sh1 schema in the warehouse database:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --truncate --compression --schema=warehouse.sh1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.4 Table-Level Migration

Migrate the warehouse.public.cancel_accounts table:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --truncate --compression --include-table="warehouse.public.cancel_accounts"&lt;br&gt;
&lt;/p&gt;
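&lt;p&gt;The four invocations above differ only in the flag that sets the migration scope. A small wrapper makes that symmetry explicit; &lt;code&gt;build_cbcopy_cmd&lt;/code&gt; is a hypothetical convenience function, and the connection flags are the ones used in the examples above.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: assembles the cbcopy command line used above,
# varying only the scope flag (--full, --dbname, --schema, --include-table).
build_cbcopy_cmd() {
  scope_flag=$1
  echo "cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin" \
       "--dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin" \
       "--truncate --compression $scope_flag"
}

build_cbcopy_cmd --full
build_cbcopy_cmd --schema=warehouse.sh1
```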

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.5 Common Parameters

| Parameter | Description |
| :--- | :--- |
| `--source-host` | Hostname or IP address of the source database master |
| `--source-port` | Port number of the source master |
| `--source-user` | User ID for connecting to the source database |
| `--dest-host` | Hostname or IP address of the target coordinator |
| `--dest-port` | Port number of the target coordinator |
| `--dest-user` | User ID for connecting to the target database |
| `--full` | Migrate the entire cluster; cannot be used with `--dbname`, `--include-table`, etc. |
| `--dbname` | Comma-separated list of source databases to copy |
| `--schema` | Comma-separated list of schemas to copy (format: `database.schema`) |
| `--include-table` | Comma-separated list of tables to copy (format: `database.schema.table`) |
| `--metadata-only` | Copy only metadata (DDL) without data |
| `--data-only` | Copy only data, excluding metadata |
| `--on-segment-threshold` | Row-count threshold for enabling segment-level parallel copy (default: 1,000,000) |
| `--truncate` | Truncate existing target tables before copy |
| `--append` | Append data to existing target tables |
| `--copy-jobs` | Number of parallel copy processes (default: 4) |
| `--compression` | Enable compression during data transfer |


## 4. Post-Migration Validation

After migration, compare the warehouse database objects (tables, indexes, views, sequences, and functions) between the Greenplum 6 source and the Apache Cloudberry target to verify data and metadata integrity.

### Object Validation (Tables, Indexes, Views, Sequences)

**1. Source: Greenplum 6**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  c.relname as "Name",&lt;br&gt;
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",&lt;br&gt;
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner", CASE c.relstorage WHEN 'h' THEN 'heap' WHEN 'x' THEN 'external' WHEN 'a' THEN 'append only' WHEN 'v' THEN 'none' WHEN 'c' THEN 'append only columnar' WHEN 'p' THEN 'Apache Parquet' WHEN 'f' THEN 'foreign' END as "Storage"&lt;br&gt;
FROM pg_catalog.pg_class c&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace&lt;br&gt;
WHERE c.relkind IN ('r','v','m','S','f','')&lt;br&gt;
AND c.relstorage IN ('h', 'a', 'c','x','f','v','')&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
      AND n.nspname !~ '^pg_toast'&lt;br&gt;
  AND pg_catalog.pg_table_is_visible(c.oid)&lt;br&gt;
ORDER BY 1,2;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**2. Target: Apache Cloudberry**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  c.relname as "Name",&lt;br&gt;
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'd' THEN 'directory table' WHEN 'v' THEN 'view' WHEN 'm' THEN CASE c.relisdynamic WHEN true THEN 'dynamic table' ELSE 'materialized view' END WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",&lt;br&gt;
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",&lt;br&gt;
  CASE a.amname WHEN 'ao_column' THEN 'append only columnar' WHEN 'ao_row' THEN 'append only' ELSE a.amname END as "Storage"&lt;br&gt;
FROM pg_catalog.pg_class c&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace&lt;br&gt;
     LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam&lt;br&gt;
     LEFT JOIN pg_catalog.pg_am a ON a.oid = c.relam&lt;br&gt;
WHERE c.relkind IN ('r','p','v','m','S','f','')&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname !~ '^pg_toast'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
      AND n.nspname !~ '^pg_toast'&lt;br&gt;
  AND pg_catalog.pg_table_is_visible(c.oid)&lt;br&gt;
ORDER BY 1,2;&lt;br&gt;
&lt;/p&gt;
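&lt;p&gt;Eyeballing the two result sets works for a handful of objects; for larger schemas the comparison is easy to script. Assuming each catalog query is exported to a plain-text file with one object per line (for example via &lt;code&gt;psql -At&lt;/code&gt;), a sorted &lt;code&gt;comm -3&lt;/code&gt; reports anything present on only one side. The file names here are illustrative.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Diff two exported object lists (one object per line). Prints lines that
# appear on only one side; empty output means source and target match.
compare_objects() {
  sort "$1" > /tmp/src_sorted.txt
  sort "$2" > /tmp/dst_sorted.txt
  # comm -3 suppresses the column of lines common to both files.
  comm -3 /tmp/src_sorted.txt /tmp/dst_sorted.txt
}

# Illustrative data: identical lists produce no output.
printf 'public.accounts|table\npublic.orders|table\n' > /tmp/src_objects.txt
printf 'public.accounts|table\npublic.orders|table\n' > /tmp/dst_objects.txt
compare_objects /tmp/src_objects.txt /tmp/dst_objects.txt
```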

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### User-Defined Function Validation

**1. Source: Greenplum 6**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  p.proname as "Name",&lt;br&gt;
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",&lt;br&gt;
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",&lt;br&gt;
 CASE&lt;br&gt;
  WHEN p.proisagg THEN 'agg'&lt;br&gt;
  WHEN p.proiswindow THEN 'window'&lt;br&gt;
  WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'&lt;br&gt;
  ELSE 'func'&lt;br&gt;
 END as "Type"&lt;br&gt;
FROM pg_catalog.pg_proc p&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace&lt;br&gt;
WHERE pg_catalog.pg_function_is_visible(p.oid)&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
ORDER BY 1, 2, 4;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**2. Target: Apache Cloudberry**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  p.proname as "Name",&lt;br&gt;
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",&lt;br&gt;
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",&lt;br&gt;
 CASE p.prokind&lt;br&gt;
  WHEN 'a' THEN 'agg'&lt;br&gt;
  WHEN 'w' THEN 'window'&lt;br&gt;
  WHEN 'p' THEN 'proc'&lt;br&gt;
  ELSE 'func'&lt;br&gt;
 END as "Type"&lt;br&gt;
FROM pg_catalog.pg_proc p&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace&lt;br&gt;
WHERE pg_catalog.pg_function_is_visible(p.oid)&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
ORDER BY 1, 2, 4;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### cbcopy Parameters Reference

(Refer to the cbcopy parameter documentation and examples for complete usage and configuration guidance.)



## Welcome to Apache Cloudberry:

- **Visit the website:** https://cloudberry.apache.org
- **Follow us on GitHub:** https://github.com/apache/cloudberry
- **Join Slack workspace:** https://apache-cloudberry.slack.com
- **Dev mailing list:**
  - To subscribe to dev mailing list: Send an email to `dev-subscribe@cloudberry.apache.org`
  - To browse past dev mailing list discussions: https://lists.apache.org/list.html?dev@cloudberry.apache.org

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>postgres</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Unlocking the Power of Dynamic Tables: A Thanksgiving Transformation</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Fri, 28 Nov 2025 07:41:50 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/unlocking-the-power-of-dynamic-tables-a-thanksgiving-transformation-5l1</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/unlocking-the-power-of-dynamic-tables-a-thanksgiving-transformation-5l1</guid>
      <description>&lt;p&gt;Apache Cloudberry is an advanced and mature open-source Massively Parallel Processing (MPP) database, derived from the open-source version of the Pivotal Greenplum Database® but built on a more modern PostgreSQL kernel and with more advanced enterprise capabilities. Cloudberry can serve as a data warehouse and can also be used for large-scale analytics and AI/ML workloads.&lt;/p&gt;

&lt;p&gt;As Thanksgiving approached, Harvest Analytics was gearing up for its biggest sales event of the year. With customers eager to secure the best deals, the team knew they needed real-time visibility into their sales data to drive informed decisions. But they faced a critical challenge: how could they efficiently query streaming data from Kafka while still getting instant access to key metrics?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Harvest Analytics relied heavily on a lakehouse architecture, using kafka_fdw to pull streaming data into their Apache Cloudberry database. However, queries on the external data were often sluggish, hampering their ability to respond quickly during peak sales periods. The team needed a solution that could deliver fast, auto-refreshing access to this critical information.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;The Discovery of Dynamic Tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One day, while discussing their challenges over coffee, Chief Data Officer Lisa recalled a powerful new feature in Cloudberry: Dynamic Tables. These auto-refreshing materialized views could pull data from base tables, external tables, and even other dynamic tables, automatically optimizing query performance.&lt;/p&gt;

&lt;p&gt;Excited by the potential, Lisa gathered her team to explore how Dynamic Tables could revolutionize their data access. With the ability to automatically rewrite user SQL queries to utilize these dynamic tables, the team saw a glimmer of hope. They were particularly drawn to the declarative programming aspect of Dynamic Tables, allowing them to define their pipeline outcomes using straightforward SQL without worrying about the intricacies of the steps involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With a sense of urgency, the Harvest Analytics team quickly set up Dynamic Tables to ingest their Kafka data. They configured the tables to refresh every minute, ensuring that the latest sales data was always available for analysis. The key advantage was how these dynamic tables seamlessly integrated with their existing infrastructure, effectively bridging the gap between external lakehouse data and internal analytics.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE DYNAMIC TABLE dynamic_table_orders SCHEDULE '5 * * * *' AS
SELECT COUNT(*) AS a FROM foreign_table_orders WHERE amount &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As Thanksgiving Day arrived, the transformation was evident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Real-Time Insights: Thanks to Dynamic Tables, the team could now perform continuous queries on their Kafka data, aggregating sales metrics every minute. They could visualize total sales in real-time, empowering them to adjust marketing strategies on the fly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic SQL Rewriting: When team members queried external data, Cloudberry automatically recognized qualifying SQL and rewrote it to utilize the Dynamic Tables. This meant that users could focus on their analysis without worrying about the underlying complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Speed and Efficiency: The performance boost was staggering. Queries that once took minutes now returned results in seconds, allowing Harvest Analytics to react swiftly to customer behavior and maximize sales opportunities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seamless Integration: The implementation of Dynamic Tables was smooth and required minimal changes to their existing workflows. The team could continue using their familiar tools while benefiting from the advanced capabilities of Cloudberry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplified Pipeline Management: The declarative nature of Dynamic Tables reduced the complexity of their data workflows, allowing the team to focus on outcomes rather than technical details. This simplification meant that even complex data operations became manageable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexible Data Pipelines: With transparent orchestration, the team could easily construct pipelines tailored to their needs, ensuring that data was always up-to-date and ready for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Impact&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the day unfolded, Harvest Analytics experienced record-breaking sales. Their ability to monitor and respond to trends in real-time transformed their Thanksgiving event into a resounding success. The team celebrated not only their sales achievements but also the newfound power of Dynamic Tables.&lt;/p&gt;

&lt;p&gt;Encouraged by their success, Harvest Analytics shared their story with other companies in the industry. They highlighted how Cloudberry’s Dynamic Tables had changed the game, allowing them to run queries on external Kafka data as swiftly as if it were internal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Call to Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your organization is grappling with the same challenges as Harvest Analytics, it’s time to unlock the potential of Dynamic Tables in Apache Cloudberry. Experience the benefits of auto-refreshing materialized views, seamless SQL rewriting, and lightning-fast queries on lakehouse data. &lt;/p&gt;

&lt;p&gt;Join the growing community of Cloudberry users who are transforming their data strategies and driving success in their businesses. Don’t let external data slow you down—embrace Dynamic Tables and watch your insights flourish this Thanksgiving and beyond!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Welcome to Apache Cloudberry:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit the website: &lt;a href="https://cloudberry.apache.org" rel="noopener noreferrer"&gt;https://cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on GitHub: &lt;a href="https://github.com/apache/cloudberry" rel="noopener noreferrer"&gt;https://github.com/apache/cloudberry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join Slack workspace: &lt;a href="https://apache-cloudberry.slack.com" rel="noopener noreferrer"&gt;https://apache-cloudberry.slack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev mailing list:

&lt;ul&gt;
&lt;li&gt;To subscribe to dev mailing list: Send an email to &lt;a href="mailto:dev-subscribe@cloudberry.apache.org"&gt;dev-subscribe@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To browse past dev mailing list discussions: &lt;a href="https://lists.apache.org/list.html?dev@cloudberry.apache.org" rel="noopener noreferrer"&gt;https://lists.apache.org/list.html?dev@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Comparing Full-Text Search Options on Cloudberry: ParadeDB BM25 vs GIN vs ZomboDB</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Fri, 14 Nov 2025 09:10:08 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/ji-yu-cloudberry-de-quan-wen-jian-suo-fang-an-dui-bi-paradedb-bm25-vs-gin-vs-zombodb-3jfk</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/ji-yu-cloudberry-de-quan-wen-jian-suo-fang-an-dui-bi-paradedb-bm25-vs-gin-vs-zombodb-3jfk</guid>
      <description></description>
    </item>
  </channel>
</rss>
