<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maya S.</title>
    <description>The latest articles on DEV Community by Maya S. (@maya_sun_29e7bf629e5dd7b3).</description>
    <link>https://dev.to/maya_sun_29e7bf629e5dd7b3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3610700%2Fc605f627-c79a-4346-b78e-7c6d5c27603b.jpg</url>
      <title>DEV Community: Maya S.</title>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maya_sun_29e7bf629e5dd7b3"/>
    <language>en</language>
    <item>
      <title>Apache Cloudberry 2.0: Rebuilding Storage for the Cloud-Native Era with PAX</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:37:44 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/apache-cloudberry-20-rebuilding-storage-for-the-cloud-native-era-with-pax-5a9b</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/apache-cloudberry-20-rebuilding-storage-for-the-cloud-native-era-with-pax-5a9b</guid>
      <description>&lt;h2&gt;
  
  
  Rethinking AOCS: When Architecture Meets a New Infrastructure Reality
&lt;/h2&gt;

&lt;h2&gt;
  
  
  From a Solid Design to a Structural Mismatch
&lt;/h2&gt;

&lt;p&gt;The AO/AOCS storage engine, inherited from Greenplum, was originally built for on-premises environments. Its design—column-per-file with append-only writes—worked well on block storage and traditional file systems, delivering stable performance for OLAP workloads.&lt;br&gt;
But the infrastructure landscape has changed.&lt;br&gt;
As storage shifts toward cloud-native object storage, the assumptions behind AOCS no longer hold. Object storage favors large, sequential I/O and request aggregation, while AOCS relies on independent column files and frequent small appends. The result is not just inefficiency—it is a structural mismatch.&lt;br&gt;
In real-world workloads, this manifests as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exploding request counts when scanning wide tables (one request per column per file)&lt;/li&gt;
&lt;li&gt;Severe request amplification due to unmerged small writes&lt;/li&gt;
&lt;li&gt;Degraded sequential read performance caused by fragmented column layouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the same time, tight kernel coupling and limited thread safety make it difficult to fully leverage multi-threading and vectorized execution. What used to be a reasonable design has become a constraint, not just on performance but on the system’s ability to evolve.&lt;/p&gt;

&lt;h2&gt;Why Incremental Fixes Were Not Enough&lt;/h2&gt;

&lt;p&gt;Extensive stress testing revealed a clear pattern: the bottleneck was not localized; it was systemic. Tuning parameters, improving caches, and adding execution-layer optimizations helped, but only marginally. The core issue remained: the storage model itself was not aligned with the cloud environment. Continuing to patch AOCS would only add more layers of complexity and technical debt.&lt;/p&gt;

&lt;p&gt;The conclusion was straightforward: instead of adapting a legacy design, Cloudberry needed a storage engine built for object storage from the ground up. This led to the introduction of PAX.&lt;/p&gt;

&lt;h2&gt;
  
  
  PAX: A Storage Model Designed for the Cloud
&lt;/h2&gt;

&lt;p&gt;PAX is not just a replacement for AOCS. It is a redefinition of how storage should work in a cloud-native data warehouse, balancing analytical performance, transactional needs, and long-term evolvability.&lt;/p&gt;

&lt;h3&gt;A New Paradigm: Row–Column Co-existence&lt;/h3&gt;

&lt;p&gt;Traditional database systems force a trade-off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row storage → optimized for transactions&lt;/li&gt;
&lt;li&gt;Column storage → optimized for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PAX removes this dichotomy. Within the same physical file and logical block, PAX organizes data in a columnar layout while preserving row-level access semantics. This hybrid design enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient analytical scans that read only the required columns&lt;/li&gt;
&lt;li&gt;Merged multi-column writes that reduce small-file pressure on object storage&lt;/li&gt;
&lt;li&gt;Shared file structures across columns, significantly reducing request overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a storage model that performs consistently across mixed OLTP + OLAP workloads, which are increasingly common on modern data platforms.&lt;/p&gt;
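&lt;p&gt;To make the hybrid layout concrete, here is a minimal, illustrative Python sketch of a PAX-style block: rows are appended together but stored column-major inside one physical unit, so both row reconstruction and single-column scans stay cheap. The class and method names are invented for illustration; they are not Cloudberry internals.&lt;/p&gt;

```python
# Illustrative sketch only: a toy PAX-style block that stores rows
# column-major inside a single physical unit. Names are invented for
# illustration, not Cloudberry internals.
class PaxBlock:
    def __init__(self, columns):
        self.columns = columns                 # column order
        self.data = {c: [] for c in columns}   # one array per column

    def append_row(self, row):
        # Rows arrive together (one write) but land in per-column arrays,
        # so multi-column writes are merged into one physical unit.
        for c in self.columns:
            self.data[c].append(row[c])

    def scan_column(self, name):
        # Analytical path: touch only the required column.
        return self.data[name]

    def get_row(self, i):
        # Row-level path: reassemble row i from every column array.
        return {c: self.data[c][i] for c in self.columns}

block = PaxBlock(["id", "age", "city"])
block.append_row({"id": 1, "age": 34, "city": "Oslo"})
block.append_row({"id": 2, "age": 17, "city": "Lima"})

assert block.scan_column("age") == [34, 17]
assert block.get_row(1) == {"id": 2, "age": 17, "city": "Lima"}
```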

&lt;h2&gt;
  
  
  A Layered Architecture Built for Evolution
&lt;/h2&gt;

&lt;p&gt;PAX adopts a strictly layered design to ensure modularity and long-term extensibility:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc9biw35e9oxnz3lx8cw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc9biw35e9oxnz3lx8cw.png" alt=" " width="460" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Handler Layer:&lt;/strong&gt; integrates with Cloudberry’s Access Method (AM), handling transactions and lifecycle management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Layer:&lt;/strong&gt; bridges execution engines and storage, supporting both row-based and vectorized execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MicroPartition Layer:&lt;/strong&gt; manages physical data organization (files and stripes), including statistics and pruning logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column Layer:&lt;/strong&gt; defines in-memory column structures, handling encoding, decoding, and alignment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Layer:&lt;/strong&gt; encapsulates storage interactions, including data files, metadata, and visibility maps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation of concerns allows PAX to evolve independently at each layer, paving the way for features like multi-threaded execution and distributed transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Management: A Lightweight Control Plane for Storage
&lt;/h2&gt;

&lt;p&gt;PAX adopts a lightweight yet effective metadata management strategy based on auxiliary tables built on the Heap Access Method (Heap AM).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jb1nvuagnqwl5abqr2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jb1nvuagnqwl5abqr2s.png" alt=" " width="593" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each physical data file corresponds to a single record in the auxiliary table. This mapping provides a consistent control plane for storage, enabling the engine to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quickly locate data files&lt;/li&gt;
&lt;li&gt;Track file lifecycle changes&lt;/li&gt;
&lt;li&gt;Evaluate transactional visibility efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The auxiliary table maintains essential metadata such as file identifiers, states, and visibility-related attributes, ensuring that storage operations remain both predictable and low-overhead. In addition, PAX maintains a global fast sequence table to generate unique BLOCKNAMEs, guaranteeing globally unique file naming across nodes and transactions. More importantly, this mechanism serves as the foundation for associating Visimap files with their corresponding data files, ensuring correctness and consistency in distributed visibility control.&lt;/p&gt;
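&lt;p&gt;The bookkeeping described above can be sketched as a toy model: one metadata record per data file, plus a global counter standing in for the fast sequence table. All field and class names here are hypothetical, chosen only to illustrate the control-plane idea.&lt;/p&gt;

```python
# Toy sketch of the metadata bookkeeping described above: one record
# per data file, plus a global counter standing in for the fast
# sequence table. Field and class names are hypothetical.
import itertools

class FastSequence:
    """Hands out unique BLOCKNAME-style identifiers."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_blockname(self, table_oid):
        # Globally unique within this toy: table oid + monotonic id.
        return f"{table_oid}_{next(self._counter)}"

class AuxTable:
    """One metadata record per physical data file."""
    def __init__(self):
        self.records = {}

    def register_file(self, blockname, tuple_count):
        self.records[blockname] = {
            "state": "live",
            "tuples": tuple_count,
            "visimap": None,    # attached only if rows get deleted
        }

    def attach_visimap(self, blockname, visimap_file):
        self.records[blockname]["visimap"] = visimap_file

seq = FastSequence()
aux = AuxTable()
name = seq.next_blockname(table_oid=16384)
aux.register_file(name, tuple_count=10_000)
aux.attach_visimap(name, name + ".visimap")

assert name == "16384_1"
assert aux.records[name]["visimap"] == "16384_1.visimap"
```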




&lt;h3&gt;Rethinking MVCC for Object Storage&lt;/h3&gt;

&lt;p&gt;Traditional MVCC in PostgreSQL relies on row-level versioning. On object storage, that approach becomes prohibitively expensive due to excessive I/O and metadata operations. PAX instead introduces a file-level visibility model: rather than tracking visibility per row, it uses Visimap (.visimap) files to represent visibility at the file level. This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock-free concurrent reads&lt;/li&gt;
&lt;li&gt;Minimal metadata overhead&lt;/li&gt;
&lt;li&gt;Efficient visibility checks at read time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a fundamental shift that aligns concurrency control with the realities of object storage.&lt;/p&gt;
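&lt;p&gt;A toy stand-in for the file-level visibility check might look like the following. The real Visimap is a compact on-disk bitmap with its own format; this sketch only shows the access pattern: readers consult a per-file map instead of per-row version metadata.&lt;/p&gt;

```python
# Toy stand-in for PAX's file-level visibility (not the real on-disk
# format): each data file may carry one Visimap that hides deleted
# rows; readers consult it without locking the data file itself.
class Visimap:
    def __init__(self):
        # Row offsets hidden by this map. A real visimap is a compact
        # bitmap; a set keeps the sketch short.
        self.deleted = set()

    def mark_deleted(self, row):
        self.deleted.add(row)

    def is_visible(self, row):
        return row not in self.deleted

vm = Visimap()
vm.mark_deleted(42)
assert not vm.is_visible(42)
assert vm.is_visible(41)
# A file with no deletions needs no visimap at all, so a single
# metadata lookup replaces per-row version checks.
```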

&lt;h2&gt;
  
  
  PORC_VEC: When Storage Becomes Execution
&lt;/h2&gt;

&lt;p&gt;One of the most impactful innovations in PAX is PORC_VEC (PostgreSQL ORC Vectorized).&lt;br&gt;
In traditional systems, data must be transformed into a vectorized format before execution—incurring CPU and memory overhead. PORC_VEC eliminates this step entirely.&lt;br&gt;
Key characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-copy reads: data is consumed directly by the execution engine&lt;/li&gt;
&lt;li&gt;Cache-aligned layout: optimized for modern CPU architectures&lt;/li&gt;
&lt;li&gt;Unified metadata model: aligned with in-memory column structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to a powerful principle: the storage format is the execution format. In internal tests, PORC_VEC reduced CPU usage by roughly 20% and improved query throughput by 15–25%.&lt;/p&gt;
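&lt;p&gt;The zero-copy idea can be illustrated in a few lines of Python, with the caveat that the actual PORC_VEC format is far more elaborate: when the on-disk column layout already matches the in-memory vector layout, the executor can view the bytes directly instead of re-materializing them into a separate vectorized structure.&lt;/p&gt;

```python
# A sketch of the zero-copy idea behind PORC_VEC, not the real format:
# if the on-disk column layout already matches the in-memory vector
# layout, the executor can view the bytes directly instead of
# re-materializing them.
import array

# Pretend this buffer came straight out of a PAX file: a contiguous,
# native-endian run of 64-bit integers for one column.
on_disk = array.array("q", [10, 20, 30, 40]).tobytes()

# "Load" the column as a typed view over the same bytes: no copy made.
col = memoryview(on_disk).cast("q")

assert list(col) == [10, 20, 30, 40]
assert col.obj is on_disk   # still backed by the original buffer
```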

&lt;h2&gt;
  
  
  Column Layer: Bridging Storage and Execution
&lt;/h2&gt;

&lt;p&gt;The Column layer serves as the core in-memory abstraction for columnar data in PAX, bridging persistent storage and the execution engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lhv8557sqlnn3qr46tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lhv8557sqlnn3qr46tw.png" alt=" " width="441" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is responsible for both data representation and transformation, with a design centered on efficiency, flexibility, and alignment with vectorized execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk-to-Memory Mapping
Column loads column data from disk into memory and flushes in-memory data back to storage during write operations.&lt;/li&gt;
&lt;li&gt;Format Transformation
It performs efficient format conversion along read and write paths, ensuring consistency between on-disk and in-memory layouts while minimizing overhead.&lt;/li&gt;
&lt;li&gt;Encoding and Compression
Multiple techniques—such as RLEv2, dictionary encoding, and ZSTD—are integrated to reduce storage footprint without sacrificing query performance.&lt;/li&gt;
&lt;li&gt;Flexible Access Interfaces

&lt;ul&gt;
&lt;li&gt;Row-level interfaces for transactional workloads&lt;/li&gt;
&lt;li&gt;Batch-oriented interfaces for analytical and vectorized execution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Memory Alignment and Complex Type Optimization

&lt;ul&gt;
&lt;li&gt;Memory layout follows CPU cache alignment principles to improve access efficiency&lt;/li&gt;
&lt;li&gt;Complex types (e.g., arrays and range types) adopt independent alignment and offset control to reduce parsing overhead
With these design choices, the Column layer balances performance, memory efficiency, and concurrency scalability, while providing a solid foundation for vectorized execution and parallel scanning.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Foundations: Four Key Mechanisms
&lt;/h2&gt;

&lt;p&gt;PAX’s performance gains are not accidental—they are the result of deliberate architectural choices.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse Filtering.&lt;/strong&gt; By maintaining min/max statistics and Bloom filters at the file and stripe levels, PAX can aggressively prune irrelevant data. For example, a query with WHERE age &amp;lt; 18 skips entire data blocks where min(age) &amp;gt;= 18. This reduces I/O requests by over 60% on average, bringing object-storage performance closer to that of in-memory systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intelligent Physical Layout (Cluster).&lt;/strong&gt; PAX aligns the physical data layout with query patterns through automatic clustering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Z-Order → optimized for multi-dimensional range queries&lt;/li&gt;
&lt;li&gt;Lexical Order → optimized for multi-column filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This improves data locality and significantly reduces random I/O.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Modern Memory Management.&lt;/strong&gt; PAX evolved through three stages of memory management, ultimately adopting smart pointers (unique_ptr, shared_ptr) and thread-aware resource management. This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No memory leaks under high concurrency&lt;/li&gt;
&lt;li&gt;Safe cleanup during early exits or failures&lt;/li&gt;
&lt;li&gt;Stable behavior in multi-threaded vectorized execution&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
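&lt;p&gt;The sparse-filtering mechanism can be sketched in a few lines. This is a simplified model of block-level min/max pruning only; real PAX also keeps Bloom filters and stripe-level statistics, and the block layout here is invented for illustration.&lt;/p&gt;

```python
# Minimal sketch of min/max sparse filtering as described above
# (real PAX also keeps Bloom filters and stripe-level statistics;
# this block layout is invented for illustration).
blocks = [
    {"id": 0, "min_age": 19, "max_age": 60, "ages": [19, 35, 60]},
    {"id": 1, "min_age": 5,  "max_age": 17, "ages": [5, 12, 17]},
    {"id": 2, "min_age": 30, "max_age": 45, "ages": [30, 45]},
]

def scan_age_below(blocks, limit):
    """Evaluate 'WHERE age below limit' with block-level pruning."""
    hits, blocks_read = [], 0
    for b in blocks:
        if b["min_age"] >= limit:
            continue               # whole block pruned: no I/O at all
        blocks_read += 1           # block must actually be fetched
        hits.extend(a for a in b["ages"] if limit > a)
    return hits, blocks_read

hits, blocks_read = scan_age_below(blocks, 18)
assert hits == [5, 12, 17]
assert blocks_read == 1            # blocks 0 and 2 were never read
```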

&lt;h2&gt;
  
  
  Benchmark Results: Quantifying the Gains
&lt;/h2&gt;

&lt;p&gt;In 1TB TPC-H and TPC-DS benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average performance improvement: 15%–25%&lt;/li&gt;
&lt;li&gt;Complex queries (joins, aggregations): up to 40% faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These gains come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced I/O amplification&lt;/li&gt;
&lt;li&gt;Lower CPU overhead via zero-copy execution&lt;/li&gt;
&lt;li&gt;More stable latency under complex workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing Thoughts: Engineering for the Real World
&lt;/h2&gt;

&lt;p&gt;PAX reflects a deliberate shift in engineering philosophy:&lt;br&gt;
Not optimizing around constraints—but removing them.&lt;br&gt;
By aligning storage design with object storage characteristics, and tightly integrating execution with data format, PAX establishes a foundation that is both high-performance and future-proof.&lt;br&gt;
Looking ahead, Cloudberry will continue to evolve PAX with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delta storage for incremental updates&lt;/li&gt;
&lt;li&gt;Deeper optimizer integration&lt;/li&gt;
&lt;li&gt;SIMD-accelerated execution&lt;/li&gt;
&lt;li&gt;Adaptive, self-tuning statistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All with a single goal: to make Cloudberry a continuously evolving data platform.&lt;/p&gt;

&lt;p&gt;Welcome to Apache Cloudberry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit the website: &lt;a href="https://cloudberry.apache.org" rel="noopener noreferrer"&gt;https://cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on GitHub: &lt;a href="https://github.com/apache/cloudberry" rel="noopener noreferrer"&gt;https://github.com/apache/cloudberry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join Slack workspace: &lt;a href="https://apache-cloudberry.slack.com" rel="noopener noreferrer"&gt;https://apache-cloudberry.slack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev mailing list:

&lt;ul&gt;
&lt;li&gt;To subscribe to dev mailing list: Send an email to &lt;a href="mailto:dev-subscribe@cloudberry.apache.org"&gt;dev-subscribe@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To browse past dev mailing list discussions: &lt;a href="https://lists.apache.org/list.html?dev@cloudberry.apache.org" rel="noopener noreferrer"&gt;https://lists.apache.org/list.html?dev@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>apachecloudberry</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Rethinking Stream-Batch Unification: Real-Time Processing with Incremental Materialized Views in Apache Cloudberry</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Tue, 16 Dec 2025 09:34:04 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/rethinking-stream-batch-unification-real-time-processing-with-incremental-materialized-views-in-3b4d</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/rethinking-stream-batch-unification-real-time-processing-with-incremental-materialized-views-in-3b4d</guid>
      <description>&lt;p&gt;Apache Cloudberry is an advanced and mature open-source Massively Parallel Processing (MPP) database, derived from the open-source version of the Pivotal Greenplum Database® but built on a more modern PostgreSQL kernel and with more advanced enterprise capabilities. Cloudberry can serve as a data warehouse and can also be used for large-scale analytics and AI/ML workloads.&lt;/p&gt;

&lt;p&gt;In today’s data-driven landscape, “real-time” capabilities have become a business imperative. Every company wants to detect changes instantly and respond to user needs as they happen. Streaming engines such as Apache Flink, with their powerful capabilities and ultra-low latency, have set a compelling vision for what real-time data processing can achieve.&lt;br&gt;
Yet the reality is often far more complicated. For many organizations — especially those without large, specialized engineering teams — building and maintaining a Flink-based, stream-batch unified platform can be both powerful and painful. You gain real-time insights, but only by accepting significant architectural complexity and operational overhead.&lt;br&gt;
Is there a simpler, more elegant path to stream-batch unification?&lt;br&gt;
 Yes — and it has become increasingly practical.&lt;br&gt;
With the rise of modern database technologies, solutions such as Incremental Materialized Views (IVM) in Apache Cloudberry are emerging as a cleaner, lighter alternative: in-database stream processing.&lt;/p&gt;




&lt;h2&gt;The “Heavyweight” Approach: The Power and Pain of Flink&lt;/h2&gt;

&lt;p&gt;A Flink-centered architecture is undoubtedly powerful, but it also comes with several burdens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Complex architecture and high operational costs
A typical pipeline stitches together many components — applications, MySQL, CDC tools, Kafka, Flink, and a data warehouse or data lake. Each component requires specialized expertise, and a failure in any part can break the entire chain.&lt;/li&gt;
&lt;li&gt;High development overhead
In the classic Lambda architecture, teams must maintain two separate codebases — one for streaming (Flink) and one for batch (Spark or Hive). That means double the logic, double the testing, and a persistent risk of inconsistency.&lt;/li&gt;
&lt;li&gt;Steep learning curve
Mastering Flink is non-trivial. State management, time semantics, watermarks, windowing, and performance tuning demand deep expertise and continuous operational effort — something many teams cannot afford.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Simplifying Stream-Batch Processing Inside the Database&lt;/h2&gt;

&lt;p&gt;Cloudberry takes a bold yet simple approach: why not let the database itself handle streaming computation? This is the essence of in-database stream-batch unification, powered by Incremental Materialized Views (IVM). An IVM functions as a “live” materialized result that automatically stays up to date.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch phase:&lt;/strong&gt; when you run a CREATE INCREMENTAL MATERIALIZED VIEW command, Cloudberry performs a full historical computation to build the initial view (the batch layer).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream phase:&lt;/strong&gt; subsequent INSERT, UPDATE, and DELETE operations on the source tables are captured automatically. The engine computes only the incremental changes and updates the view in near real time, typically within milliseconds to seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fundamentally simplifies what used to be a complex and error-prone workflow. Previously, teams had to define Kafka message schemas and Flink-specific data structures, and write large amounts of Flink SQL (covering data sources, windows, aggregations, dimension joins, and output tables) just to complete a single task. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Kafka data structure
{
  "sales_id": 8435,
  "event_type": "+I",
  "event_time": "2025-06-27 07:53:21Z",
  "ticket_number": 8619628,
  "item_sk": 6687,
  "customer_sk": 69684,
  "store_sk": 238,
  "quantity": 6,
  "sales_price": 179.85,
  "ext_sales_price": 1079.1,
  "net_profit": 672,
  "event_source": "CDC-TO-KAFKA-FIXED"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Before Flink can process streaming data, the data must be persisted to ensure correctness and support replay in case of failures, so the CDC → Kafka → Flink path always introduces additional transformation, configuration, and operational complexity. The following Flink SQL illustrates only the streaming-computation portion; the full pipeline requires even more components and code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Create the TPC-DS store performance aggregation result output table (output to console)
CREATE TABLE store_daily_performance (
    window_start TIMESTAMP(3),
    window_end TIMESTAMP(3),
    s_store_sk INT,
    s_store_name STRING,
    s_state STRING,
    s_market_manager STRING,
    sale_date STRING,
    total_sales_amount DECIMAL(10,2),
    total_net_profit DECIMAL(10,2),
    total_items_sold BIGINT,
    transaction_count BIGINT,
    avg_sales_price DECIMAL(7,2),
    process_time TIMESTAMP_LTZ(3)
) WITH (
    'connector' = 'print',
    'print-identifier' = 'TPCDS-STORE-PERFORMANCE'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;-- Core aggregation query
INSERT INTO store_daily_performance
SELECT
    window_start,
    window_end,
    s.ss_store_sk,
    COALESCE(sd.s_store_name, CONCAT('Store #', CAST(s.ss_store_sk AS STRING))) AS s_store_name,
    COALESCE(sd.s_state, 'Unknown') AS s_state,
    COALESCE(sd.s_market_manager, 'Unknown Manager') AS s_market_manager,
    DATE_FORMAT(window_start, 'yyyy-MM-dd') AS sale_date,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_ext_sales_price
             WHEN s.event_type = '-D' THEN -s.ss_ext_sales_price
             ELSE 0 END) AS total_sales_amount,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_net_profit
             WHEN s.event_type = '-D' THEN -s.ss_net_profit
             ELSE 0 END) AS total_net_profit,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_quantity
             WHEN s.event_type = '-D' THEN -s.ss_quantity
             ELSE 0 END) AS total_items_sold,
    COUNT(DISTINCT s.ss_ticket_number) AS transaction_count,
    AVG(s.ss_sales_price) AS avg_sales_price,
    LOCALTIMESTAMP AS process_time
FROM TABLE(
    TUMBLE(TABLE sales_events_source, DESCRIPTOR(event_time), INTERVAL '1' MINUTE)
) s
LEFT JOIN store_dim sd ON s.ss_store_sk = sd.s_store_sk
WHERE s.event_type IN ('+I', '-D', 'U')
GROUP BY
    window_start,
    window_end,
    s.ss_store_sk,
    sd.s_store_name,
    sd.s_state,
    sd.s_market_manager;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;By contrast, Cloudberry IVM can express the same task in a single SQL statement:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE INCREMENTAL MATERIALIZED VIEW tpcds.store_daily_performance_enriched_ivm
AS
SELECT
    ss.ss_store_sk AS store,
    s.s_store_name AS store_name,
    s.s_state AS state,
    s.s_market_manager AS manager,
    d.d_date AS sold_date,
    SUM(ss.ss_net_paid_inc_tax) AS total_sales_amount,
    SUM(ss.ss_net_profit) AS total_net_profit,
    SUM(ss.ss_quantity) AS total_items_sold,
    COUNT(ss.ss_ticket_number) AS transaction_count
FROM
    tpcds.store_sales_heap ss
JOIN
    tpcds.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN
    tpcds.store s ON ss.ss_store_sk = s.s_store_sk
GROUP BY
    ss.ss_store_sk,
    s.s_store_name,
    s.s_state,
    s.s_market_manager,
    d.d_date
DISTRIBUTED BY (ss_store_sk);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All of the complexity (state management, consistency handling, incremental computation, scheduling, and triggering) is handled transparently by the database engine. This eliminates the need to orchestrate numerous intermediate streaming jobs and significantly reduces development and operational costs.&lt;/p&gt;
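&lt;p&gt;What the engine does under the hood can be sketched with a toy delta-maintenance loop, simplified to one SUM and one COUNT grouped by store. A real IVM engine also handles joins, other aggregates, and transactional consistency; the function and field names below are illustrative, not Cloudberry internals.&lt;/p&gt;

```python
# A hedged sketch of incremental view maintenance for a view like the
# one above, simplified to one SUM and one COUNT grouped by store.
# Real IVM also handles joins, other aggregates, and transactional
# consistency; names here are illustrative.
from collections import defaultdict

view = defaultdict(lambda: {"total_sales_amount": 0.0,
                            "transaction_count": 0})

def apply_change(view, op, row):
    # Apply only the delta of one INSERT or DELETE; a DELETE simply
    # reverses the contribution the row made on INSERT.
    sign = 1 if op == "INSERT" else -1
    group = view[row["store_sk"]]
    group["total_sales_amount"] += sign * row["net_paid_inc_tax"]
    group["transaction_count"] += sign

apply_change(view, "INSERT", {"store_sk": 238, "net_paid_inc_tax": 179.85})
apply_change(view, "INSERT", {"store_sk": 238, "net_paid_inc_tax": 20.15})
apply_change(view, "DELETE", {"store_sk": 238, "net_paid_inc_tax": 20.15})

assert view[238]["transaction_count"] == 1
assert round(view[238]["total_sales_amount"], 2) == 179.85
```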

&lt;h2&gt;The Perfect Pair: Incremental Materialized Views and Dynamic Tables&lt;/h2&gt;

&lt;p&gt;Cloudberry also provides another mechanism: Dynamic Tables. While both are types of materialized views, they serve different purposes depending on latency requirements and workload characteristics. In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose IVM when you need low latency and immediate updates.&lt;/li&gt;
&lt;li&gt;Choose Dynamic Tables when you can tolerate some delay and need to handle large datasets efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Practical Considerations: Performance and Limitations&lt;/h2&gt;

&lt;p&gt;No technology is perfect, and IVM is no exception.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance overhead: because IVMs update incrementally on every write, they add some transactional overhead to source tables, especially when multiple IVMs depend on the same table.&lt;/li&gt;
&lt;li&gt;Feature limitations: the current version of Cloudberry IVM does not yet support MIN, MAX, window functions, LEFT/OUTER JOIN, CTEs, or partitioned tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These gaps are actively being addressed by the open-source community.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For top-tier internet companies, investing in large, Flink-based infrastructures makes sense: they can absorb the complexity in pursuit of maximum flexibility and performance. Most organizations, however, do not need heavyweight systems. They need a simple, reliable, and cost-effective way to gain real-time insights.&lt;br&gt;
Cloudberry’s Incremental Materialized Views provide exactly that: a unified stream-batch processing model built directly into the database, powered by plain SQL, with consistency, simplicity, and efficiency in a single system. This may well be the most practical path to bringing real-time data capabilities to every enterprise.&lt;/p&gt;

&lt;p&gt;Welcome to Apache Cloudberry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit the website: &lt;a href="https://cloudberry.apache.org" rel="noopener noreferrer"&gt;https://cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on GitHub: &lt;a href="https://github.com/apache/cloudberry" rel="noopener noreferrer"&gt;https://github.com/apache/cloudberry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join Slack workspace: &lt;a href="https://apache-cloudberry.slack.com" rel="noopener noreferrer"&gt;https://apache-cloudberry.slack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev mailing list:

&lt;ul&gt;
&lt;li&gt;To subscribe to dev mailing list: Send an email to &lt;a href="mailto:dev-subscribe@cloudberry.apache.org"&gt;dev-subscribe@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To browse past dev mailing list discussions: &lt;a href="https://lists.apache.org/list.html?dev@cloudberry.apache.org" rel="noopener noreferrer"&gt;https://lists.apache.org/list.html?dev@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Migrate the legacy Greenplum to Apache Cloudberry with cbcopy</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Tue, 16 Dec 2025 09:33:15 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/migrate-the-legacy-greenplum-to-apache-cloudberry-with-cbcopy-2fhf</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/migrate-the-legacy-greenplum-to-apache-cloudberry-with-cbcopy-2fhf</guid>
      <description>&lt;p&gt;In the field of data warehousing and big data analytics, Greenplum Database has long been recognized as a leading open-source Massively Parallel Processing (MPP) database. However, since Greenplum transitioned to a closed-source model, users have increasingly encountered limitations in areas such as version upgrades, bug fixes, and feature extensions. Against this backdrop, Apache Cloudberry emerged.&lt;/p&gt;

&lt;p&gt;As an open-source derivative of Greenplum, Apache Cloudberry is highly compatible with Greenplum’s architecture and SQL syntax while providing comprehensive enhancements in functionality, performance, and security. Apache Cloudberry has quickly become the most promising open-source alternative to Greenplum.&lt;/p&gt;

&lt;p&gt;Beyond supporting more efficient parallel query execution and advanced resource management, the Apache Cloudberry community also introduces a dedicated data loading and migration tool — cbcopy — which enables seamless and highly efficient migration from Greenplum to Cloudberry.&lt;/p&gt;

&lt;p&gt;This article provides an in-depth overview of cbcopy — its features, internal mechanisms, and practical usage — followed by a complete case study demonstrating how to perform a fast and seamless migration from Greenplum to Apache Cloudberry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to cbcopy
&lt;/h2&gt;

&lt;p&gt;cbcopy is a data migration utility designed to transfer data across different database clusters. It can quickly replicate both metadata and actual data from a Greenplum cluster to an Apache Cloudberry cluster. The tool supports full migration of database objects, including schemas, tables, indexes, views, roles, user-defined functions, resource queues, and resource groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Levels of Migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;cbcopy supports four levels of database object migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster-level migration – migrates the entire source cluster to the target cluster.&lt;/li&gt;
&lt;li&gt;Database-level migration – migrates a specific database from the source cluster to the target cluster.&lt;/li&gt;
&lt;li&gt;Schema-level migration – migrates a specified schema within a database from the source cluster to a target database.&lt;/li&gt;
&lt;li&gt;Table-level migration – migrates specific tables from the source cluster to the target cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Support for Different Cluster Scales&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;cbcopy can handle migrations between clusters with different numbers of compute nodes (segments):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The source and target clusters have the same number of segments.&lt;/li&gt;
&lt;li&gt;The source cluster has fewer segments than the target cluster.&lt;/li&gt;
&lt;li&gt;The source cluster has more segments than the target cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  cbcopy Mechanism
&lt;/h3&gt;

&lt;p&gt;The cbcopy utility is implemented using the COPY ON SEGMENT TO PROGRAM mechanism and the external table feature. It employs data compression during transmission to reduce network resource usage and uses checksum verification to ensure data consistency between clusters.&lt;/p&gt;

&lt;p&gt;To maximize migration performance in distributed environments, cbcopy automatically applies one of two migration strategies depending on the size of the source table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small tables (default threshold: fewer than 1,000,000 rows): data is transferred via a direct connection between the source cluster’s master node and the target cluster’s coordinator node.&lt;/li&gt;
&lt;li&gt;Large tables (1,000,000 rows or more): cbcopy launches helper processes on the segments of both the source and target clusters. These helpers establish direct connections and perform parallel data transfers between segments to achieve higher throughput.&lt;/li&gt;
&lt;/ul&gt;
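&lt;p&gt;The routing and integrity checks described above can be modeled with a short sketch. The 1,000,000-row threshold comes from this article; the function names and wire format are invented for illustration and are not cbcopy internals.&lt;/p&gt;

```python
# Toy model of the behavior described above. The 1,000,000-row
# threshold comes from the article; the function names and wire
# format are invented for illustration, not cbcopy internals.
import zlib

SMALL_TABLE_THRESHOLD = 1_000_000   # rows (cbcopy's default cutoff)

def pick_strategy(row_count):
    # Small tables transfer from the source master to the target
    # coordinator; large ones transfer segment to segment in parallel.
    if row_count >= SMALL_TABLE_THRESHOLD:
        return "segment-to-segment"
    return "master-to-coordinator"

def send_chunk(payload):
    # Compress on the wire and attach a checksum for verification.
    return zlib.compress(payload), zlib.crc32(payload)

def recv_chunk(compressed, checksum):
    payload = zlib.decompress(compressed)
    assert zlib.crc32(payload) == checksum, "cross-cluster data mismatch"
    return payload

assert pick_strategy(10_000) == "master-to-coordinator"
assert pick_strategy(5_000_000) == "segment-to-segment"
wire, crc = send_chunk(b"row data" * 100)
assert recv_chunk(wire, crc) == b"row data" * 100
```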

&lt;h2&gt;
  
  
  cbcopy in Practice
&lt;/h2&gt;

&lt;p&gt;In this section, we demonstrate how to use cbcopy to migrate data from a Greenplum 6 database cluster to an Apache Cloudberry cluster in a test environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Test Environment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source: Greenplum Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IP Address&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;192.168.194.55&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Greenplum 6.27.1&lt;/td&gt;
&lt;td&gt;Master&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.197.120&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Greenplum 6.27.1&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.192.215&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Greenplum 6.27.1&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Target: Apache Cloudberry Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IP Address&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;192.168.194.137&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Coordinator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.192.93&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.196.69&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Test Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The warehouse database in the Greenplum 6 cluster contains simulated “banking data warehouse” test data, including both transactional and historical tables.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IP Address&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;192.168.194.137&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Coordinator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.192.93&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;192.168.196.69&lt;/td&gt;
&lt;td&gt;4C / 16GB&lt;/td&gt;
&lt;td&gt;Apache Cloudberry 2.0.0&lt;/td&gt;
&lt;td&gt;Segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## 2. Migration Preparation&lt;/span&gt;

Before running cbcopy, verify that the file &lt;span class="sb"&gt;`/usr/local/greenplum-db/bin/gpcopy_helper`&lt;/span&gt; exists on each Greenplum 6 node.

If it does not exist, copy the cbcopy_helper binary from the Cloudberry cluster to the Greenplum nodes as follows:

&lt;span class="gs"&gt;**On the Cloudberry coordinator node (192.168.194.137):**&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
su - root&lt;br&gt;
scp /usr/local/cloudberry-db/bin/cbcopy_helper 192.168.194.55:/usr/local/greenplum-db/bin/&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**On the Greenplum 6 master node (192.168.194.55):**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
su - root&lt;br&gt;
source /usr/local/greenplum-db/greenplum_path.sh&lt;br&gt;
cd /usr/local/greenplum-db/bin&lt;br&gt;
gpscp -f /home/gpadmin/hostfile_all cbcopy_helper =:$PWD/&lt;br&gt;
&lt;/p&gt;
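&lt;p&gt;Before moving on, it is worth confirming that the helper binary actually landed on every host. The loop below is illustrative, not part of cbcopy; &lt;code&gt;REMOTE_CMD&lt;/code&gt; is a hypothetical hook that defaults to &lt;code&gt;ssh&lt;/code&gt; so the check can also be dry-run locally.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Illustrative check that cbcopy_helper is present and executable on each
# host listed in a hostfile. REMOTE_CMD is a hypothetical override hook.
REMOTE_CMD=${REMOTE_CMD:-ssh}

check_helper() {
  hostfile=$1
  path=$2
  for host in $(cat "$hostfile"); do
    if "$REMOTE_CMD" "$host" test -x "$path"; then
      echo "$host: OK"
    else
      echo "$host: MISSING"
    fi
  done
}

# Usage on the Greenplum 6 master:
#   check_helper /home/gpadmin/hostfile_all /usr/local/greenplum-db/bin/cbcopy_helper
```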

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
---

## 3. Data Migration

Execute the cbcopy command from the Apache Cloudberry coordinator node to migrate data. cbcopy supports full cluster, database-level, schema-level, or table-level migration.

Migration logs are saved under `/home/gpadmin/gpAdminLogs` on the execution node.

### 3.1 Full Cluster Migration

Migrate all databases (in this case, dw and warehouse) from the Greenplum 6 cluster to the Cloudberry cluster:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --full --truncate --compression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.2 Database-Level Migration

Migrate only the warehouse database:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --dbname="warehouse" --truncate --compression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.3 Schema-Level Migration

Migrate the sh1 schema in the warehouse database:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --truncate --compression --schema=warehouse.sh1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.4 Table-Level Migration

Migrate the warehouse.public.cancel_accounts table:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
export PGPASSWORD=gpadmin&lt;br&gt;
cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin \&lt;br&gt;
       --dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin \&lt;br&gt;
       --truncate --compression --include-table="warehouse.public.cancel_accounts"&lt;br&gt;
&lt;/p&gt;
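&lt;p&gt;The four invocations above differ only in the flag that sets the migration scope. A small wrapper makes that symmetry explicit; &lt;code&gt;build_cbcopy_cmd&lt;/code&gt; is a hypothetical convenience function, and the connection flags are the ones used in the examples above.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: assembles the cbcopy command line used above,
# varying only the scope flag (--full, --dbname, --schema, --include-table).
build_cbcopy_cmd() {
  scope_flag=$1
  echo "cbcopy --source-host=192.168.194.55 --source-port=5432 --source-user=gpadmin" \
       "--dest-host=192.168.194.137 --dest-port=5432 --dest-user=gpadmin" \
       "--truncate --compression $scope_flag"
}

build_cbcopy_cmd --full
build_cbcopy_cmd --schema=warehouse.sh1
```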

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 3.5 Common Parameters

| Parameter | Description |
| :--- | :--- |
| `--source-host` | Hostname or IP address of the source database master |
| `--source-port` | Port number of the source master |
| `--source-user` | User ID for connecting to the source database |
| `--dest-host` | Hostname or IP address of the target coordinator |
| `--dest-port` | Port number of the target coordinator |
| `--dest-user` | User ID for connecting to the target database |
| `--full` | Migrate the entire cluster; cannot be used with `--dbname`, `--include-table`, etc. |
| `--dbname` | Comma-separated list of source databases to copy |
| `--schema` | Comma-separated list of schemas to copy (format: `database.schema`) |
| `--include-table` | Comma-separated list of tables to copy (format: `database.schema.table`) |
| `--metadata-only` | Copy only metadata (DDL) without data |
| `--data-only` | Copy only data, excluding metadata |
| `--on-segment-threshold` | Row-count threshold for enabling segment-level parallel copy (default: 1,000,000) |
| `--truncate` | Truncate existing target tables before copy |
| `--append` | Append data to existing target tables |
| `--copy-jobs` | Number of parallel copy processes (default: 4) |
| `--compression` | Enable compression during data transfer |


## 4. Post-Migration Validation

After migration, compare the warehouse database objects (tables, indexes, views, sequences, and functions) between the Greenplum 6 source and the Apache Cloudberry target to verify data and metadata integrity.

### Object Validation (Tables, Indexes, Views, Sequences)

**1. Source: Greenplum 6**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  c.relname as "Name",&lt;br&gt;
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",&lt;br&gt;
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner", CASE c.relstorage WHEN 'h' THEN 'heap' WHEN 'x' THEN 'external' WHEN 'a' THEN 'append only' WHEN 'v' THEN 'none' WHEN 'c' THEN 'append only columnar' WHEN 'p' THEN 'Apache Parquet' WHEN 'f' THEN 'foreign' END as "Storage"&lt;br&gt;
FROM pg_catalog.pg_class c&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace&lt;br&gt;
WHERE c.relkind IN ('r','v','m','S','f','')&lt;br&gt;
AND c.relstorage IN ('h', 'a', 'c','x','f','v','')&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
      AND n.nspname !~ '^pg_toast'&lt;br&gt;
  AND pg_catalog.pg_table_is_visible(c.oid)&lt;br&gt;
ORDER BY 1,2;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**2. Target: Apache Cloudberry**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  c.relname as "Name",&lt;br&gt;
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'd' THEN 'directory table' WHEN 'v' THEN 'view' WHEN 'm' THEN CASE c.relisdynamic WHEN true THEN 'dynamic table' ELSE 'materialized view' END WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",&lt;br&gt;
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",&lt;br&gt;
  CASE a.amname WHEN 'ao_column' THEN 'append only columnar' WHEN 'ao_row' THEN 'append only' ELSE a.amname END as "Storage"&lt;br&gt;
FROM pg_catalog.pg_class c&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace&lt;br&gt;
     LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam&lt;br&gt;
     LEFT JOIN pg_catalog.pg_am a ON a.oid = c.relam&lt;br&gt;
WHERE c.relkind IN ('r','p','v','m','S','f','')&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname !~ '^pg_toast'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
      AND n.nspname !~ '^pg_toast'&lt;br&gt;
  AND pg_catalog.pg_table_is_visible(c.oid)&lt;br&gt;
ORDER BY 1,2;&lt;br&gt;
&lt;/p&gt;
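&lt;p&gt;Eyeballing the two result sets works for a handful of objects; for larger schemas the comparison is easy to script. Assuming each catalog query is exported to a plain-text file with one object per line (for example via &lt;code&gt;psql -At&lt;/code&gt;), a sorted &lt;code&gt;comm -3&lt;/code&gt; reports anything present on only one side. The file names here are illustrative.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Diff two exported object lists (one object per line). Prints lines that
# appear on only one side; empty output means source and target match.
compare_objects() {
  sort "$1" > /tmp/src_sorted.txt
  sort "$2" > /tmp/dst_sorted.txt
  # comm -3 suppresses the column of lines common to both files.
  comm -3 /tmp/src_sorted.txt /tmp/dst_sorted.txt
}

# Illustrative data: identical lists produce no output.
printf 'public.accounts|table\npublic.orders|table\n' > /tmp/src_objects.txt
printf 'public.accounts|table\npublic.orders|table\n' > /tmp/dst_objects.txt
compare_objects /tmp/src_objects.txt /tmp/dst_objects.txt
```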

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### User-Defined Function Validation

**1. Source: Greenplum 6**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  p.proname as "Name",&lt;br&gt;
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",&lt;br&gt;
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",&lt;br&gt;
 CASE&lt;br&gt;
  WHEN p.proisagg THEN 'agg'&lt;br&gt;
  WHEN p.proiswindow THEN 'window'&lt;br&gt;
  WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'&lt;br&gt;
  ELSE 'func'&lt;br&gt;
 END as "Type"&lt;br&gt;
FROM pg_catalog.pg_proc p&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace&lt;br&gt;
WHERE pg_catalog.pg_function_is_visible(p.oid)&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
ORDER BY 1, 2, 4;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**2. Target: Apache Cloudberry**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT n.nspname as "Schema",&lt;br&gt;
  p.proname as "Name",&lt;br&gt;
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",&lt;br&gt;
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",&lt;br&gt;
 CASE p.prokind&lt;br&gt;
  WHEN 'a' THEN 'agg'&lt;br&gt;
  WHEN 'w' THEN 'window'&lt;br&gt;
  WHEN 'p' THEN 'proc'&lt;br&gt;
  ELSE 'func'&lt;br&gt;
 END as "Type"&lt;br&gt;
FROM pg_catalog.pg_proc p&lt;br&gt;
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace&lt;br&gt;
WHERE pg_catalog.pg_function_is_visible(p.oid)&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'pg_catalog'&lt;br&gt;
      AND n.nspname &amp;lt;&amp;gt; 'information_schema'&lt;br&gt;
ORDER BY 1, 2, 4;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### cbcopy Parameters Reference

(Refer to the cbcopy parameter documentation and examples for complete usage and configuration guidance.)



## Welcome to Apache Cloudberry:

- **Visit the website:** https://cloudberry.apache.org
- **Follow us on GitHub:** https://github.com/apache/cloudberry
- **Join Slack workspace:** https://apache-cloudberry.slack.com
- **Dev mailing list:**
  - To subscribe to dev mailing list: Send an email to `dev-subscribe@cloudberry.apache.org`
  - To browse past dev mailing list discussions: https://lists.apache.org/list.html?dev@cloudberry.apache.org

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>postgres</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Unlocking the Power of Dynamic Tables: A Thanksgiving Transformation</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Fri, 28 Nov 2025 07:41:50 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/unlocking-the-power-of-dynamic-tables-a-thanksgiving-transformation-5l1</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/unlocking-the-power-of-dynamic-tables-a-thanksgiving-transformation-5l1</guid>
      <description>&lt;p&gt;Apache Cloudberry is an advanced and mature open-source Massively Parallel Processing (MPP) database, derived from the open-source version of the Pivotal Greenplum Database® but built on a more modern PostgreSQL kernel and with more advanced enterprise capabilities. Cloudberry can serve as a data warehouse and can also be used for large-scale analytics and AI/ML workloads.&lt;/p&gt;

&lt;p&gt;As Thanksgiving approached, Harvest Analytics was gearing up for its biggest sales event of the year. With customers eager to secure the best deals, the team knew they needed real-time visibility into their sales data to drive informed decisions. But they faced a critical challenge: how could they efficiently query streaming data from Kafka while still getting instant access to key metrics?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Harvest Analytics relied heavily on a lakehouse architecture, using kafka_fdw to pull streaming data into their Apache Cloudberry database. However, queries on the external data were often sluggish, hampering their ability to respond quickly during peak sales periods. The team needed a solution that could deliver fast, auto-refreshing access to this critical information.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;The Discovery of Dynamic Tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One day, while discussing their challenges over coffee, Chief Data Officer Lisa recalled a powerful new feature in Cloudberry: Dynamic Tables. These auto-refreshing materialized views could pull data from base tables, external tables, and even other dynamic tables, automatically optimizing query performance.&lt;/p&gt;

&lt;p&gt;Excited by the potential, Lisa gathered her team to explore how Dynamic Tables could revolutionize their data access. With the ability to automatically rewrite user SQL queries to utilize these dynamic tables, the team saw a glimmer of hope. They were particularly drawn to the declarative programming aspect of Dynamic Tables, allowing them to define their pipeline outcomes using straightforward SQL without worrying about the intricacies of the steps involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With a sense of urgency, the Harvest Analytics team quickly set up Dynamic Tables to ingest their Kafka data. They configured the tables to refresh every minute, ensuring that the latest sales data was always available for analysis. The key advantage was how these dynamic tables seamlessly integrated with their existing infrastructure, effectively bridging the gap between external lakehouse data and internal analytics.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE DYNAMIC TABLE dynamic_table_orders SCHEDULE '5 * * * *' AS
SELECT COUNT(*) AS a FROM foreign_table_orders WHERE amount &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As Thanksgiving Day arrived, the transformation was evident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Real-Time Insights: Thanks to Dynamic Tables, the team could now perform continuous queries on their Kafka data, aggregating sales metrics every minute. They could visualize total sales in real-time, empowering them to adjust marketing strategies on the fly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic SQL Rewriting: When team members queried external data, Cloudberry automatically recognized qualifying SQL and rewrote it to utilize the Dynamic Tables. This meant that users could focus on their analysis without worrying about the underlying complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Speed and Efficiency: The performance boost was staggering. Queries that once took minutes now returned results in seconds, allowing Harvest Analytics to react swiftly to customer behavior and maximize sales opportunities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seamless Integration: The implementation of Dynamic Tables was smooth and required minimal changes to their existing workflows. The team could continue using their familiar tools while benefiting from the advanced capabilities of Cloudberry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplified Pipeline Management: The declarative nature of Dynamic Tables reduced the complexity of their data workflows, allowing the team to focus on outcomes rather than technical details. This simplification meant that even complex data operations became manageable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexible Data Pipelines: With transparent orchestration, the team could easily construct pipelines tailored to their needs, ensuring that data was always up-to-date and ready for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Impact&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the day unfolded, Harvest Analytics experienced record-breaking sales. Their ability to monitor and respond to trends in real-time transformed their Thanksgiving event into a resounding success. The team celebrated not only their sales achievements but also the newfound power of Dynamic Tables.&lt;/p&gt;

&lt;p&gt;Encouraged by their success, Harvest Analytics shared their story with other companies in the industry. They highlighted how Cloudberry’s Dynamic Tables had changed the game, allowing them to run queries on external Kafka data as swiftly as if it were internal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Call to Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your organization is grappling with the same challenges as Harvest Analytics, it’s time to unlock the potential of Dynamic Tables in Apache Cloudberry. Experience the benefits of auto-refreshing materialized views, seamless SQL rewriting, and lightning-fast queries on lakehouse data. &lt;/p&gt;

&lt;p&gt;Join the growing community of Cloudberry users who are transforming their data strategies and driving success in their businesses. Don’t let external data slow you down—embrace Dynamic Tables and watch your insights flourish this Thanksgiving and beyond!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Welcome to Apache Cloudberry:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit the website: &lt;a href="https://cloudberry.apache.org" rel="noopener noreferrer"&gt;https://cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on GitHub: &lt;a href="https://github.com/apache/cloudberry" rel="noopener noreferrer"&gt;https://github.com/apache/cloudberry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join Slack workspace: &lt;a href="https://apache-cloudberry.slack.com" rel="noopener noreferrer"&gt;https://apache-cloudberry.slack.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev mailing list:

&lt;ul&gt;
&lt;li&gt;To subscribe to dev mailing list: Send an email to &lt;a href="mailto:dev-subscribe@cloudberry.apache.org"&gt;dev-subscribe@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To browse past dev mailing list discussions: &lt;a href="https://lists.apache.org/list.html?dev@cloudberry.apache.org" rel="noopener noreferrer"&gt;https://lists.apache.org/list.html?dev@cloudberry.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Comparing Full-Text Search Options on Cloudberry: ParadeDB BM25 vs GIN vs ZomboDB</title>
      <dc:creator>Maya S.</dc:creator>
      <pubDate>Fri, 14 Nov 2025 09:10:08 +0000</pubDate>
      <link>https://dev.to/maya_sun_29e7bf629e5dd7b3/ji-yu-cloudberry-de-quan-wen-jian-suo-fang-an-dui-bi-paradedb-bm25-vs-gin-vs-zombodb-3jfk</link>
      <guid>https://dev.to/maya_sun_29e7bf629e5dd7b3/ji-yu-cloudberry-de-quan-wen-jian-suo-fang-an-dui-bi-paradedb-bm25-vs-gin-vs-zombodb-3jfk</guid>
      <description></description>
    </item>
  </channel>
</rss>
