Apache Cloudberry is an advanced and mature open-source Massively Parallel Processing (MPP) database, derived from the open-source version of the Pivotal Greenplum Database® but built on a more modern PostgreSQL kernel and with more advanced enterprise capabilities. Cloudberry can serve as a data warehouse and can also be used for large-scale analytics and AI/ML workloads.
In today’s data-driven landscape, “real-time” capabilities have become a business imperative. Every company wants to detect changes instantly and respond to user needs as they happen. Streaming engines such as Apache Flink, with their powerful capabilities and ultra-low latency, have set a compelling vision for what real-time data processing can achieve.
Yet the reality is often far more complicated. For many organizations — especially those without large, specialized engineering teams — building and maintaining a Flink-based, stream-batch unified platform can be both powerful and painful. You gain real-time insights, but only by accepting significant architectural complexity and operational overhead.
Is there a simpler, more elegant path to stream-batch unification?
Yes — and it has become increasingly practical.
With the rise of modern database technologies, solutions such as Incremental Materialized Views (IVM) in Apache Cloudberry are emerging as a cleaner, lighter alternative: in-database stream processing.
The “Heavyweight” Approach: The Power and Pain of Flink
A Flink-centered architecture is undoubtedly powerful, but it also comes with several burdens:
- Complex architecture and high operational costs: A typical pipeline stitches together many components: applications, MySQL, CDC tools, Kafka, Flink, and a data warehouse or data lake. Each component requires specialized expertise, and a failure in any part can break the entire chain.
- High development overhead: In the classic Lambda architecture, teams must maintain two separate codebases, one for streaming (Flink) and one for batch (Spark or Hive). That means double the logic, double the testing, and a persistent risk of inconsistency.
- Steep learning curve: Mastering Flink is non-trivial. State management, time semantics, watermarks, windowing, and performance tuning demand deep expertise and continuous operational effort that many teams cannot afford.
Simplifying Stream-Batch Processing Inside the Database
Cloudberry takes a bold yet simple approach:
Why not let the database itself handle streaming computation?
This is the essence of in-database stream-batch unification, powered by Incremental Materialized Views (IVM).
An IVM functions as a “live” materialized result that automatically stays up to date.
- Batch phase: When you run a CREATE INCREMENTAL MATERIALIZED VIEW command, Cloudberry performs a full historical computation to build the initial view — the batch layer.
- Stream phase: Subsequent INSERT, UPDATE, and DELETE operations on the source tables are captured automatically. The engine computes only the incremental changes and updates the view in near real time, typically within milliseconds to seconds.

This fundamentally simplifies what used to be a complex and error-prone workflow. Previously, teams had to define Kafka message schemas and Flink-specific data structures, and write large amounts of Flink SQL (covering data sources, windows, aggregations, dimension joins, and output tables) just to complete a single task. For example:

```json
// Kafka data structure
{
  "sales_id": 8435,
  "event_type": "+I",
  "event_time": "2025-06-27 07:53:21Z",
  "ticket_number": 8619628,
  "item_sk": 6687,
  "customer_sk": 69684,
  "store_sk": 238,
  "quantity": 6,
  "sales_price": 179.85,
  "ext_sales_price": 1079.1,
  "net_profit": 672,
  "event_source": "CDC-TO-KAFKA-FIXED"
}
```

Before Flink can process streaming data, the data must be persisted to ensure correctness and support replay in case of failures. Therefore, the CDC → Kafka → Flink chain always introduces additional transformation, configuration, and operational complexity. The following Flink SQL illustrates only the streaming computation portion; the full pipeline requires even more components and code:

```sql
-- Create the TPC-DS store performance aggregation result output table (output to console)
CREATE TABLE store_daily_performance (
    window_start       TIMESTAMP(3),
    window_end         TIMESTAMP(3),
    s_store_sk         INT,
    s_store_name       STRING,
    s_state            STRING,
    s_market_manager   STRING,
    sale_date          STRING,
    total_sales_amount DECIMAL(10,2),
    total_net_profit   DECIMAL(10,2),
    total_items_sold   BIGINT,
    transaction_count  BIGINT,
    avg_sales_price    DECIMAL(7,2),
    process_time       TIMESTAMP_LTZ(3)
) WITH (
    'connector' = 'print',
    'print-identifier' = 'TPCDS-STORE-PERFORMANCE'
);

-- Core aggregation query
INSERT INTO store_daily_performance
SELECT
    window_start,
    window_end,
    s.ss_store_sk,
    COALESCE(sd.s_store_name, CONCAT('Store #', CAST(s.ss_store_sk AS STRING))) AS s_store_name,
    COALESCE(sd.s_state, 'Unknown') AS s_state,
    COALESCE(sd.s_market_manager, 'Unknown Manager') AS s_market_manager,
    DATE_FORMAT(window_start, 'yyyy-MM-dd') AS sale_date,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_ext_sales_price
             WHEN s.event_type = '-D' THEN -s.ss_ext_sales_price
             ELSE 0 END) AS total_sales_amount,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_net_profit
             WHEN s.event_type = '-D' THEN -s.ss_net_profit
             ELSE 0 END) AS total_net_profit,
    SUM(CASE WHEN s.event_type = '+I' THEN s.ss_quantity
             WHEN s.event_type = '-D' THEN -s.ss_quantity
             ELSE 0 END) AS total_items_sold,
    COUNT(DISTINCT s.ss_ticket_number) AS transaction_count,
    AVG(s.ss_sales_price) AS avg_sales_price,
    LOCALTIMESTAMP AS process_time
FROM TABLE(
    TUMBLE(TABLE sales_events_source, DESCRIPTOR(event_time), INTERVAL '1' MINUTE)
) s
LEFT JOIN store_dim sd ON s.ss_store_sk = sd.s_store_sk
WHERE s.event_type IN ('+I', '-D', 'U')
GROUP BY
    window_start,
    window_end,
    s.ss_store_sk,
    sd.s_store_name,
    sd.s_state,
    sd.s_market_manager;
```
By contrast, Cloudberry IVM can express the same task in a single SQL statement:
```sql
CREATE INCREMENTAL MATERIALIZED VIEW tpcds.store_daily_performance_enriched_ivm
AS
SELECT
    ss.ss_store_sk              AS store,
    s.s_store_name              AS store_name,
    s.s_state                   AS state,
    s.s_market_manager          AS manager,
    d.d_date                    AS sold_date,
    SUM(ss.ss_net_paid_inc_tax) AS total_sales_amount,
    SUM(ss.ss_net_profit)       AS total_net_profit,
    SUM(ss.ss_quantity)         AS total_items_sold,
    COUNT(ss.ss_ticket_number)  AS transaction_count
FROM
    tpcds.store_sales_heap ss
JOIN
    tpcds.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN
    tpcds.store s ON ss.ss_store_sk = s.s_store_sk
GROUP BY
    ss.ss_store_sk,
    s.s_store_name,
    s.s_state,
    s.s_market_manager,
    d.d_date
-- DISTRIBUTED BY references the view's output column name ("store",
-- the alias for ss.ss_store_sk), not the source column name.
DISTRIBUTED BY (store);
```
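To see the stream phase in action, you can write to the source table and read the view back immediately. Here is a minimal sketch, assuming the TPC-DS tables from the example above are already loaded; the literal values are purely illustrative:

```sql
-- Batch phase: the view is fully populated at creation time.
SELECT * FROM tpcds.store_daily_performance_enriched_ivm LIMIT 5;

-- Stream phase: a new sale lands in the source table...
INSERT INTO tpcds.store_sales_heap
    (ss_sold_date_sk, ss_store_sk, ss_ticket_number,
     ss_quantity, ss_net_paid_inc_tax, ss_net_profit)
VALUES (2452642, 238, 8619628, 6, 1079.10, 672.00);

-- ...and the affected aggregate rows are already up to date,
-- with no manual REFRESH and no external streaming job.
SELECT * FROM tpcds.store_daily_performance_enriched_ivm
WHERE store = 238;
```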
All the complexity — state management, consistency handling, incremental computation, scheduling, and triggering — is transparently handled by the database engine. This eliminates the need to orchestrate numerous intermediate streaming jobs and significantly reduces development and operational costs.
The Perfect Pair: Incremental Materialized Views and Dynamic Tables
Cloudberry also provides another mechanism: Dynamic Tables.
While both are types of materialized views, they serve different purposes depending on latency requirements and workload characteristics.
In short:
- Choose IVM when you need low latency and immediate updates.
- Choose Dynamic Tables when you can tolerate some delay and need to handle large datasets efficiently.
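To make the contrast concrete, the sketch below defines a Dynamic Table that recomputes on a cron-style schedule instead of on every write. The table name and schedule are illustrative, and the exact SCHEDULE syntax may vary across Cloudberry versions:

```sql
-- A Dynamic Table refreshes on a schedule (here: every 5 minutes)
-- rather than per write, trading freshness for lower write overhead.
CREATE DYNAMIC TABLE tpcds.store_daily_performance_dt
SCHEDULE '*/5 * * * *'
AS
SELECT
    ss.ss_store_sk              AS store,
    d.d_date                    AS sold_date,
    SUM(ss.ss_net_paid_inc_tax) AS total_sales_amount,
    COUNT(ss.ss_ticket_number)  AS transaction_count
FROM tpcds.store_sales_heap ss
JOIN tpcds.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY ss.ss_store_sk, d.d_date
DISTRIBUTED BY (store);

-- An on-demand refresh is also possible between scheduled runs.
REFRESH DYNAMIC TABLE tpcds.store_daily_performance_dt;
```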
Practical Considerations: Performance and Limitations
No technology is perfect, and IVM is no exception.
- Performance overhead: Because IVMs update incrementally on every write, they add some transactional overhead to source tables — especially when multiple IVMs depend on the same table.
- Feature limitations: The current version of Cloudberry IVM does not yet support MIN, MAX, window functions, LEFT/OUTER JOIN, CTEs, or partitioned tables. These gaps are actively being addressed by the open-source community.
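Where a query needs one of these unsupported constructs today, one possible stopgap (a workaround suggestion, not a feature of IVM itself) is a regular materialized view refreshed on a schedule. A minimal sketch, assuming the same TPC-DS tables, using MIN/MAX precisely because IVM does not yet support them:

```sql
-- Regular (non-incremental) materialized view as a fallback for
-- aggregates IVM does not yet support, such as MIN and MAX.
CREATE MATERIALIZED VIEW tpcds.store_price_range AS
SELECT
    ss_store_sk,
    MIN(ss_sales_price) AS min_price,
    MAX(ss_sales_price) AS max_price
FROM tpcds.store_sales_heap
GROUP BY ss_store_sk
DISTRIBUTED BY (ss_store_sk);

-- Refresh periodically (for example, from cron or a scheduler)
-- to trade freshness for compatibility.
REFRESH MATERIALIZED VIEW tpcds.store_price_range;
```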
Conclusion
For top-tier internet companies, investing in large, Flink-based infrastructures makes sense — they can absorb the complexity in pursuit of maximum flexibility and performance.
But most organizations do not need heavyweight systems. They need a simple, reliable, and cost-effective way to gain real-time insights.
Cloudberry’s Incremental Materialized Views provide exactly that: a unified stream-batch processing model built directly into the database, powered by plain SQL, with consistency, simplicity, and efficiency in a single system.
This may well be the most practical path to bringing real-time data capabilities to every enterprise.
Welcome to Apache Cloudberry:
- Visit the website: https://cloudberry.apache.org
- Follow us on GitHub: https://github.com/apache/cloudberry
- Join Slack workspace: https://apache-cloudberry.slack.com
- Dev mailing list:
  - To subscribe: send an email to dev-subscribe@cloudberry.apache.org
  - To browse past discussions: https://lists.apache.org/list.html?dev@cloudberry.apache.org