DEV Community

Maya S.

Apache Cloudberry 2.0: Rebuilding Storage for the Cloud-Native Era with PAX

Rethinking AOCS: When Architecture Meets a New Infrastructure Reality

From a Solid Design to a Structural Mismatch

The AO/AOCS storage engine, inherited from Greenplum, was originally built for on-premises environments. Its design—column-per-file with append-only writes—worked well on block storage and traditional file systems, delivering stable performance for OLAP workloads.
But the infrastructure landscape has changed.
As storage shifts toward cloud-native object storage, the assumptions behind AOCS no longer hold. Object storage favors large, sequential I/O and request aggregation, while AOCS relies on independent column files and frequent small appends. The result is not just inefficiency—it is a structural mismatch.
In real-world workloads, this manifests as:

  • Exploding request counts when scanning wide tables (one request per column per file)
  • Severe request amplification due to unmerged small writes
  • Degraded sequential read performance caused by fragmented column layouts

At the same time, tight kernel coupling and limited thread-safety make it difficult to fully leverage multi-threading and vectorized execution. What used to be a reasonable design has now become a constraint—not just on performance, but on the system’s ability to evolve.

Why Incremental Fixes Were Not Enough

Extensive stress testing revealed a clear pattern: the bottleneck was not localized—it was systemic. Tuning parameters, improving caches, or adding execution-layer optimizations helped, but only marginally. The core issue remained: the storage model itself was not aligned with the cloud environment. Continuing to patch AOCS would only introduce more layers of complexity and technical debt.

The conclusion was straightforward: instead of adapting a legacy design, Cloudberry needed a storage engine built for object storage from the ground up. This led to the introduction of PAX.

PAX: A Storage Model Designed for the Cloud

PAX is not just a replacement for AOCS. It is a redefinition of how storage should work in a cloud-native data warehouse—balancing analytical performance, transactional needs, and long-term evolvability.
A New Paradigm: Row–Column Co-existence
Traditional database systems force a trade-off:

  • Row storage → optimized for transactions
  • Column storage → optimized for analytics

PAX removes this dichotomy. Within the same physical file and logical block, PAX organizes data in a columnar layout while preserving row-level access semantics. This hybrid design enables:

  • Efficient analytical scans by reading only required columns
  • Merged multi-column writes to reduce small-file pressure on object storage
  • Shared file structures across columns, significantly reducing request overhead

The result is a storage model that performs consistently across mixed OLTP + OLAP workloads, which are increasingly common in modern data platforms.
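The row–column co-existence idea can be sketched in a few lines. The snippet below is a hypothetical, simplified model (the group size, function names, and dict-based storage are illustrative, not PAX's actual on-disk format): rows are split into fixed-size groups, and within each group values are stored column-by-column, so an analytical scan touches only the requested column while row-level access stays a simple group-plus-offset lookup.

```python
# Conceptual sketch of a PAX-style hybrid layout (illustrative only).
# Rows are partitioned into fixed-size groups; each group stores its
# values column-by-column, giving columnar scans AND cheap row access.

GROUP_SIZE = 4  # hypothetical group size, chosen for the example

def to_pax(rows, columns):
    """Partition rows into groups, then pivot each group to columnar form."""
    groups = []
    for start in range(0, len(rows), GROUP_SIZE):
        chunk = rows[start:start + GROUP_SIZE]
        groups.append({c: [r[c] for r in chunk] for c in columns})
    return groups

def scan_column(groups, column):
    """Analytical read path: touch only the requested column."""
    for g in groups:
        yield from g[column]

def get_row(groups, row_id, columns):
    """Row-level read path: one group, one offset per column."""
    g, off = groups[row_id // GROUP_SIZE], row_id % GROUP_SIZE
    return {c: g[c][off] for c in columns}

rows = [{"id": i, "age": 20 + i} for i in range(6)]
groups = to_pax(rows, ["id", "age"])
assert list(scan_column(groups, "age")) == [20, 21, 22, 23, 24, 25]
assert get_row(groups, 5, ["id", "age"]) == {"id": 5, "age": 25}
```

Because every group lives inside one shared file structure, a wide-table scan issues one request per group rather than one per column per file—the request-amplification problem described earlier.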

A Layered Architecture Built for Evolution

PAX adopts a strictly layered design to ensure modularity and long-term extensibility:

  • Access Handler Layer: integrates with Cloudberry’s Access Method (AM), handling transactions and lifecycle management.
  • Table Layer: bridges execution engines and storage, supporting both row-based and vectorized execution.
  • MicroPartition Layer: manages physical data organization (files and stripes), including statistics and pruning logic.
  • Column Layer: defines in-memory column structures, handling encoding, decoding, and alignment.
  • File Layer: encapsulates storage interactions, including data files, metadata, and visibility maps.

This separation of concerns allows PAX to evolve independently at each layer, paving the way for features like multi-threaded execution and distributed transactions.

Metadata Management: A Lightweight Control Plane for Storage

PAX adopts a lightweight yet effective metadata management strategy based on auxiliary tables built on the Heap Access Method (Heap AM).

Each physical data file corresponds to a single record in the auxiliary table. This mapping provides a consistent control plane for storage, enabling the engine to:

  • Quickly locate data files
  • Track file lifecycle changes
  • Evaluate transactional visibility efficiently

The auxiliary table maintains essential metadata such as file identifiers, states, and visibility-related attributes, ensuring that storage operations remain both predictable and low-overhead.

In addition, PAX maintains a global fast sequence table to generate unique BLOCKNAMEs, guaranteeing globally unique file naming across nodes and transactions. More importantly, this mechanism serves as the foundation for associating Visimap files with their corresponding data files, ensuring correctness and consistency in distributed visibility control.
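The interplay between the fast sequence and the auxiliary table can be illustrated with a toy model. Everything below is an assumption for illustration—the record fields, the `seg<N>_block<M>` naming scheme, and the in-memory dict all stand in for the real Heap-AM-backed tables:

```python
# Hypothetical sketch of the auxiliary-table idea: one record per data
# file, plus a monotonically increasing "fast sequence" that yields
# unique BLOCKNAMEs. Names and fields here are illustrative, not PAX's.
import itertools

_fast_seq = itertools.count(1)  # stand-in for the global fast sequence table

def new_blockname(segment_id):
    # unique because the sequence never repeats a value
    return f"seg{segment_id}_block{next(_fast_seq)}"

aux_table = {}  # blockname -> metadata record (stand-in for the Heap AM table)

def register_file(segment_id, tuple_count):
    name = new_blockname(segment_id)
    aux_table[name] = {
        "state": "live",
        "tuples": tuple_count,
        "visimap": None,  # linked to a Visimap file once rows are deleted
    }
    return name

a = register_file(0, 1000)
b = register_file(0, 500)
assert a != b                          # globally unique naming
assert aux_table[a]["state"] == "live" # one record per physical file
```

The one-record-per-file mapping is what keeps the control plane cheap: locating a file, checking its state, or attaching a Visimap is a single metadata lookup rather than a scan.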

Rethinking MVCC for Object Storage
Traditional MVCC in PostgreSQL relies on row-level versioning. In object storage, this approach becomes prohibitively expensive due to excessive I/O and metadata operations.
PAX introduces a file-level visibility model.
Instead of tracking visibility per row, PAX uses Visimap files (`.visimap`) to represent visibility at the file level.
This enables:

  • Lock-free concurrent reads
  • Minimal metadata overhead
  • Efficient visibility checks at read time

It’s a fundamental shift that aligns concurrency control with the realities of object storage.
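A Visimap can be pictured as a per-file bitmap. The sketch below assumes one bit per row, with a set bit meaning "deleted"—a plausible but simplified reading of the mechanism, not PAX's actual encoding:

```python
# Minimal sketch of file-level visibility via a per-file bitmap.
# Assumed semantics (illustrative): bit i set => row i was deleted.

def make_visimap(num_rows):
    return bytearray((num_rows + 7) // 8)  # one bit per row

def mark_deleted(visimap, row):
    visimap[row // 8] |= 1 << (row % 8)

def is_visible(visimap, row):
    return not (visimap[row // 8] >> (row % 8)) & 1

vm = make_visimap(10)
mark_deleted(vm, 3)
assert not is_visible(vm, 3)
assert is_visible(vm, 4)
```

A reader checks one bit instead of consulting per-row version chains, and writers only ever replace whole Visimap files—which is exactly the immutable, whole-object access pattern object storage rewards.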

PORC_VEC: When Storage Becomes Execution

One of the most impactful innovations in PAX is PORC_VEC (PostgreSQL ORC Vectorized).
In traditional systems, data must be transformed into a vectorized format before execution—incurring CPU and memory overhead. PORC_VEC eliminates this step entirely.
Key characteristics:

  • Zero-copy reads: data is consumed directly by the execution engine
  • Cache-aligned layout: optimized for modern CPU architectures
  • Unified metadata model: aligned with in-memory column structures

This leads to a powerful principle: the storage format is the execution format.

In internal tests, PORC_VEC reduces CPU usage by ~20% and improves query throughput by 15–25%.
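The "storage format is the execution format" principle can be demonstrated in miniature with a typed view over raw bytes. This is only an analogy—PORC_VEC's actual layout is an ORC-derived C++ format—but it shows the zero-copy idea: the buffer read from storage is interpreted directly, with no deserialization pass and no second copy in memory:

```python
# Zero-copy in miniature: interpret fetched bytes directly as the
# execution-side column, without a deserialization/copy step.
# (Illustrative analogy only; the byte layout here is hypothetical.)
import struct

# pretend this buffer arrived straight from object storage:
# four native-endian 32-bit integers
raw = struct.pack("=4i", 10, 20, 30, 40)

col = memoryview(raw).cast("i")  # typed view over the SAME bytes
assert list(col) == [10, 20, 30, 40]
assert col.obj is raw            # no copy was made
```

When the on-disk layout and the in-memory layout agree byte-for-byte, the transform step disappears entirely—which is where the CPU savings come from.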

Column Layer: Bridging Storage and Execution

The Column layer serves as the core in-memory abstraction for columnar data in PAX, bridging persistent storage and the execution engine.

It is responsible for both data representation and transformation, with a design centered on efficiency, flexibility, and alignment with vectorized execution:

  • Disk-to-Memory Mapping: the Column layer loads column data from disk into memory and flushes in-memory data back to storage during write operations.
  • Format Transformation: it performs efficient format conversion along read and write paths, ensuring consistency between on-disk and in-memory layouts while minimizing overhead.
  • Encoding and Compression: multiple techniques—such as RLEv2, dictionary encoding, and ZSTD—are integrated to reduce storage footprint without sacrificing query performance.
  • Flexible Access Interfaces:
    • Row-level interfaces for transactional workloads
    • Batch-oriented interfaces for analytical and vectorized execution
  • Memory Alignment and Complex Type Optimization:
    • Memory layout follows CPU cache alignment principles to improve access efficiency
    • Complex types (e.g., arrays and range types) adopt independent alignment and offset control to reduce parsing overhead

With these design choices, the Column layer balances performance, memory efficiency, and concurrency scalability, while providing a solid foundation for vectorized execution and parallel scanning.
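To make the encoding step concrete, here is plain run-length encoding in its simplest form. Note the hedge: PAX uses RLEv2, the considerably more elaborate ORC-style variant with delta and patched-base sub-encodings; this sketch only conveys the core idea that runs of repeated values collapse to (value, count) pairs:

```python
# Simplest-possible run-length encoding, to illustrate the idea behind
# RLEv2 (the real ORC-style RLEv2 adds delta and bit-packing modes).

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

col = [7, 7, 7, 7, 2, 2, 9]
assert rle_encode(col) == [[7, 4], [2, 2], [9, 1]]
assert rle_decode(rle_encode(col)) == col  # lossless round-trip
```

Columnar layouts make such encodings effective because values within one column are far more self-similar than values within one row—low-cardinality columns collapse dramatically.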

Performance Foundations: Key Mechanisms

PAX’s performance gains are not accidental—they are the result of deliberate architectural choices.

  1. Sparse Filtering
     By maintaining min/max statistics and Bloom Filters at file and stripe levels, PAX can aggressively prune irrelevant data.
     Example: a query like WHERE age < 18 skips entire data blocks where min(age) > 18.
     This reduces I/O requests by over 60% on average, bringing object storage performance closer to in-memory systems.

  2. Intelligent Physical Layout (Cluster)
     PAX aligns physical data layout with query patterns through automatic clustering:
     • Z-Order → optimized for multi-dimensional range queries
     • Lexical Order → optimized for multi-column filtering
     This improves data locality and significantly reduces random I/O.

  3. Modern Memory Management
     PAX evolved through three stages of memory management, ultimately adopting:
     • Smart pointers (unique_ptr, shared_ptr)
     • Thread-aware resource management
     This ensures:
     • No memory leaks under high concurrency
     • Safe cleanup during early exits or failures
     • Stable behavior in multi-threaded vectorized execution
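The sparse-filtering mechanism above is easy to model. The sketch below is illustrative (in a real engine the min/max stats are persisted alongside each block, not recomputed per scan as they are here for brevity), and it shows why a `WHERE age < 18` predicate never needs to open a block whose minimum is already 18 or more:

```python
# Sketch of sparse filtering: per-block min/max stats let a scan skip
# blocks that cannot possibly contain matching rows.

def block_stats(block):
    # in practice these would be stored with the block, not recomputed
    return {"min": min(block), "max": max(block)}

def scan_less_than(blocks, upper_bound):
    """Return values < upper_bound, plus how many blocks were read."""
    blocks_read, hits = 0, []
    for block in blocks:
        if block_stats(block)["min"] >= upper_bound:
            continue  # pruned: every value in this block is too large
        blocks_read += 1
        hits.extend(v for v in block if v < upper_bound)
    return hits, blocks_read

age_blocks = [[25, 30, 41], [12, 19, 22], [55, 60, 70]]
hits, blocks_read = scan_less_than(age_blocks, 18)
assert hits == [12]
assert blocks_read == 1  # two of three blocks were never touched
```

On object storage each skipped block is a skipped network request, which is why stats-based pruning pays off far more than it does on local disks.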

Benchmark Results: Quantifying the Gains

In 1TB TPC-H and TPC-DS benchmarks:

  • Average performance improvement: 15%–25%
  • Complex queries (joins, aggregations): up to 40% faster

These gains come from:
  • Reduced I/O amplification
  • Lower CPU overhead via zero-copy execution
  • More stable latency under complex workloads

Closing Thoughts: Engineering for the Real World

PAX reflects a deliberate shift in engineering philosophy:
Not optimizing around constraints—but removing them.
By aligning storage design with object storage characteristics, and tightly integrating execution with data format, PAX establishes a foundation that is both high-performance and future-proof.
Looking ahead, Cloudberry will continue to evolve PAX.
