This week the Apache lakehouse ecosystem continued building on Polaris's graduation milestone while pressing forward on format evolution. Iceberg's efficient column update proposal drew a dedicated community sync on March 4, the REST spec remote signing vote closed with strong support, and Parquet's ALP encoding spec moved closer to final review. The week's threads signal that AI/ML workloads are now a first-class design consideration across all four projects.
Apache Iceberg
Efficient Column Updates Sync — March 4
The most active thread this week was the ongoing discussion of efficient column updates for wide tables. Anurag Mantripragada organized a dedicated community sync on Wednesday, March 4, to advance the proposal. The problem: tables with thousands of columns, common in feature stores and ML pipelines, suffer severe write amplification when current Copy-on-Write and Merge-on-Read operations update even a small subset of columns. The proposed approach would write only updated columns to separate column files and stitch them at read time.
Gábor Kaszab suggested that full column updates and partial file-level column updates strike a reasonable balance between performance gains and implementation complexity. His proof-of-concept PR (#15445) shows that the metadata and API surface is manageable. The community is building on the V4 architecture foundation and the single-file commit proposal.
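The stitching idea above can be sketched in miniature. This is a hypothetical illustration of the read-time overlay concept, not Iceberg's actual file layout or API: a base "file" holds all columns, an update "file" holds only the rewritten columns, and the reader overlays the latter at read time, so a single-column update never rewrites the whole wide row group.

```python
# Hypothetical sketch of read-time column stitching (not Iceberg's actual
# file format): the base file holds every column, the update file holds
# only the columns that changed, and the reader overlays the update.

def read_stitched(base_columns, update_columns):
    """Overlay updated column files on top of the base file at read time."""
    stitched = dict(base_columns)    # start from the base columns
    stitched.update(update_columns)  # updated columns take precedence
    return stitched

# Base file: a wide table (three columns shown; imagine thousands).
base = {
    "user_id":   [1, 2, 3],
    "feature_a": [0.1, 0.2, 0.3],
    "feature_b": [10, 20, 30],
}
# Update file: only feature_a was rewritten, so only it is stored.
update = {"feature_a": [0.9, 0.8, 0.7]}

table = read_stitched(base, update)
```

The write-amplification win is visible even in the toy: the update path persisted one column instead of three.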
Follow the efficient column updates thread
REST Spec Remote Signing Vote Passes (Attempt #2)
Alexandre Dutra's second vote attempt to promote the remote signing endpoint to the main Iceberg REST spec received strong support, with multiple binding and non-binding +1 votes from community members including Eduard Tudenhöfner and Dmitri Bourlatchkov. The change moves the signing endpoint to a table-scoped path (/v1/{prefix}/namespaces/{namespace}/tables/{table}/sign), adds an optional provider parameter for future multi-cloud support, and deprecates the old S3-specific spec. This vote resolves a long-running discussion about making credential signing cloud-agnostic.
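A client targeting the new endpoint would build the table-scoped path shown above. The helper below only constructs the URL from the path template quoted in the vote thread; the base URL, names, and the exact query-parameter form of `provider` are illustrative assumptions, and the request/response body schema should be taken from the REST spec itself.

```python
# Sketch of the table-scoped remote-signing path from the vote thread.
# The provider query-parameter form shown here is an assumption for
# illustration; consult the Iceberg REST spec for the authoritative shape.
from urllib.parse import quote

def sign_endpoint(base_url, prefix, namespace, table, provider=None):
    """Build the table-scoped signing URL, optionally carrying the new
    provider parameter intended for future multi-cloud support."""
    path = (f"/v1/{quote(prefix)}/namespaces/{quote(namespace)}"
            f"/tables/{quote(table)}/sign")
    url = base_url.rstrip("/") + path
    if provider:
        url += f"?provider={quote(provider)}"
    return url

# Hypothetical catalog and table names, for illustration only.
url = sign_endpoint("https://catalog.example.com", "prod", "analytics",
                    "events", provider="s3")
```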
V4 Metadata Discussions Continue
Anton Okolnychyi's thread on making the root metadata.json file optional in Iceberg V4 remains active. The core tension is between streaming write performance (where writing metadata.json on every commit creates bottlenecks with HMS and Hadoop catalog backends) and backward compatibility for engines and tools that read the file directly from storage. Two design paths remain under discussion: letting catalogs skip writing the file entirely, or offloading parts of it to external files. Formal proposals are expected in the coming weeks.
Snapshot Expiration Race Condition
Krutika Dhananjay flagged a concurrency bug in Iceberg's snapshot expiration logic this week. A race window exists between when the ExpireSnapshots job computes candidate snapshots and when the commit runs. A concurrent ref addition during that window can cause the maintenance job to remove a live snapshot. The iceberg-go project already has a fix. Amogh Jahagirdar asked for a reproducible test case before confirming the same issue exists in the Java implementation.
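The time-of-check/time-of-use window can be shown with a minimal sketch. All names here are illustrative, not Iceberg's API: candidates are computed from the refs visible at planning time, a ref is added concurrently, and a commit that does not re-validate against current refs removes a snapshot the new ref just made live.

```python
# Minimal sketch of the expiration race described above (illustrative
# names, not Iceberg's API).

def plan_expiration(all_snapshots, refs):
    """Step 1: candidates = snapshots not reachable from any ref."""
    live = set(refs.values())
    return all_snapshots - live

def commit_expiration(all_snapshots, candidates):
    """Step 2: commit the plan WITHOUT re-checking refs (the bug)."""
    return all_snapshots - candidates

snapshots = {"s1", "s2", "s3"}
refs = {"main": "s3"}

candidates = plan_expiration(snapshots, refs)  # {"s1", "s2"}
refs["audit"] = "s2"                           # concurrent ref addition
remaining = commit_expiration(snapshots, candidates)
# "s2" is now referenced by the new "audit" ref, yet it was removed.
```

The fix shipped in iceberg-go amounts to closing this window; re-deriving the live set at commit time (or failing the commit when refs changed) would keep "s2" alive in this sketch.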
Apache Polaris
First Post-Graduation PMC Activity
Apache Polaris officially became a top-level ASF project on February 18, and the new PMC is now operating independently. Community discussion this week focused on what the first major governance actions will look like. The roadmap items drawing the most attention are credential vending expansion for non-AWS storage backends, deeper Delta Lake support through the Generic Table API, and idempotent commit operations for retry-safe catalog writes.
The project, co-created by Dremio, now offers multi-engine support across Apache Spark, Apache Flink, Trino, StarRocks, Apache Doris, and Dremio. As an open alternative to proprietary catalogs such as AWS Glue and Databricks Unity Catalog, Polaris's move to independent governance is a significant milestone for the open lakehouse ecosystem.
Official graduation announcement
Credential Vending Roadmap
One of the most-watched items for the new PMC is expanding credential vending beyond AWS. The project's STS-based credential vending has been a production feature for Iceberg REST catalog integrations, but non-AWS users have been waiting for equivalent support. Azure and GCS backends are the primary targets. Dev list discussions indicate this work is likely to be one of the first formal proposals under the new governance structure.
Apache Arrow
IPC Stream Multiplexing Design
Arrow's dev list continued a technical discussion on IPC stream multiplexing. Rusty Conover pushed back on using QUIC as the transport layer for multi-schema IPC streams, citing a requirement for explicit ordering across batches from different logical streams. QUIC is optimized for independent delivery across streams, which conflicts with use cases that need strict ordering guarantees. The thread is a narrow but technically significant look at Arrow's design choices as it scales toward more complex multi-schema interleaving scenarios.
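The ordering requirement can be made concrete with a toy multiplexer, which is purely illustrative and not Arrow's IPC format: frames from two logical streams are tagged with a stream id and carried over one ordered channel, so the demultiplexer recovers both the per-stream batches and the writer's global interleaving. Transports that deliver each stream independently, as QUIC is optimized to do, cannot guarantee that global order.

```python
# Illustrative sketch (not Arrow's actual IPC format) of multiplexing
# two logical streams over one ordered channel. The global interleaving
# survives only because the transport delivers frames in order.

def mux(frames):
    """Tag each (stream_id, batch) frame; the ordered channel is a list."""
    return [(stream_id, batch) for stream_id, batch in frames]

def demux(channel):
    """Split frames back out per stream while recording global order."""
    per_stream, global_order = {}, []
    for stream_id, batch in channel:
        per_stream.setdefault(stream_id, []).append(batch)
        global_order.append((stream_id, batch))
    return per_stream, global_order

# Writer interleaves batches 0 and 1 from logical streams "a" and "b".
writer_order = [("a", 0), ("b", 0), ("a", 1), ("b", 1)]
per_stream, global_order = demux(mux(writer_order))
```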
Google Summer of Code 2026
Arrow is participating in GSoC 2026, and a prospective contributor, Prasanna, expressed interest this week. Engineers interested in mentoring should watch the dev list for the formal program announcement. Arrow has historically used GSoC as a pipeline for new contributors across its C++, Python, Java, and Go implementations.
Security Model Documentation
Arrow's PMC published a formal security model earlier this month. The document clarifies how Arrow handles security considerations across its library implementations. Its publication reflects the broader pattern of organizational maturity that has defined the Apache lakehouse ecosystem in early 2026, with Polaris graduating, Iceberg formalizing contribution guidelines, and Parquet adding a new PMC member in Andrew Lamb.
Apache Parquet
ALP Encoding Spec Final Review
The Adaptive Lossless floating-Point (ALP) encoding spec continued moving through its final review cycle this week. ALP compresses floating-point columns more efficiently than existing methods, which matters directly for ML feature tables and financial datasets where float columns dominate. Contributors are debating whether the finalized spec should land as a pull request directly against the parquet-format repository. Julien Le Dem called for remaining reviewers to comment before the spec can be formally accepted.
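The intuition behind ALP can be sketched in a few lines. This toy uses a per-value exponent for clarity, whereas the real spec selects parameters per vector and is considerably more refined: many real-world "decimal" floats round-trip exactly through a scaled integer (which then compresses well with integer encodings), and values that don't are stored verbatim as exceptions, keeping the scheme lossless.

```python
# Toy sketch of the idea behind ALP-style encoding (per-value exponent
# here for clarity; the actual spec works per vector and differs in
# detail). Decimal-looking floats become small integers plus an exponent;
# anything that cannot round-trip exactly is kept verbatim as an exception.

def encode(values, max_exp=15):
    encoded = []
    for v in values:
        for e in range(max_exp + 1):
            scaled = round(v * 10**e)
            if scaled / 10**e == v:                 # exact round-trip?
                encoded.append(("int", scaled, e))
                break
        else:
            encoded.append(("exception", v, None))  # store verbatim
    return encoded

def decode(encoded):
    return [v if kind == "exception" else v / 10**e
            for kind, v, e in encoded]

data = [1.5, 3.14159, 0.125, 2.0]
roundtrip = decode(encode(data))
```

Losslessness comes from the exactness check before committing to the integer form, which is why the scheme suits the float-heavy ML feature tables and financial datasets mentioned above.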
Parquet Java 1.17.0 Production Adoption
Parquet Java 1.17.0, released on January 13, is now moving through broader production adoption. This version drops Java 8 support and sets Java 11 as the new minimum runtime. Teams running Java 8 JVMs need a migration plan. Apache Iceberg, Trino, and Spark all depend on Parquet Java for core read and write paths, making this a coordinated upgrade across the broader ecosystem.
Cross-Project Themes
Two themes defined the Apache lakehouse ecosystem this week. The first is AI/ML as a first-class workload. The Iceberg efficient column updates proposal is explicitly designed for feature stores and vector databases. Parquet's ALP encoding targets the floating-point columns that ML workloads generate. Polaris's credential vending roadmap supports the multi-cloud environments where AI infrastructure is deployed. The format-level work happening across all four projects is converging on the same direction: the open lakehouse stack needs to handle wide-table ML patterns without write amplification or metadata bottlenecks.
The second theme is governance maturity translating into focused roadmaps. Polaris's graduation gives it independent authority to set priorities. Iceberg's V4 discussions are producing formal sync meetings and dedicated design docs. Parquet's ALP spec is in final review. These are not speculative discussions — they are active engineering work with clear timelines and community accountability.
Looking Ahead
The Iceberg efficient column updates sync on March 4 should produce a clearer proposal structure in the coming weeks. The V4 metadata.json thread is likely to move from debate to draft proposal. Parquet's ALP encoding spec needs its final review pass before it can merge. Polaris's first major PMC decisions will set the tone for the project's independent governance. Arrow's GSoC mentor recruitment is the near-term community action item to watch.