Alex Merced

Apache Data Lakehouse Weekly: March 3–10, 2026

The Apache lakehouse ecosystem had another active week. AI and ML workloads continued to drive design decisions across all four projects. Polaris is settling into its new life as a top-level Apache project, Iceberg is wrestling with wide-table write patterns that feature stores demand, Arrow is sharpening its IPC transport options, and Parquet's ALP floating-point encoding is moving toward a final vote. The open lakehouse stack is growing up fast.

Apache Iceberg

Efficient Column Updates: From Sync to Proposal

The biggest Iceberg conversation this week followed a dedicated community sync on March 4. Anurag Mantripragada organized the call to push forward the efficient column updates proposal. The problem is real: tables with thousands of columns, common in ML feature stores, suffer severe write amplification when Copy-on-Write and Merge-on-Read operations touch even a small number of columns. The proposed fix writes only updated columns to separate column files and stitches them together at read time.
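To make the amplification concrete, here is a back-of-the-envelope sketch with made-up numbers (illustrative only, not figures from the proposal):

```python
# Illustrative write-amplification estimate for a Copy-on-Write update
# that touches a handful of columns in a very wide table.

def cow_write_amplification(total_columns: int, updated_columns: int,
                            bytes_per_column: int) -> float:
    """Ratio of bytes rewritten to bytes logically changed under
    Copy-on-Write, assuming uniform column sizes: every column in an
    affected data file is rewritten, not just the updated ones."""
    rewritten = total_columns * bytes_per_column
    changed = updated_columns * bytes_per_column
    return rewritten / changed

# A 5,000-column feature table where a job refreshes 10 columns:
amp = cow_write_amplification(total_columns=5000, updated_columns=10,
                              bytes_per_column=1_000_000)
print(amp)  # 500.0 — 500x more bytes written than logically changed
```

Writing only the updated columns to their own files collapses that ratio toward 1, which is the whole point of the proposal.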

Gábor Kaszab presented a proof-of-concept pull request (#15445) showing that the metadata and API surface are manageable. The design builds on the V4 architecture foundation and the single-file commit work already underway. Full column file updates and partial file-level updates appear to be the most practical middle ground between implementation complexity and performance gain. Formal proposals are expected in coming weeks.
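The read-time stitch can be pictured with a minimal plain-Python model, where each "column file" holds only the columns an update touched, row-aligned with the base file. This is an illustration of the idea, not the proposed Iceberg API:

```python
# Minimal model of column-file stitching: start from the base data file's
# columns, then overlay each column file in commit order so updated
# columns shadow the stale ones. Illustrative sketch only.

def stitch(base: dict, column_files: list) -> dict:
    """Assemble the logical table from a base file plus column files."""
    table = dict(base)
    for cf in column_files:
        table.update(cf)  # newer column files shadow the base columns
    return table

base = {
    "id": [1, 2, 3],
    "feature_a": [0.1, 0.2, 0.3],
    "feature_b": [10.0, 20.0, 30.0],
}
update = {"feature_b": [11.0, 21.0, 31.0]}  # only the touched column is written

table = stitch(base, [update])
print(table["feature_b"])  # [11.0, 21.0, 31.0]
```

The write path touches one column's worth of data; the cost moves to the reader, which must locate and merge the column files — exactly the complexity/performance tradeoff the sync debated.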

See the efficient column updates thread

REST Spec Remote Signing Vote Passes

Alexandre Dutra's second vote attempt on remote signing landed cleanly this week. The change promotes the remote signing endpoint into the main Iceberg REST spec, moves it to a table-scoped path, adds an optional provider parameter for multi-cloud support, and deprecates the old S3-specific spec. Multiple binding +1 votes came in from Eduard Tudenhöfner and Dmitri Bourlatchkov, among others. This vote closes a long-running discussion about making credential signing cloud-agnostic. Non-AWS lakehouse deployments benefit directly.
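The shape of a table-scoped signing request might look like the sketch below. The path and field names here are hypothetical placeholders to show what "table-scoped" and "optional provider parameter" mean in practice; consult the updated REST spec for the actual contract:

```python
# Hypothetical request builder for a table-scoped remote-signing call.
# Path and body fields are illustrative, NOT the real spec's names.

def build_sign_request(prefix: str, namespace: str, table: str,
                       method: str, uri: str, provider: str = None):
    # Table-scoped path: the signer knows which table the request is for.
    path = f"/v1/{prefix}/namespaces/{namespace}/tables/{table}/sign"
    body = {"method": method, "uri": uri}
    if provider is not None:
        body["provider"] = provider  # optional multi-cloud hint (e.g. "gcs")
    return path, body

path, body = build_sign_request("prod", "analytics", "events",
                                "GET", "s3://bucket/data/file.parquet",
                                provider="gcs")
print(path)
```

Scoping the endpoint to a table lets the catalog enforce per-table authorization on signing, and the provider parameter is what makes the mechanism cloud-agnostic rather than S3-specific.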

See the vote thread

V4 Metadata: The Optional Root File Debate

Anton Okolnychyi's thread on making the root metadata.json file optional in Iceberg V4 is still open. The tension is between streaming write performance, where writing metadata.json on every commit creates bottlenecks with HMS and Hadoop catalog backends, and backward compatibility for engines that read the file directly from storage. Two approaches remain: let catalogs skip the file entirely, or offload sections to external files. No winner yet, but the discussion is tightening.

Snapshot Expiration Race Condition

Krutika Dhananjay flagged a concurrency bug in Iceberg's snapshot expiration logic. A race window exists between when ExpireSnapshots computes candidate snapshots and when the commit runs. A concurrent ref addition during that window can cause the job to delete a live snapshot. The iceberg-go project already carries a fix. Amogh Jahagirdar asked for a reproducible test case before confirming the issue exists in the Java implementation.
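The race, and the obvious defensive fix, can be modeled in a few lines. This is a toy model of the reported behavior, not Iceberg's actual implementation:

```python
# Toy model of the race: ExpireSnapshots computes candidates from a
# point-in-time view of refs, but a ref can be added before the commit.
# Re-validating refs at commit time closes the window. Illustrative only.

def compute_candidates(snapshots: set, refs: dict) -> set:
    """Snapshots not reachable from any ref, per the planning-time view."""
    return snapshots - set(refs.values())

def commit_expiration(snapshots: set, refs: dict, candidates: set) -> set:
    # Guard: re-read refs and drop any candidate that became live
    # during the window between planning and commit.
    still_safe = candidates - set(refs.values())
    return snapshots - still_safe

snapshots = {1, 2, 3}
refs = {"main": 3}
candidates = compute_candidates(snapshots, refs)  # {1, 2}

refs["feature-branch"] = 2  # concurrent ref addition during the window

remaining = commit_expiration(snapshots, refs, candidates)
print(sorted(remaining))  # [2, 3] — snapshot 2 survives the re-check
```

Without the commit-time re-check, snapshot 2 would be deleted even though the new ref points at it — which is the live-snapshot deletion the report describes.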

Apache Polaris

Post-Graduation PMC Takes Shape

Apache Polaris officially graduated to a top-level Apache Software Foundation project on February 18. The new PMC is now setting its own priorities. The roadmap items drawing the most discussion this week are credential vending expansion beyond AWS, deeper Delta Lake support through the Generic Table API, and idempotent commit operations for retry-safe catalog writes.

Polaris now serves multi-engine environments spanning Apache Spark, Apache Flink, Trino, StarRocks, Apache Doris, and Dremio. As the open alternative to proprietary catalogs like AWS Glue and Databricks Unity Catalog, Polaris benefits directly from independent governance. This graduation signals that the open lakehouse catalog layer is no longer an experiment.

Official Polaris graduation announcement

Credential Vending for Non-AWS Backends

The most-watched near-term work item is credential vending beyond AWS STS. Azure and GCS backends are the primary targets. Iceberg REST catalog integrations have relied on STS-based credential vending for AWS deployments, but non-AWS environments have been waiting. Dev list discussions indicate the new PMC will treat this as a first-order proposal under its independent governance. Expect a formal thread soon.

Apache Arrow

IPC Stream Multiplexing: QUIC vs. Ordered Transport

Arrow's dev list carried a focused technical debate this week on IPC stream multiplexing. Rusty Conover pushed back on using QUIC as the transport layer for multi-schema IPC streams. His concern is that QUIC is designed for independent delivery across streams, which conflicts with use cases that require strict ordering guarantees across batches from different logical streams. The thread highlights Arrow's evolving role as a transport layer, not just an in-memory format, and the design tradeoffs that come with that.
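One way to picture the concern: multiplexing logical streams over a single ordered transport preserves global batch order by construction, because every frame travels through one sequenced pipe. QUIC's independently delivered streams give up exactly that guarantee. A framing sketch (illustrative, not Arrow's wire protocol):

```python
import struct

# Frame layout: 4-byte stream id + 4-byte payload length + payload.
# Writing all frames onto one ordered transport (TCP-style) preserves
# the global order in which batches from different logical streams
# were produced; independent QUIC streams would not.

def write_frame(buf: bytearray, stream_id: int, payload: bytes) -> None:
    buf += struct.pack("<II", stream_id, len(payload)) + payload

def read_frames(buf: bytes):
    off = 0
    while off < len(buf):
        stream_id, n = struct.unpack_from("<II", buf, off)
        off += 8
        yield stream_id, buf[off:off + n]
        off += n

wire = bytearray()
write_frame(wire, 1, b"batch-a")  # logical stream 1
write_frame(wire, 2, b"batch-b")  # logical stream 2
write_frame(wire, 1, b"batch-c")  # stream 1 again

order = [sid for sid, _ in read_frames(bytes(wire))]
print(order)  # [1, 2, 1] — cross-stream order survives the round trip
```

The cost of the ordered approach is head-of-line blocking, which is precisely what QUIC was built to avoid — hence the tradeoff the thread is weighing.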

Security Model Formalized

Arrow's PMC published a formal security model this month. The document clarifies how Arrow handles security considerations across its library implementations. This publication fits the broader pattern of governance maturity across the lakehouse ecosystem in early 2026. Projects are translating informal practices into written policy.

GSoC 2026 Participation

Arrow is participating in Google Summer of Code 2026. The project is accepting student proposals across its core libraries. Contributors interested in columnar format work, compute kernels, or language bindings have an opening here.

Apache Parquet

ALP Encoding Moves Toward Final Vote

The Adaptive Lossless floating-Point (ALP) encoding spec is in its final review cycle. ALP compresses floating-point columns more efficiently than existing Parquet encodings. That matters most for ML feature tables and financial datasets where float columns dominate. Contributors are debating whether the finalized spec should land as a direct pull request against the parquet-format repository. Julien Le Dem called for remaining reviewers to comment before the spec can be formally accepted. A vote appears close.
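The core trick, heavily simplified: many real-world doubles are decimals in disguise, so they can be scaled to small integers and stored with a single exponent, provided the round trip is exact. A sketch of that idea (the actual spec adds per-vector exponent sampling, an exception list for values that fail the round trip, and frame-of-reference plus bit-packing on the integers):

```python
# Simplified sketch of ALP's central idea: find a decimal exponent e such
# that every value round-trips exactly through round(v * 10**e) / 10**e,
# then store the integers plus the exponent. Illustrative only — the real
# encoding handles exceptions and compresses the integers further.

def alp_encode(values, max_exp: int = 18):
    for e in range(max_exp + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        if all(d / scale == v for d, v in zip(ints, values)):
            return e, ints
    return None  # not decimal-like: fall back to another encoding

def alp_decode(e, ints):
    scale = 10 ** e
    return [d / scale for d in ints]

e, ints = alp_encode([1.25, 3.5, 0.75, 10.0])
print(e, ints)  # 2 [125, 350, 75, 1000]
assert alp_decode(e, ints) == [1.25, 3.5, 0.75, 10.0]
```

Because the integers are small and clustered, they compress far better than raw IEEE 754 bit patterns — which is why float-heavy feature tables stand to gain the most.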

Cross-Project Themes

Two themes ran through this week's discussions across all four projects. The first is AI and ML workloads as a design constraint, not an afterthought. Iceberg's efficient column updates proposal targets feature stores and vector databases directly. Parquet's ALP encoding targets the float columns that ML workloads generate. Polaris's credential vending roadmap supports the multi-cloud environments where AI infrastructure runs. The format-level work is converging on a clear direction: the open lakehouse stack must handle wide-table ML patterns without write amplification or metadata overhead.

The second theme is governance maturity producing focused roadmaps. Polaris has independent PMC authority to set its priorities. Iceberg's V4 discussions are producing formal community syncs and design documents. Parquet's ALP spec is close to a formal vote. Arrow's security model is now written down. The projects are moving from informal consensus to structured decision-making.

Looking Ahead

Watch for Polaris's first formal proposals under its new PMC, particularly around credential vending for Azure and GCS. Expect the Iceberg efficient column updates thread to produce a written design proposal after the March 4 sync. The Parquet ALP vote should close soon. Arrow's GSoC 2026 student proposals are a signal of where new contributors are looking to engage.


Start Building on the Open Lakehouse with Dremio

If you want to query your Iceberg tables across multiple clouds without managing complex infrastructure, start your free 30-day Dremio trial and see the Agentic Lakehouse Platform in action.

Dremio runs Apache Polaris as its open catalog, supports Apache Iceberg natively, and lets you query across lakehouses, data warehouses, and databases without copying data. It is the fastest way to turn your open data into production-ready analytics.
