Apache Data Lakehouse Weekly: April 9–15, 2026

#dataengineering #news #database #opensource

The Iceberg Summit wrapped in San Francisco, leaving behind a set of in-person alignments that are now surfacing as concrete proposals on the dev lists. Parquet's ALP encoding vote closed, Polaris 1.4.0 planning accelerated, and Arrow's engineering community tackled two interlinked decisions about its future Java baseline and AI tooling policy. The post-summit week is when talk becomes code.

Apache Iceberg

The two days in San Francisco established alignment on the discussions that have dominated the dev list all spring. The V4 metadata.json optionality thread drew the largest in-person audience of any design session, with Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu working through the portability and static-table implications of making the root JSON file optional when a catalog manages metadata state. The direction that emerged favors catalog-managed metadata as a first-class supported mode, with portability guarantees preserved through explicit opt-in semantics rather than the current default assumption.

The one-file commits design — the work Russell Spitzer and Amogh Jahagirdar have been advancing through multiple proposals — is heading toward a formal spec write-up following alignment reached at the summit. The approach replaces manifest lists with root manifests and uses manifest delete vectors to enable single-file commits, promising dramatic reductions in commit latency and metadata storage footprint. This is one of the most consequential V4 changes for high-frequency write workloads, and the in-person sessions cleared the remaining design disagreements about inline versus external manifest delete vectors.

Péter Váry's efficient column updates proposal for AI and ML workloads drew real interest at the summit. The design targets wide tables where only a subset of columns change on each write — embedding vectors, model scores, feature values — allowing Iceberg to write only the updated columns to separate files and merge at read time. For teams managing petabyte-scale feature stores, the I/O savings are significant. Péter indicated that a formal proposal with POC benchmarks would land on the dev list in the days following the summit.
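The read-time merge described above can be pictured with a small sketch. This is an illustration only, not the proposal's actual file layout: it assumes a base file and each update file can be modeled as a mapping from column name to a full list of row values, and the names `read_merged` and `cells_written` are hypothetical.

```python
def read_merged(base, updates):
    """Overlay column update files (oldest first) on a base file at read time."""
    row_count = len(next(iter(base.values()), []))
    merged = dict(base)  # shallow copy, so the base file is left untouched
    for update in updates:
        for column, values in update.items():
            if len(values) != row_count:
                raise ValueError(
                    f"column {column!r} has {len(values)} rows, expected {row_count}"
                )
            merged[column] = values  # the newest version of a column wins
    return merged


def cells_written(columns):
    """Count the cells a writer would have to emit for the given columns."""
    return sum(len(values) for values in columns.values())


# Hypothetical wide table: only the model score changes on this write.
base = {
    "id": [1, 2, 3],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "score": [0.5, 0.6, 0.7],
}
score_update = {"score": [0.9, 0.8, 0.7]}

table = read_merged(base, [score_update])
```

In this toy model the update writes 3 cells instead of the 9 a full rewrite would need, which is the kind of saving the proposal is after on tables with hundreds of columns.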

The AI contribution policy discussion, which drew in Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun over the preceding weeks, moved toward practical resolution. The summit provided the in-person clarity that async debate rarely does, and a working policy covering disclosure requirements and code provenance standards for AI-generated contributions is expected to be published on the dev list this week.

Apache Polaris

Polaris is nearly two months past its February 18 graduation as a top-level Apache project, and the governance machinery is running. Jean-Baptiste Onofré's first board report as a TLP covers the March 26 ASF board meeting, documenting community health, development progress, and strategic direction under Polaris's own PMC. JB also joined the Apache Software Foundation board itself as a Dremio-nominated director, a governance milestone that deepens the open-source commitment across the entire ecosystem.

The Apache Ranger authorization RFC from Selvamohan Neethiraj remained the most active technical discussion thread. The design allows organizations running Ranger alongside Hive, Spark, and Trino to manage Polaris security within a unified governance framework, eliminating the policy duplication that arises when teams bolt separate authorization systems onto each engine. The plugin is opt-in and backward compatible with Polaris's existing internal authorization layer, a design choice that lowers the enterprise adoption barrier considerably.

The 1.4.0 release, Polaris's first as a graduated project, is now in active scope finalization. Credential vending for Azure and Google Cloud Storage is the headline feature, alongside a catalog federation design that lets Polaris act as a front end for multiple catalog backends in multi-cloud deployments. With incubator overhead behind it, release velocity is expected to accelerate. Watch the dev list this week for a 1.4.0 milestone thread and vote timeline.

Apache Arrow

Jean-Baptiste Onofré's thread proposing JDK 17 as the minimum version for Arrow Java 20.0.0 is approaching decision. Contributors including Micah Kornfield and Antoine Pitrou have been weighing in, and the practical rationale is compelling: setting JDK 17 as the floor would align Arrow's Java modernization with Iceberg's own upgrade timeline, effectively raising the minimum across the entire lakehouse stack in a single coordinated move. The decision is expected to land before the 20.0.0 release cycle formally opens.

The arrow-rs 58.2.0 release was on track for April, following the 58.1.0 shipment in March, which arrived with no breaking API changes. The Rust implementation has become one of the most actively maintained segments of the Arrow ecosystem, with a release cadence that matches growing adoption in query engines that want Arrow's columnar format without a JVM dependency.

Nic Crane's thread on using LLMs for Arrow project maintenance continued to generate thoughtful discussion. The framing, AI as a resource for maintainers rather than just contributors, is distinct from how Iceberg and Polaris are approaching the same question. Arrow's angle is practical: a lean maintainer group managing a growing issue backlog needs help triaging, and LLMs can do that work without introducing the code-provenance concerns that matter for contributions.

Google Summer of Code 2026 student proposals arrived this week, with interest concentrated in compute kernels and language bindings for Go and Swift, adding bandwidth to a project that will need it as the 20.0.0 cycle opens.

Apache Parquet

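The ALP (Adaptive Lossless floating-Point) encoding specification vote closed this week, marking one of the most meaningful additions to the Parquet specification in recent memory. As described in the original ALP design, the encoding exploits the fact that many floating-point columns hold values that started out as decimals: it scales each value by a power of ten to recover an integer, bit-packs those integers relative to a common base, and stores any value that does not round-trip exactly as an exception. The result is significantly better compression ratios for float-heavy columns with no loss of precision. The practical beneficiaries are ML feature stores and scientific computing workloads, where columns full of embedding coordinates and model outputs are common. Months of careful spec review paid off.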
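To make the idea concrete, here is a minimal sketch of ALP-style decimal encoding as the published ALP design describes it: scale each value by a power of ten, keep the resulting integer only if it decodes back to the same double, and record everything else as an exception. It is a simplification under stated assumptions, not the Parquet specification's layout: it uses a single exponent per block, reports the bit width rather than actually bit-packing, and the names `alp_encode` and `alp_decode` are hypothetical.

```python
import math


def _to_scaled_int(v, scale):
    """Return an integer n such that n / scale == v, or None if rounding finds none."""
    scaled = v * scale
    if not math.isfinite(scaled):  # NaN, infinities, or overflow after scaling
        return None
    n = round(scaled)
    if n / scale != v:  # would not decode back to the same double
        return None
    if n == 0 and math.copysign(1.0, v) < 0:  # keep -0.0 distinct from 0.0
        return None
    return n


def alp_encode(values, max_exponent=14):
    """Pick the first power-of-ten exponent that leaves the fewest exceptions."""
    best = None
    for e in range(max_exponent + 1):
        scale = 10 ** e
        ints = [_to_scaled_int(v, scale) for v in values]
        exceptions = {i: values[i] for i, n in enumerate(ints) if n is None}
        if best is None or len(exceptions) < len(best[2]):
            best = (e, ints, exceptions)
        if not exceptions:
            break
    e, ints, exceptions = best
    # Frame of reference: store each integer as an offset from the minimum.
    base = min((n for n in ints if n is not None), default=0)
    deltas = [0 if n is None else n - base for n in ints]
    return {
        "exponent": e,
        "base": base,
        "bit_width": max(deltas, default=0).bit_length(),
        "deltas": deltas,
        "exceptions": exceptions,  # position -> original double, stored verbatim
    }


def alp_decode(block):
    """Rebuild the original doubles from offsets, then patch in the exceptions."""
    scale = 10 ** block["exponent"]
    out = [(d + block["base"]) / scale for d in block["deltas"]]
    for i, v in block["exceptions"].items():
        out[i] = v
    return out


prices = [1.23, 4.5, 100.07, 0.1]
block = alp_encode(prices)
```

For this input the sketch settles on exponent 2, turning the values into the integers 123, 450, 10007, and 10, which need 14 bits each after subtracting the base instead of 64. A value such as `math.pi` has no short decimal form, so it would land in the exception list and be stored as-is.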

The Variant type that shipped in February has been generating follow-on integration discussion across engine teams. Spark, Trino, and Dremio contributors compared notes on their implementation experiences this week, working through edge cases in semi-structured data handling that the spec leaves partially open. Getting these implementations to converge matters: Parquet's value as a cross-engine format depends on consistent behavior, and Variant is novel enough that divergence between engines would fragment the ecosystem.

The File logical type proposal — which would allow Parquet files to natively embed unstructured data including images, PDFs, and audio as columnar records — continued advancing through community discussion. Alongside Variant, this proposal signals a deliberate effort to evolve Parquet from a purely analytical format into a unified storage layer capable of managing the diverse data shapes that AI and ML pipelines produce. The direction is ambitious and the community engagement is substantive.

Cross-Project Themes

The first cross-project theme is timing: the post-summit week is when the conversations that happened in person translate back into the formal proposals and vote threads that actually change the projects. Across all four lists, expect the next two weeks to be among the most active of 2026 as in-person alignments hit the dev list in concrete form.

The second theme connecting all four projects is the deliberate expansion of format scope to meet AI workload demands. Parquet's ALP acceptance, the File logical type proposal, Iceberg's efficient column updates for wide ML tables, Polaris's Ranger integration and federation work, and Arrow's JDK 17 modernization are all responses to the same underlying pressure: the lakehouse stack is being asked to power AI pipelines, not just analytical queries. The pace of that evolution is accelerating, and the summit put the community's roadmap on the same page.

Looking Ahead

Watch the Iceberg dev list for the V4 metadata optionality formal proposal, the one-file commits spec write-up, and a published AI contribution policy. The Polaris 1.4.0 milestone thread and vote timeline should also land this week. Arrow's JDK 17 decision for Java 20.0.0 will likely follow close behind. The summit session recordings will appear on YouTube in the weeks ahead, an excellent resource for anyone who missed San Francisco.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free — Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow — Learn how Dremio brings the open lakehouse stack together

Free Downloads