Apache Data Lakehouse Weekly: April 16–22, 2026

#architecture #dataengineering #news #opensource

Two weeks after the Iceberg Summit, the agreements reached in person in San Francisco are turning into formal proposals and code on the dev lists. Iceberg's V4 design work continued to consolidate, Polaris kept moving toward its 1.4.0 milestone, Parquet's Geospatial spec picked up a cleanup commit from a new contributor, and Arrow's release engineering and Java modernization discussions stayed active.

Apache Iceberg

The post-summit V4 design work remained the defining thread on the Iceberg dev list this week. The V4 metadata.json optionality discussion that Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu drove through March kept narrowing to practical design questions. The concrete direction emerging from the summit is to treat catalog-managed metadata as a first-class mode while preserving static-table portability through explicit opt-in semantics, replacing the current implicit assumption that the root JSON file is always present.

Russell Spitzer and Amogh Jahagirdar's one-file commits design moved toward a formal spec write-up this week. The approach replaces manifest lists with root manifests and introduces manifest delete vectors, enabling single-file commits that cut metadata write overhead dramatically for high-frequency writers. The in-person sessions at the summit cleared the last design disagreements about inline versus external manifest delete vectors, and the community is now aligning on the implementation plan.
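To make the idea concrete, here is a toy Python sketch of how a root manifest with delete vectors could let a commit write a single new file. Everything in it is an assumption for illustration: the names `Manifest`, `RootManifest`, `live_files`, and `commit`, the choice to hold new entries inline in the root, and the in-memory layout are hypothetical and do not come from the V4 proposal or any Iceberg API.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Manifest:
    """An already-written manifest; never rewritten in this toy model."""
    manifest_id: str
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class RootManifest:
    """Hypothetical root: inline entries, manifest references, delete vectors."""
    inline_files: Tuple[str, ...] = ()
    manifests: Tuple[Manifest, ...] = ()
    # manifest_id -> positions inside that manifest that are no longer live
    delete_vectors: Dict[str, FrozenSet[int]] = field(default_factory=dict)


def live_files(root: RootManifest) -> List[str]:
    """Resolve the table's current data files by masking deleted positions."""
    files = list(root.inline_files)
    for manifest in root.manifests:
        deleted = root.delete_vectors.get(manifest.manifest_id, frozenset())
        files.extend(
            path for pos, path in enumerate(manifest.data_files)
            if pos not in deleted
        )
    return files


def commit(root: RootManifest,
           added: Iterable[str] = (),
           removed: Iterable[str] = ()) -> RootManifest:
    """Build the next root; existing manifests are referenced, not rewritten."""
    removed_set = set(removed)
    vectors = dict(root.delete_vectors)
    for manifest in root.manifests:
        hits = {pos for pos, path in enumerate(manifest.data_files)
                if path in removed_set}
        if hits:
            previous = vectors.get(manifest.manifest_id, frozenset())
            vectors[manifest.manifest_id] = previous | hits
    inline = tuple(p for p in root.inline_files if p not in removed_set)
    return RootManifest(inline + tuple(added), root.manifests, vectors)
```

In this model, removing a file only adds a position to a delete vector in the new root, so the only object a writer produces per commit is the root itself. Whether the real design stores delete vectors inline or externally is exactly the kind of detail the spec write-up will settle.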

Péter Váry's efficient column updates proposal for AI and ML workloads drew steady engagement. For wide feature tables, the design lets Iceberg write only the columns that change in each update, then stitch the full rows back together at read time. For teams managing petabyte-scale feature stores with embedding vectors and model scores, the I/O savings are meaningful because unchanged columns are never rewritten. Anurag Mantripragada and Gábor Herman are working alongside Péter on POC benchmarks to support the formal proposal.
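The write-narrow, stitch-at-read idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the proposal's design: the `ColumnFile` type and the `stitch` function are hypothetical names, files are plain dictionaries, and rows are matched purely by position.

```python
from typing import Dict, List

# Hypothetical toy model: a "file" maps column name -> values aligned by row
# position. This is not Iceberg's file layout or API.
ColumnFile = Dict[str, List[object]]


def stitch(base: ColumnFile, updates: List[ColumnFile]) -> ColumnFile:
    """Return the read-time view: newer column files override older columns."""
    num_rows = len(next(iter(base.values()), []))
    view = dict(base)  # shallow copy, so untouched columns are reused as-is
    for update in updates:  # applied oldest to newest
        for name, values in update.items():
            if len(values) != num_rows:
                raise ValueError(
                    f"column {name!r} has {len(values)} rows, expected {num_rows}"
                )
            view[name] = values
    return view


# A wide feature table where only the model score changes between writes.
base = {
    "user_id": [1, 2, 3],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "score": [0.5, 0.6, 0.7],
}
score_update = {"score": [0.9, 0.8, 0.7]}  # the only column written this time
table = stitch(base, [score_update])
```

The update carries one column instead of three, and the large `embedding` column is shared between the base and the stitched view without being copied or rewritten.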

The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is moving toward published guidance. The summit provided the in-person alignment that async debate rarely produces, and a working policy covering disclosure requirements and code provenance standards for AI-generated contributions is expected on the dev list in the next couple of weeks. Polaris is navigating the same question in parallel, and the two communities are likely to converge on a shared approach given their overlapping contributor base.

Apache Polaris

The Polaris 1.4.0 release is in active scope finalization as the project's first release since graduating to top-level status on February 18. Credential vending for Azure and Google Cloud Storage is the headline feature, alongside catalog federation that lets one Polaris instance front multiple catalog backends across clouds. The schedule-driven release model calls for a release intent email to the dev list about a week before the RC cut, so watch the list for that thread shortly.

The Apache Ranger authorization RFC from Selvamohan Neethiraj remained the most active governance discussion. The plugin lets organizations running Ranger with Hive, Spark, and Trino manage Polaris security within the same policy framework, eliminating the policy duplication that arises when teams bolt separate authorization onto each engine. It is opt-in and backward compatible with Polaris's internal authorization layer, which lowers the enterprise adoption barrier considerably.

On the community side, Polaris's blog continued its post-graduation cadence with an April 4 post on building a fully integrated, locally running open data lakehouse in under 30 minutes using k3d, Apache Ozone, Polaris, and Trino. The Polaris PMC also shipped a March 29 post covering automated entity management for catalogs, principals, and roles. With incubator overhead behind it, release velocity has picked up noticeably since the 1.3.0 release on January 16.

Apache Arrow

Arrow's release calendar shows arrow-rs 58.2.0 landing this month, following 58.1.0 in March, which shipped with no breaking API changes. The cadence has held at roughly one minor version per month, with 59.0.0 already scheduled for May as a major release that may include breaking changes. The Rust implementation has become one of the most actively maintained parts of the Arrow ecosystem, and its DataFusion integration is drawing engines that want Arrow without a JVM dependency.

Jean-Baptiste Onofré's JDK 17 minimum proposal for Arrow Java 20.0.0 continued drawing input from Micah Kornfield and Antoine Pitrou. The practical rationale is coordination: setting JDK 17 as Arrow's Java baseline aligns with Iceberg's own upgrade timeline and effectively raises the minimum across the entire lakehouse stack in a single coordinated move. The decision is expected before the 20.0.0 release cycle formally opens.

Nic Crane's thread on using LLMs for Arrow project maintenance continued generating discussion. The framing, AI as a resource for maintainers and not just contributors, is distinct from how Iceberg and Polaris are approaching their AI policies. Arrow's angle is practical: a lean maintainer group managing a growing issue backlog needs help triaging, and LLMs can do that work without introducing the code-provenance concerns that matter for contributions. Google Summer of Code 2026 student proposals that landed in early April are being sorted this week, with interest concentrated in compute kernels and the Go and Swift language bindings.

Apache Parquet

Parquet's week centered on hardening the Geospatial spec that was adopted earlier this year. Milan Stefanovic merged PR #560 on April 20, clarifying the Geospatial spec wording for coordinate reference systems. The change documents existing CRS usage practice for the default OGC:CRS84 system and removes ambiguity caught during implementation reviews. Small spec-hardening commits like this are how a new type goes from "shipped" to "production-reliable" across engines.

The community blog effort continued alongside the spec work. The Native Geospatial Types blog that Jia Yu and Dewey Dunnington published on February 13 remains the community's reference explainer, and Andrew Lamb has been coordinating with Aihua Xu on the companion Variant blog post. Spotlighting recent additions through the Parquet blog is part of a deliberate push to give the project the same kind of voice that DataFusion and Arrow have built.

The ALP encoding that cleared its acceptance vote in the prior week moved into implementation discussion. Engine teams across Spark, Trino, Dremio, and DataFusion are comparing notes on how to integrate ALP into their Parquet readers, with compression gains for float-heavy ML feature stores as the immediate benefit. The File logical type proposal for unstructured data (images, PDFs, audio) also kept advancing in community discussion, extending Parquet's scope beyond pure analytics.
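For readers new to ALP (Adaptive Lossless floating-Point), the core trick is that many real-world floats are really short decimals, so they can be scaled to integers that compress far better than raw IEEE 754 bits. The Python sketch below illustrates only that decimal-scaling step with an exception list. It is a simplified illustration under my own assumptions, not the Parquet encoding: the function names `encode` and `decode`, the single-exponent search, and the `max_exp` parameter are hypothetical, and the integer bit-packing that follows in a real encoder is omitted.

```python
import math
from typing import Dict, List, Tuple

# (chosen exponent, scaled integers, exceptions keyed by position)
Encoded = Tuple[int, List[int], Dict[int, float]]


def _round_trips(value: float, scaled: int, scale: float) -> bool:
    """True only if decoding gives back the same float, including its sign."""
    back = scaled / scale
    return back == value and math.copysign(1.0, back) == math.copysign(1.0, value)


def encode(values: List[float], max_exp: int = 10) -> Encoded:
    """Try each power of ten and keep the one with the fewest exceptions."""
    best = None
    for exp in range(max_exp + 1):
        scale = 10.0 ** exp
        ints: List[int] = []
        exceptions: Dict[int, float] = {}
        for i, v in enumerate(values):
            product = v * scale
            # NaN, infinities, and overflowed products cannot become integers.
            scaled = round(product) if math.isfinite(product) else None
            if scaled is not None and _round_trips(v, scaled, scale):
                ints.append(scaled)
            else:
                ints.append(0)      # placeholder keeps positions aligned
                exceptions[i] = v   # stored verbatim, so decoding stays lossless
        if best is None or len(exceptions) < len(best[2]):
            best = (exp, ints, exceptions)
    return best


def decode(exp: int, ints: List[int], exceptions: Dict[int, float]) -> List[float]:
    """Invert encode: divide by the scale, or use the stored exception."""
    scale = 10.0 ** exp
    return [exceptions.get(i, n / scale) for i, n in enumerate(ints)]


exp, ints, exceptions = encode([1.5, 2.25, 3.125])
restored = decode(exp, ints, exceptions)
```

For the sample input the search settles on a scale of 10^3, turning the values into the small integers 1500, 2250, and 3125 with no exceptions. Small integers in a narrow range are what make the follow-on integer compression effective, which is where the gains for float-heavy feature data come from.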

Cross-Project Themes

The summit's downstream effect is now visible across every dev list. Iceberg's V4 work, Polaris's 1.4.0 scope, Arrow's JDK 17 decision, and Parquet's Geospatial cleanup are running in parallel, and cross-project coordination on shared questions like AI contribution policy and Java baselines has intensified. The JDK 17 alignment is the clearest case: moving Arrow Java 20.0.0, Iceberg's next major release, and downstream engines to the same floor in a single window removes years of compatibility friction.

The second pattern is the steady expansion of format scope to meet AI workloads. Iceberg's efficient column updates, Parquet's File logical type, the Geospatial spec hardening, and Polaris's multi-cloud federation all respond to the same pressure: the lakehouse stack is being asked to power AI pipelines, not just analytical queries. Each project is making changes that only make sense if you assume the next decade's workloads look different from the last.

Looking Ahead

Watch for the V4 single-file commits formal spec write-up and the metadata optionality vote on the Iceberg dev list, along with a published AI contribution policy. The Polaris 1.4.0 release intent email should land in the coming days. Arrow's JDK 17 baseline decision for Java 20.0.0 is close to a vote, and arrow-rs 58.2.0 should ship before the end of the month. Iceberg Summit 2026 session recordings are also rolling out on the project's YouTube channel.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free — Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow — Learn how Dremio brings the open lakehouse stack together

Free Downloads