Apache Data Lakehouse Weekly: March 16 – April 2, 2026


The past two and a half weeks have been defined by one looming event: the Iceberg Summit on April 8–9 in San Francisco. As the community counts down to its largest-ever gathering, the dev lists across all four projects have reflected a mix of final-push stabilization, governance maturation, and forward-looking technical proposals. A cross-project debate on AI contribution policies reached its most active phase, Nvidia's GTC conference reshaped the hardware landscape underneath the lakehouse stack, and Polaris continued settling into life as a graduated Apache top-level project.

Apache Iceberg

The Iceberg community spent these weeks in full summit preparation mode. The Iceberg Summit 2026 on April 8–9 at the Marriott Marquis in San Francisco is now days away, and logistics threads accelerated as the selection committee finalized the speaker lineup and agenda. Bloomberg's Sung Yun organized a Pre-Summit Community Meetup for April 7 at Bloomberg's San Francisco office, with lightning talks and networking, a sign of the grassroots energy surrounding the event. Danica Fine voiced enthusiasm for the meetup as a way to ease into the summit.

On the governance side, the remote signing endpoint vote passed during this period, with Alexandre Dutra calling a second attempt that drew binding +1s from Eduard Tudenhöfner and Dmitri Bourlatchkov, along with non-binding support from Christian Thiel. Promoting the remote signing endpoint to the main REST spec is a meaningful step for credential management in multi-cloud Iceberg deployments.

The AI contribution guidelines debate continued to draw community input from Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun. The conversation is converging on practical guardrails: disclosure requirements for AI-generated code, review standards, and how to handle contributions where code provenance is unclear. This policy is likely to be discussed in person at the Summit.

Péter Váry's efficient column updates proposal for wide tables, targeting ML feature stores and vector databases with thousands of columns, moved through multiple community syncs. The core idea of writing only updated columns to separate files and stitching them at read time would dramatically reduce write amplification for AI workloads. Anurag Mantripragada and Gábor Herman have submitted an Iceberg Summit talk on the topic, signaling that this will be a headline discussion at the conference.
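To make the read-time stitching idea concrete, here is a minimal sketch in plain Python. It is not the proposal's actual design or any Iceberg API: the `stitch_columns` helper, the row-position keys, and the feature column names are all hypothetical, chosen only to show how a file holding just the updated columns could be overlaid on unchanged base rows.

```python
from typing import Dict, List

Row = Dict[str, object]


def stitch_columns(base: Dict[int, Row], updates: List[Dict[int, Row]]) -> Dict[int, Row]:
    """Overlay column-update files on base rows, keyed by row position.

    Hypothetical helper: later update files win over earlier ones, and
    columns that no update touches are read from the base file as-is.
    """
    # Copy the base rows so the original "file" is never modified.
    result = {pos: dict(row) for pos, row in base.items()}
    for update_file in updates:
        for pos, changed_columns in update_file.items():
            if pos in result:
                result[pos].update(changed_columns)
    return result


# Base data file: every column for every row (illustrative feature table).
base = {
    0: {"user_id": 1, "f_clicks": 10, "f_spend": 2.5},
    1: {"user_id": 2, "f_clicks": 7, "f_spend": 0.0},
}

# Update file: only the one column that changed, not the whole row.
update_1 = {0: {"f_clicks": 11}, 1: {"f_clicks": 9}}

stitched = stitch_columns(base, [update_1])
```

In this toy version the writer emits one small column instead of rewriting every row in full, which is where the reduction in write amplification would come from; the cost moves to the reader, which has to merge the files.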

The iceberg-rust 0.9.0 release shipped during this period, covering development from early January through early March and representing 109 merged PRs from 28 contributors, including 8 new contributors. The Rust implementation has been shipping at a rapid cadence (this is the fourth Rust release in six months), and its DataFusion integration is making it a serious alternative for teams that want Iceberg without a JVM dependency.

Apache Polaris

Polaris continued settling into its role as an independent top-level Apache project following its February 18 graduation. Jean-Baptiste Onofré circulated the project's first board report as a TLP covering the March 26 board meeting, a governance milestone documenting community health, development progress, and strategic direction under Polaris's own PMC.

Selvamohan Neethiraj opened an RFC for an Apache Ranger authorization plugin, proposing enterprise-grade policy integration for Polaris. The design allows organizations already running Ranger alongside Hive, Spark, and Trino to manage Polaris security within their existing governance framework, addressing policy duplication and role explosion through Ranger's attribute-based access control. The plugin is opt-in and preserves backward compatibility with Polaris's existing internal authorization.

The catalog federation discussion that launched in the prior period continued this cycle. The design would allow Polaris to federate across multiple catalog instances in multi-cloud deployments, a capability that enterprise users running Iceberg across AWS, Azure, and Google Cloud have been requesting. The 1.4.0 release, which will be Polaris's first release as a graduated project, remains in active planning with credential vending for Azure and GCS backends as the headlining feature.

Like Iceberg, Polaris is navigating the AI contribution guidelines question. The two communities are likely to produce coordinated policies, reflecting their shared contributor base and overlapping governance sensibilities.

Apache Arrow

Arrow's dev list focused on release engineering and Java modernization during this period. The Arrow Go v18.5.2 patch release had shipped on March 4, shortly before this window opened, covering 16 commits from 6 contributors with fixes from Matt Topol for binary builder handling, data page decryption, large string writes, and a performance improvement reducing object allocations. The arrow-rs project shipped version 58.1.0 in March with no breaking API changes, and version 58.2.0 is scheduled for April.

The JDK 17 migration discussion that JB Onofré initiated continued drawing community input. If Arrow Java 20.0.0 sets JDK 17 as the minimum, it would align Arrow with Iceberg's Java modernization trajectory, effectively raising the Java floor for the entire lakehouse stack in a coordinated move.

Nic Crane's thread on using LLMs to aid project maintenance continued generating discussion, with Arrow's framing centering on how maintainers themselves can use AI tools to manage the project's growing codebase and issue backlog. Sutou Kouhei's Map type key/item/value field names thread drew continued engagement from Micah Kornfield and Antoine Pitrou, working through naming consistency across language implementations. Google Summer of Code 2026 student proposals also arrived during this window, with interest in compute kernels and language bindings.

Apache Parquet

Parquet's community held its bi-weekly sync during this period and continued active technical discussions. The File logical type proposal remained the project's most consequential design thread. The proposal would allow Parquet files to natively represent unstructured data (such as images, PDFs, and audio) inside columnar files. If adopted, it would expand Parquet's role from a purely analytical format to a hybrid that can manage the unstructured data AI/ML pipelines generate alongside the structured features they consume.

The Variant type that shipped in February continued to see adoption discussion, with contributors sharing integration experiences across Spark, Trino, and Dremio. Variant brings native semi-structured data support to Parquet, eliminating the need to store JSON strings in regular columns. Combined with the File logical type proposal, Parquet is rapidly expanding its type system to handle the diverse data shapes modern analytics and AI workloads demand.

The ALP (Adaptive Lossless floating-Point) encoding spec completed its final review during this period, with the formal acceptance vote expected imminently. ALP is expected to improve compression ratios for floating-point data that originated as decimal values, a direct benefit for scientific computing and ML feature stores where float-heavy columns dominate.
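For readers unfamiliar with ALP, the sketch below illustrates the general idea behind it: many real-world doubles are really short decimals, so they can be scaled by a power of ten into integers that round-trip exactly and compress well. This is a deliberately simplified illustration, not the Parquet spec's encoding. It assumes a single exponent per batch, skips ALP's exception handling and bit-packing, and the function names are made up for this example.

```python
from typing import List, Optional, Tuple


def find_exponent(values: List[float], max_exp: int = 10) -> Optional[int]:
    """Return the smallest e where every value survives the round trip
    value -> round(value * 10**e) -> integer / 10**e, or None if none does."""
    for e in range(max_exp + 1):
        scale = 10 ** e
        if all(round(v * scale) / scale == v for v in values):
            return e
    return None


def encode(values: List[float]) -> Tuple[int, int, List[int]]:
    """Encode floats as (exponent, base, small non-negative deltas)."""
    if not values:
        raise ValueError("nothing to encode")
    e = find_exponent(values)
    if e is None:
        # Real ALP stores such values as exceptions; this sketch just gives up.
        raise ValueError("values are not losslessly representable as scaled integers")
    ints = [round(v * 10 ** e) for v in values]
    base = min(ints)  # frame of reference: store offsets from the minimum
    return e, base, [i - base for i in ints]


def decode(e: int, base: int, deltas: List[int]) -> List[float]:
    """Invert encode(): add the base back and divide by the scale."""
    scale = 10 ** e
    return [(d + base) / scale for d in deltas]


prices = [19.99, 20.01, 20.5, 21.0]
exponent, base, deltas = encode(prices)
restored = decode(exponent, base, deltas)
```

Here the four 64-bit doubles become the small integers 0, 2, 51, and 101 plus two header values, and decoding reproduces the inputs bit for bit. A production encoder would bit-pack those deltas and fall back per value when the round trip fails.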

Cross-Project Themes

The AI contribution policy conversation is the dominant cross-project theme of this period. Iceberg, Polaris, and Arrow are all grappling with the same question from different angles: Iceberg and Polaris are focused on contributor-side disclosure and review standards for AI-generated code, while Arrow is exploring how maintainers can responsibly use AI for project upkeep. With many of the same people participating across lists, coordinated policies are likely to emerge, and the Iceberg Summit offers a natural venue for working through the open questions in person.

The second theme is format scope expansion. Parquet's File logical type and Variant type, Iceberg's efficient column updates for wide ML tables, and Polaris's Ranger integration and catalog federation all point in the same direction: the open lakehouse is being asked to handle workloads far beyond traditional analytics. The stack is evolving into a unified platform for structured analytics, semi-structured data, unstructured files, and AI/ML feature engineering, all governed through a single open catalog.

Looking Ahead

The Iceberg Summit on April 8–9 is the event to watch. Expect in-person discussions on AI contribution guidelines, efficient column updates, and V4 design direction. On the release side, watch for the Iceberg 1.10.2 patch release vote to open, the Polaris 1.4.0 release planning to finalize, and the Parquet ALP encoding vote to close. Arrow's 24.0.0 release cycle planning should also begin to take shape.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free — Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow — Learn how Dremio brings the open lakehouse stack together

Free Downloads