The open lakehouse community gathered in San Francisco this week for the biggest Iceberg Summit yet: two full in-person days at the Marriott Marquis. Meanwhile, Arrow's release engineering hummed along, Polaris settled into its first full month as a top-level project, and Parquet's ALP encoding vote moved toward a close. The summit didn't just celebrate what the community has built; it provided the forum to hash out the debates that have defined the dev lists all spring.
Apache Iceberg
Iceberg Summit 2026, the third edition of the ASF-sanctioned gathering, ran April 8–9 at San Francisco's Marriott Marquis, growing to two full in-person days after last year's sold-out success drew nearly 500 attendees. The community warmed up at a Pre-Summit Meetup hosted by Bloomberg's Engineering Department on April 7, organized by Sung Yun, with lightning talks and networking before the main event. Speakers from Apple, Bloomberg, Pinterest, Wells Fargo, and contributors from across the vendor ecosystem took the stage, making this the most industry-spanning Iceberg event to date.
The V4 design direction was front and center. Ryan Blue and the core contributor group have spent months laying groundwork through the dev list, and the summit provided an in-person venue to align on what V4 will actually look like in practice. The metadata.json optionality thread — asking whether the root JSON file can be made optional when a catalog manages metadata state — drew contributions from Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu, debating portability concerns and the implications for static tables and for Spark driver behavior. The one-file commits discussion that Russell Spitzer and Amogh Jahagirdar advanced across multiple proposals is similarly headed toward resolution, promising dramatic reductions in commit latency and metadata storage footprint.
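To see why single-file commits matter, it helps to picture today's commit path: every commit writes a brand-new metadata.json and atomically swaps the catalog's pointer to it, so the metadata file chain grows with every commit. The toy model below sketches that compare-and-swap step; all names are hypothetical illustrations, not Iceberg APIs.

```python
# Toy model of Iceberg's current commit protocol: each commit produces a
# fresh metadata.json, and the catalog pointer is moved with an atomic
# compare-and-swap. Hypothetical names, not actual Iceberg code.

class Catalog:
    """Tracks the current metadata location for each table."""
    def __init__(self):
        self._pointers = {}

    def swap(self, table, expected, new_location):
        # Atomic compare-and-swap: a commit only succeeds if nobody else
        # committed since this writer read `expected`.
        if self._pointers.get(table) != expected:
            return False  # a concurrent commit won; caller must retry
        self._pointers[table] = new_location
        return True

catalog = Catalog()

# First commit: the pointer moves from nothing to v1.
assert catalog.swap("db.events", None, "s3://bucket/metadata/v1.json")

# A writer that read v1 commits v2 successfully...
assert catalog.swap("db.events", "s3://bucket/metadata/v1.json",
                    "s3://bucket/metadata/v2.json")

# ...but a stale writer still holding v1 is rejected and must retry.
assert not catalog.swap("db.events", "s3://bucket/metadata/v1.json",
                        "s3://bucket/metadata/v3.json")
```

Every successful swap in this model implies one more metadata file written to storage, which is the per-commit overhead the one-file commits work aims to reduce.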
The AI contribution guidelines debate, which pulled in Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun on the dev list over the preceding weeks, was a natural candidate for in-person resolution at the summit. The community has been converging on disclosure requirements and code provenance standards for AI-generated contributions; with many of the same contributors in the same room, a working policy is likely to emerge from this week's discussions.
Péter Váry's efficient column updates proposal, targeting AI/ML workloads with wide tables that need to update embeddings and model scores without rewriting entire rows, was among the talks submitted to the summit program. The approach, which writes only the updated columns to separate files and stitches them at read time, addresses a real pain point for teams managing petabyte-scale feature stores on Iceberg. Watch for a formal proposal and POC benchmarks to land on the dev list in the days following the summit.
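The stitch-at-read-time idea can be sketched in a few lines: a base file holds full rows, a delta file holds only the rewritten column keyed by row position, and the reader overlays the two. This is an illustrative simplification under assumed shapes, not the file format in the actual proposal.

```python
# Minimal sketch of read-time stitching for column-level updates.
# Base rows stay untouched; a delta carries only the changed column.

base = [  # full rows, as a base data file would store them
    {"id": 1, "feature": 0.10, "embedding": [0.0, 0.0]},
    {"id": 2, "feature": 0.20, "embedding": [0.0, 0.0]},
]

# A column update writes ONLY the changed column, keyed by row position,
# instead of rewriting every row in the file.
embedding_delta = {0: [0.9, 0.1], 1: [0.8, 0.2]}

def stitch(rows, column, delta):
    """Overlay per-position column values from a delta file onto base rows."""
    return [
        {**row, column: delta.get(pos, row[column])}
        for pos, row in enumerate(rows)
    ]

rows = stitch(base, "embedding", embedding_delta)
assert rows[0] == {"id": 1, "feature": 0.10, "embedding": [0.9, 0.1]}
```

For a wide table with hundreds of feature columns, writing one narrow delta file instead of rewriting every row is where the savings come from.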
Apache Polaris
Polaris spent this week in its first full month of life as a graduated Apache top-level project after the February 18 graduation. Jean-Baptiste Onofré filed the project's first board report as a TLP at the March 26 ASF board meeting, documenting community health and strategic direction under Polaris's own PMC, a governance milestone that marks the project's full independence.
The Apache Ranger authorization RFC from Selvamohan Neethiraj continued drawing feedback this week. The design allows organizations already running Ranger alongside Hive, Spark, and Trino to manage Polaris security within a unified governance framework, eliminating the policy duplication and role explosion that arise when teams bolt separate authorization systems onto each engine. The plugin design is opt-in and backward compatible with Polaris's existing internal authorization layer, a thoughtful approach that should lower the adoption barrier for enterprises evaluating Polaris in regulated environments.
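The opt-in, backward-compatible shape described above can be pictured as a simple dispatch: route authorization checks to an external plugin when one is configured, otherwise fall back to the built-in policy engine. Every name below is hypothetical; this is a sketch of the pattern, not the Polaris or Ranger API.

```python
# Sketch of an opt-in authorization plugin: an external authorizer
# (standing in for a Ranger plugin) takes over only when configured,
# so existing deployments keep the internal layer untouched.

class InternalAuthorizer:
    """Stand-in for Polaris's built-in authorization layer."""
    def __init__(self, grants):
        self._grants = grants  # set of (principal, action, resource)

    def allows(self, principal, action, resource):
        return (principal, action, resource) in self._grants

class RangerStyleAuthorizer:
    """Stand-in for a plugin consulting an external policy service."""
    def __init__(self, policies):
        self._policies = set(policies)

    def allows(self, principal, action, resource):
        return (principal, action, resource) in self._policies

def authorize(request, plugin=None, internal=None):
    # Opt-in semantics: dispatch to the plugin only if one is configured.
    engine = plugin if plugin is not None else internal
    return engine.allows(*request)

internal = InternalAuthorizer({("analyst", "read", "catalog.sales")})
assert authorize(("analyst", "read", "catalog.sales"), internal=internal)

ranger = RangerStyleAuthorizer([("analyst", "read", "catalog.sales")])
assert authorize(("analyst", "read", "catalog.sales"), plugin=ranger)
```

The appeal for Ranger shops is that the same policy store already governing Hive, Spark, and Trino answers the plugin's checks, so no rules are duplicated per engine.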
The 1.4.0 release, which will be Polaris's first as a graduated project, remains in active planning. Credential vending for Azure and Google Cloud Storage backends is the headline feature in the release cycle. The catalog federation design, allowing Polaris to serve as a front for multiple catalog backends in multi-cloud deployments, is also advancing, addressing the needs of large enterprises running Iceberg tables across AWS, Azure, and GCS simultaneously. With Polaris now holding its own dev list, JIRA, and PMC, expect the release velocity to accelerate now that the project is no longer navigating incubator overhead.
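The federation idea reduces to routing: a front catalog maps namespace prefixes to backend catalogs so one endpoint can front tables spread across clouds. The sketch below uses invented names to illustrate the routing step only; it is not the actual Polaris federation design.

```python
# Toy illustration of catalog federation: route namespace prefixes to
# backend catalogs behind a single front door. Hypothetical names only.

class FederatedCatalog:
    def __init__(self, routes, default):
        self._routes = routes      # namespace prefix -> backend name
        self._default = default    # where unrouted tables resolve

    def resolve(self, table_identifier):
        """Pick the backend catalog for e.g. 'aws_prod.sales.orders'."""
        namespace = table_identifier.split(".", 1)[0]
        return self._routes.get(namespace, self._default)

catalog = FederatedCatalog(
    routes={"aws_prod": "glue-backend", "azure_prod": "adls-backend"},
    default="polaris-internal",
)
assert catalog.resolve("aws_prod.sales.orders") == "glue-backend"
assert catalog.resolve("azure_prod.hr.people") == "adls-backend"
assert catalog.resolve("scratch.tmp") == "polaris-internal"
```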
Apache Arrow
Arrow's engineering focus this week centered on release preparation and language-binding consistency. The arrow-rs 58.2.0 release was scheduled for April, following the 58.1.0 shipment in March, which arrived with no breaking API changes. The Rust implementation has become one of the most actively maintained parts of the Arrow ecosystem, with a release cadence that matches the project's growing adoption in data lakehouse query engines.
The JDK 17 minimum version discussion that Jean-Baptiste Onofré launched continued gaining traction. Setting JDK 17 as the floor for Arrow Java 20.0.0 would coordinate Arrow's modernization trajectory with Iceberg's own Java upgrade timeline, effectively raising the Java minimum across the entire lakehouse stack in a single coordinated move. Contributors including Micah Kornfield and Antoine Pitrou have been weighing in, and the decision is expected to crystallize before the 20.0.0 release cycle formally opens.
Nic Crane's thread on using LLMs to aid Arrow's project maintenance, framing AI tools as a resource for the maintainers themselves rather than just contributors, continued generating discussion. The Arrow community's angle is slightly different from Iceberg's: less about contribution disclosure policy and more about how a lean maintainer group can responsibly use AI to triage a growing issue backlog. Google Summer of Code 2026 student proposals also arrived this week, with interest concentrated in compute kernels and language bindings for Go and Swift.
Apache Parquet
Parquet's week was defined by two major technical milestones reaching final stages. The ALP (Adaptive Lossless floating-Point) encoding specification completed its review period, and the formal acceptance vote was expected to close this week. ALP delivers significantly better compression ratios for floating-point data by losslessly representing doubles as scaled integers, which compress far better than raw IEEE 754 bits — a direct performance benefit for ML feature stores and scientific computing workloads where float-heavy columns dominate. The encoding has been the subject of months of careful review, and its acceptance marks one of the most meaningful additions to the Parquet specification in recent memory.
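The core trick can be shown in miniature: many real-world doubles are decimals in disguise, so a whole vector can often be stored as small integers plus one decimal exponent, with an exactness check guaranteeing losslessness. This is a toy of the underlying idea only, not the actual ALP algorithm or the Parquet encoding.

```python
# Toy version of ALP-style float compression: find a decimal exponent
# that turns every value into an exact integer, store the integers.

def encode(values, max_exponent=10):
    """Find one decimal exponent that makes every value an exact int."""
    for e in range(max_exponent + 1):
        scaled = [round(v * 10**e) for v in values]
        # Lossless check: scaling back must reproduce the input exactly.
        if all(s / 10**e == v for s, v in zip(scaled, values)):
            return e, scaled  # small ints compress far better than doubles
    return None  # a real encoder falls back to another scheme here

def decode(exponent, integers):
    return [i / 10**exponent for i in integers]

values = [1.25, 3.5, 0.75, 12.0]
e, ints = encode(values)
assert e == 2 and ints == [125, 350, 75, 1200]
assert decode(e, ints) == values
```

Once the column is integers, standard integer tricks (frame-of-reference, bit-packing) finish the job — which is why float-heavy ML and scientific columns benefit so much.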
The Variant type that shipped in February continued to see integration discussion across engine teams. Spark, Trino, and Dremio contributors compared notes on their implementation experiences, working through edge cases in semi-structured data handling that the spec leaves partially open to interpretation. Getting these implementations to converge is critical: Parquet's value as a cross-engine format depends on consistent behavior across the ecosystem, and Variant is novel enough that divergence is a real risk.
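The interoperability stakes are easier to see with a stripped-down model of what a variant value is: a self-describing blob whose leading type tag tells the reader how to interpret the bytes that follow. The encoding below is deliberately simplified and is NOT the actual Parquet Variant binary layout; it only illustrates why every engine must agree on the same byte-level rules.

```python
# Highly simplified type-tagged value encoding, illustrating the shape of
# a variant: one tag byte, then a type-specific payload.
import json
import struct

TAGS = {"null": 0, "int": 1, "float": 2, "string": 3, "object": 4}

def encode_variant(value):
    if value is None:
        return bytes([TAGS["null"]])
    if isinstance(value, bool):   # bool is an int subclass; keep toy honest
        raise TypeError("toy encoder: no bool tag")
    if isinstance(value, int):
        return bytes([TAGS["int"]]) + struct.pack("<q", value)
    if isinstance(value, float):
        return bytes([TAGS["float"]]) + struct.pack("<d", value)
    if isinstance(value, str):
        return bytes([TAGS["string"]]) + value.encode("utf-8")
    if isinstance(value, dict):
        return bytes([TAGS["object"]]) + json.dumps(value).encode("utf-8")
    raise TypeError(f"unsupported type: {type(value)}")

def decode_variant(buf):
    tag, body = buf[0], buf[1:]
    if tag == TAGS["null"]:
        return None
    if tag == TAGS["int"]:
        return struct.unpack("<q", body)[0]
    if tag == TAGS["float"]:
        return struct.unpack("<d", body)[0]
    if tag == TAGS["string"]:
        return body.decode("utf-8")
    return json.loads(body)

for v in [None, 42, 2.5, "hi", {"a": 1}]:
    assert decode_variant(encode_variant(v)) == v
```

If two engines disagree on even one tag or payload rule in a scheme like this, files written by one become unreadable by the other — which is exactly the divergence risk the Spark, Trino, and Dremio teams are working to close.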
The File logical type proposal, which would allow Parquet files to natively embed unstructured data like images, PDFs, and audio as columnar records, advanced through community discussion this week. Combined with Variant, the proposal signals a deliberate effort to evolve Parquet from a purely analytical format into a unified storage layer capable of managing the diverse data shapes that AI/ML pipelines produce alongside the structured features they consume.
Cross-Project Themes
The Iceberg Summit was, by design, where the open lakehouse community takes stock and sets direction. The threads that dominated all four dev lists in the months leading up to it all converged in San Francisco this week: AI contribution policies, V4 metadata design, column-level updates for ML workloads, and Polaris's enterprise integration roadmap. What happens on the dev lists in the next two to three weeks will reflect what was decided in person, and readers should expect a burst of formal proposals, updated design documents, and new voting threads as the summit's in-person alignment translates back into async collaboration.
The second theme running beneath all four projects is the expansion of format scope to meet AI workload demands. Parquet's ALP and Variant additions, Iceberg's efficient column updates for wide ML tables, Polaris's Ranger and federation work, and Arrow's modernization to JDK 17 are all responses to the same underlying pressure: the lakehouse stack is being asked to power AI/ML pipelines, not just analytical queries. The projects are evolving in coordination, and the pace of that evolution is accelerating.
Looking Ahead
Post-summit, watch the Iceberg dev list for formal proposals on V4 metadata optionality and single-file commits, along with a published AI contribution policy. The Parquet ALP vote result should arrive within days. Polaris 1.4.0 scope finalization and the Arrow 20.0.0 JDK decision are the other near-term milestones to track. If the summit follows the pattern of 2025's event, the community will also release session recordings on YouTube in the weeks that follow, an excellent resource for anyone who couldn't make it to San Francisco.
Resources & Further Learning
Get Started with Dremio
- Try Dremio Free — Build your lakehouse on Iceberg with a free trial
- Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow — Learn how Dremio brings the open lakehouse stack together
Free Downloads
- Apache Iceberg: The Definitive Guide — O'Reilly book, free download
- Apache Polaris: The Definitive Guide — O'Reilly book, free download
Books by Alex Merced