Get Data Lakehouse Books:
- Apache Iceberg: The Definitive Guide
- Apache Polaris: The Definitive Guide
- Architecting an Apache Iceberg Lakehouse
- The Apache Iceberg Digest: Vol. 1
Lakehouse Community:
- Join the Data Lakehouse Community
- Data Lakehouse Blog Roll
- OSS Community Listings
- Dremio Lakehouse Developer Hub
This week marks a turning point for the open lakehouse stack. Polaris graduated from the Apache Incubator. Iceberg pushed deeper into V4 planning with a Warsaw community meetup on February 18. Arrow focused on IPC stream design. Parquet advanced its ALP encoding spec. Governance maturity and format evolution were the twin themes across all four projects.
Apache Iceberg
The Iceberg dev list stayed active this week with continued threads from the Feb 4–11 cycle, plus new community activity ahead of the Warsaw meetup on February 18.
metadata.json in V4 — Continued Discussion
Anton Okolnychyi's thread on making the root metadata JSON file optional in V4 continued drawing responses. The core problem remains unchanged: writing metadata.json on every commit hurts streaming write performance for HMS and Hadoop catalog users. Two paths are under discussion: letting catalogs skip writing the file, or offloading parts of it to external files. Portability concerns from Yufei Gu are keeping the debate careful and thorough. This discussion will shape how fast-commit workloads work in the next major format version. (Thread)
AI Contribution Guidelines Vote Progresses
Junwang Zhao's vote to formalize guidelines for AI-generated contributions entered its final window this week. Multiple binding +1 votes are on record. This move gives maintainers a clear reference when evaluating PRs with AI-authored code, a growing reality in open source. The Iceberg community is ahead of most projects in addressing this issue directly. (Thread)
OAuth2 and Trino Multi-Tenancy Question
A new community question from Sander Bylemans highlighted a gap in the REST catalog OAuth2 implementation. He wants Trino to pass a JWT to an Iceberg catalog for true multi-tenancy, but the current implementation expects a static token or basic credential. Alex Merced pointed to existing Trino documentation and open issues. This thread reflects real production friction as organizations move toward dynamic, role-based credential flows. (Thread)
Warsaw Community Meetup — February 18
The Apache Iceberg Europe Community Meetup lands in Warsaw on February 18, hosted at Google's Warsaw Hub office with ClickHouse co-hosting. Registration is open at luma.com. The call for speakers remains open for production use cases and migration stories.
Snapshot Expiration Race Condition
Krutika Dhananjay flagged a concurrency bug in snapshot expiration logic. A race window exists between when ExpireSnapshots computes candidate snapshots and when the commit runs. A concurrent ref addition during that window can lead the maintenance job to remove a live snapshot. The iceberg-go project already has a fix. Amogh Jahagirdar asked for a reproducible test case before confirming the issue exists in the Java implementation. (Thread)
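To make the race window concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not Iceberg code: `ToyTable`, `compute_candidates`, and `commit_expiration` are hypothetical names, the model ignores snapshot ancestry and retention settings, and the re-validation step is one possible mitigation rather than the fix adopted in iceberg-go or proposed for Java.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for table metadata (not an Iceberg API)."""
    snapshots: set = field(default_factory=set)
    refs: dict = field(default_factory=dict)  # ref name -> snapshot id
    version: int = 0                          # bumped on every metadata change

    def add_ref(self, name, snapshot_id):
        self.refs[name] = snapshot_id
        self.version += 1


def compute_candidates(table):
    """Planning step: snapshots no ref points at, plus the version they were planned at."""
    live = set(table.refs.values())
    return table.snapshots - live, table.version


def commit_expiration(table, candidates, seen_version):
    """Commit step: if the table changed since planning, drop candidates that are now live."""
    if table.version != seen_version:
        candidates = candidates - set(table.refs.values())
    table.snapshots -= candidates
    table.version += 1
    return candidates


table = ToyTable(snapshots={1, 2, 3}, refs={"main": 3})
candidates, seen = compute_candidates(table)  # snapshots 1 and 2 look expirable
table.add_ref("audit", 2)                     # concurrent writer adds a ref to snapshot 2
removed = commit_expiration(table, candidates, seen)
```

Without the version check in `commit_expiration`, the stale candidate set would still contain snapshot 2 and the job would delete a snapshot that a ref now points to.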
Apache Polaris
Graduation Vote Passes — Polaris Becomes a Top-Level Project
The headline for this week and the biggest news in the lakehouse community: Apache Polaris passed its IPMC graduation vote. Russell Spitzer submitted the formal vote on February 3 after the PPMC consensus round received 27 +1 votes. Jean-Baptiste Onofré, the incoming PMC Chair, highlighted six releases (0.9 through 1.3.0), more than 100 contributors, and 2,819 merged PRs as evidence of project maturity. Polaris is now a top-level Apache project. (Vote Thread)
This milestone matters for production teams. A top-level project carries stronger governance guarantees. The ASF board now directly oversees Polaris. Organizations building catalog infrastructure on Polaris can treat it as a stable, independently governed project.
S3 Credential Vending Without STS
A Backblaze engineer joined the ongoing thread on vending S3-compatible credentials in environments without AWS STS. Two options remain on the table: passing the same credentials Polaris uses internally, or managing a separate client credential pair. The Backblaze contributor leaned toward an S3 signing approach that would work with non-AWS storage. This thread signals that the community is actively thinking about storage-agnostic credential flows, which matters for organizations using Cloudflare R2, Backblaze B2, and other S3-compatible object stores. (Thread)
Apache Arrow
Arrow 23.0.0 Released
Apache Arrow 23.0.0 shipped on January 27 with 336 resolved issues. The release spans C++, Python, Java, R, and Go bindings. This is the latest stable release and the first major version of 2026. (Thread)
Arrow Rust 57.3.0 Patch Release
Andrew Lamb proposed and completed the vote for Arrow Rust 57.3.0-rc1 during the Feb 4–11 window. The RC was cleaned up by February 6. Patch releases like this keep the Rust library current for projects like DataFusion and others building on Arrow's Rust compute layer. (Thread)
IPC Stream Multiplexing Design Thread
Rusty Conover pushed back on suggestions to use QUIC for IPC stream multiplexing. His use case requires explicit ordering across batches from different logical streams, not just independent delivery. This is a format-level design question about how to interleave Arrow schemas in a single IPC channel. The thread remains open and technical. It could affect how engines like DuckDB and DataFusion exchange data between pipeline stages. (Thread)
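The ordering requirement can be illustrated with a small Python sketch. Everything here is hypothetical: the `Frame` type, the global `seq` field, and the `multiplex`/`demultiplex` helpers are illustrative assumptions, not part of the Arrow IPC format and not a design proposed in the thread. The point is only that a single channel with an explicit cross-stream sequence number preserves a total order that independently delivered streams do not.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    seq: int        # global sequence number across all logical streams (assumed field)
    stream_id: str  # logical stream the batch belongs to
    payload: bytes  # stand-in for an encoded record batch


def multiplex(streams):
    """Merge per-stream frame lists (each already sorted by seq) into one ordered channel."""
    return list(heapq.merge(*streams.values(), key=lambda f: f.seq))


def demultiplex(channel):
    """Split the channel back into per-stream lists, keeping arrival order."""
    out = {}
    for frame in channel:
        out.setdefault(frame.stream_id, []).append(frame)
    return out


streams = {
    "orders": [Frame(0, "orders", b"a"), Frame(3, "orders", b"d")],
    "payments": [Frame(1, "payments", b"b"), Frame(2, "payments", b"c")],
}
channel = multiplex(streams)
per_stream = demultiplex(channel)
```

A consumer reading `channel` sees batches in the producer's intended cross-stream order, which is the property that independent per-stream delivery cannot guarantee on its own.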
Arrow Formal Security Model Published
The Arrow PMC published a formal security model on February 5. The documentation covers how Arrow handles security across its C++, Java, Rust, and Python libraries. This kind of formal security posture is increasingly important as Arrow becomes load-bearing infrastructure in production analytics stacks. (Commit)
Google Summer of Code 2026 Interest
A student named Prasanna expressed interest in contributing to Arrow through GSoC 2026. Arrow is among the Apache projects accepting mentors and contributors for the 2026 program. Data engineers interested in mentoring or contributing should watch the dev list for formal program announcements.
Apache Parquet
Parquet Java 1.17.0 in Production
Parquet Java 1.17.0, released on January 13, is now moving through broader adoption. This version drops Java 8 support and sets Java 11 as the new minimum. Teams still running Java 8 JVMs need a migration plan. Projects like Iceberg, Trino, and Spark all depend on Parquet Java for their core read and write paths. (Thread)
ALP Encoding Spec Advancing
The Adaptive Lossless floating-Point (ALP) encoding spec continued moving through review. Contributors discussed whether the spec should land as a PR against the parquet-format repository. Julien Le Dem asked for remaining reviewers to comment before the spec can be finalized. ALP encoding targets floating-point columns specifically and can reduce file size significantly for scientific and financial data. (Thread)
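For readers unfamiliar with the idea behind ALP, the following Python sketch shows a heavily simplified version of it: many real-world doubles are short decimals, so scaling by a power of ten turns them into small integers that compress well, and any value that does not round-trip exactly is kept aside as an exception. This is an illustration only, not the Parquet spec under review. The function names, the single-exponent search, and the `max_exponent` default are assumptions, and inputs are assumed to be a non-empty list of finite floats of moderate magnitude.

```python
def alp_encode(values, max_exponent=14):
    """Pick the decimal exponent with the fewest exceptions, then subtract a base value."""
    best = None
    for e in range(max_exponent + 1):
        factor = 10.0 ** e
        ints, exceptions = [], {}
        for i, v in enumerate(values):
            n = round(v * factor)
            if n / factor == v:      # exact round-trip: safe to store as an integer
                ints.append(n)
            else:                    # otherwise keep the original double aside
                ints.append(0)
                exceptions[i] = v
        if best is None or len(exceptions) < len(best[2]):
            best = (e, ints, exceptions)
    e, ints, exceptions = best
    base = min(ints)                 # frame of reference: store small offsets from the minimum
    return e, base, [n - base for n in ints], exceptions


def alp_decode(e, base, deltas, exceptions):
    """Invert alp_encode: rebuild the integers, rescale, then patch the exceptions back in."""
    factor = 10.0 ** e
    out = [(d + base) / factor for d in deltas]
    for i, v in exceptions.items():
        out[i] = v
    return out


values = [1.5, 2.25, 100.0]
e, base, deltas, exceptions = alp_encode(values)
restored = alp_decode(e, base, deltas, exceptions)
```

The small non-negative offsets in `deltas` are what a bit-packing stage would then compress. The sketch stops before that step.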
Geospatial Blog Post in Review
Andrew Lamb opened a PR for a geospatial blog post on the Parquet website. A reviewer noted that column statistics for geospatial types are not yet integrated in all engines. The Rust Parquet implementation and SedonaDB already handle geospatial stats. The community wants the implementation status page updated before the post lands. Geospatial support in Parquet reflects growing demand from location-aware analytics workloads. (Thread)
file_path Field Deprecated
Micah Kornfield confirmed the PR deprecating the file_path field in column chunk metadata is ready to merge. The change discourages use of external column references in favor of table-level file handling. This simplifies Parquet's file scope and makes it easier for query engines to reason about file boundaries. (Thread)
New PMC Member: Andrew Lamb
Julien Le Dem announced that Andrew Lamb joined the Parquet PMC in late January. Lamb has been a central figure in governance discussions across the lakehouse ecosystem. His PMC role formalizes an influence he has already been exercising. (Thread)
Cross-Project Themes
Two themes define this week. The first is graduation and governance. Polaris is now a top-level Apache project. Iceberg is formalizing AI contribution rules. Parquet added a new PMC member. Arrow published a formal security model. These projects are not just technically mature. They are organizationally mature.
The second theme is format evolution for AI and streaming workloads. Iceberg's V4 metadata discussions, Polaris's credential vending work, and Parquet's ALP encoding all point in the same direction. The lakehouse stack is adapting to support wide-table ML workloads, real-time commits, and non-AWS storage environments. The community is not waiting for a big release to do this. It is doing it thread by thread.
Looking Ahead
Watch for the first post-graduation Polaris PMC activity. The Iceberg AI contribution guidelines vote will likely close this week. The Warsaw Iceberg meetup on February 18 may surface new production use cases. On the Parquet side, look for the geospatial blog post and the ALP encoding spec to land soon.