Get Data Lakehouse Books:
- Apache Iceberg: The Definitive Guide
- Apache Polaris: The Definitive Guide
- Architecting an Apache Iceberg Lakehouse
- The Apache Iceberg Digest: Vol. 1
Lakehouse Community:
- Join the Data Lakehouse Community
- Data Lakehouse Blog Roll
- OSS Community Listings
- Dremio Lakehouse Developer Hub
The past two weeks have seen active development and discussion across the mailing lists of Apache Iceberg, Polaris, Arrow, and Parquet. This digest highlights notable design discussions, release planning, and community updates from each project—so you can stay on top of the lakehouse ecosystem’s evolution.
Apache Iceberg
☕ Java 17 Minimum Requirement
Jean-Baptiste Onofré proposed raising the minimum supported Java version for Iceberg to JDK 17. This proposal received widespread agreement and is expected to move forward.
- Why it matters: Enables modern Java features and aligns Iceberg with other cutting-edge data infrastructure projects.
- Impact: Users still on Java 11 should begin preparing for an upgrade.
Format V4: Indexing and Commit Optimizations
Design conversations around Iceberg Format V4 continued with emphasis on two areas:
- Native Indexing Support: A revived discussion explored integrating indexing directly into the Iceberg table format to support faster lookups.
- One-File Commit Proposal: This discussion began earlier and wrapped up in this period. The aim is to reduce commit overhead by consolidating manifests into a single file.
REST Catalog Enhancements
Several key REST Catalog improvements were discussed and voted on:
- ETag Support: Added to CommitTableResponse for optimistic concurrency.
- Idempotency Keys: Introduced to safely retry REST operations.
- HTTP 429 Standardization: Formalized handling of rate limiting.
- Storage Credentials in Planning Responses: Enables catalogs to return temporary credentials for secure data access.
ETag vote thread
Idempotency keys discussion
Storage credentials planning vote
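To make the rate-limiting item concrete, here is a minimal client-side sketch of retrying on HTTP 429. It is not taken from the Iceberg REST spec: `FakeCatalog`, `Response`, and `call_with_backoff` are hypothetical names, and the use of a `Retry-After` header in seconds is an assumption based on common HTTP practice.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


class FakeCatalog:
    """Hypothetical stand-in for a REST catalog that throttles the first N calls."""

    def __init__(self, throttled_calls: int):
        self.throttled_calls = throttled_calls
        self.calls = 0

    def commit(self) -> Response:
        self.calls += 1
        if self.calls <= self.throttled_calls:
            # Assumption: the server hints at a wait time in whole seconds
            return Response(429, {"Retry-After": "1"})
        return Response(200, body="committed")


def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `send` until it stops returning 429 or attempts run out.

    Returns the last response and the list of delays that were applied.
    """
    delays = []
    for attempt in range(max_attempts):
        resp = send()
        if resp.status != 429:
            return resp, delays
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Honor the server's hint (only the integer-seconds form is handled)
            delay = float(retry_after)
        else:
            # Otherwise fall back to exponential backoff with jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random())
        delays.append(delay)
        sleep(delay)
    return resp, delays
```

Calling `call_with_backoff(FakeCatalog(throttled_calls=2).commit)` waits twice and then returns the 200 response. A real client would also cap the total wait time.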
Flink Connector & View Support
- FlinkSink Metadata Enhancements: Proposed allowing user-defined stats (e.g. row count) to be written during Flink writes.
- Register View API: Early discussion about adding support to register logical views as catalog entities.
Metadata proposal
Register View discussion
👥 Community Updates
- New PMC Members: Kevin Liu and Matt Topol were added to the Project Management Committee.
- Meetup Announced: Iceberg Community Meetup held in Amsterdam on Dec 11.
- Release Activity: Patch release 1.10.1 planned, focusing on bug fixes and stability.
PMC addition announcement
Meetup announcement
Apache Polaris
1.3.0-incubating Release Approved
The Polaris community finalized and approved the release of version 1.3.0-incubating after resolving issues in the initial release candidate. The version includes:
- Generic Table GA: Graduation of the generic table feature to production-ready status, allowing seamless cataloging of external table formats like Hudi and Delta Lake.
- Improved Cloud Integration Tests: Strengthens stability in cloud-native environments.
- Bug Fixes and Reliability Enhancements
Release vote result
RC0 cancellation and RC2 vote
🔔 Event Listener Refactor
Refinements were made to Polaris's catalog event model:
- Simplified Event Hooks: Deprecated Before/AfterCommitTableEvent in favor of a cleaner notification architecture.
- Multiple Listener Support: Differentiated notification vs. interceptor behavior to prevent listener conflicts and enhance modularity.
Listener simplification thread
Interceptors vs. notifications discussion
♻️ Idempotency and Retry Support
In alignment with Iceberg's enhancements, Polaris initiated discussions to:
- Introduce Idempotency Keys in commit APIs
- Ensure safe retry mechanisms in case of transient failures
- Enhance robustness of the core catalog and REST APIs
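As a rough illustration of why idempotency keys make retries safe, the sketch below deduplicates commits on the server side. It is an in-memory toy, not Polaris code: `IdempotentCommitStore`, `new_idempotency_key`, and the response shape are all hypothetical.

```python
import uuid


class IdempotentCommitStore:
    """Hypothetical in-memory catalog that deduplicates commits by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.applied = []     # changes that were actually applied, in order

    def commit(self, idempotency_key: str, change: str) -> dict:
        # A retried request carries the same key, so replay the stored
        # response instead of applying the change a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.applied.append(change)
        response = {"status": 200, "version": len(self.applied)}
        self._responses[idempotency_key] = response
        return response


def new_idempotency_key() -> str:
    """Clients generate one key per logical operation and reuse it on retry."""
    return str(uuid.uuid4())
```

If a client times out after sending a commit and resends it with the same key, the change is applied once and both calls see the same response. A production design would also need key expiry and durable storage, which this sketch leaves out.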
☁️ AWS Integration Improvements
Several threads focused on expanding Polaris support for AWS authentication patterns:
- STS AssumeRoleWithWebIdentity: Allows AWS OIDC-based token flows (used in EKS, notebooks, etc.)
- AWS China ARN & KMS Support: Ensures compatibility with AWS partition differences and encryption configuration.
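The partition difference is easy to see in the ARN itself: China regions use `aws-cn` where standard regions use `aws`, and their endpoints end in `amazonaws.com.cn`. The helper below is a small sketch of partition-aware handling, not Polaris code: `parse_arn` and `regional_sts_endpoint` are hypothetical names, and only the two partitions mentioned here are covered.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


# DNS suffix per partition; only the two partitions discussed here are listed.
DNS_SUFFIX = {"aws": "amazonaws.com", "aws-cn": "amazonaws.com.cn"}


def parse_arn(arn: str) -> Arn:
    """Split arn:partition:service:region:account-id:resource into fields."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


def regional_sts_endpoint(role_arn: str, region: str) -> str:
    """Pick the regional STS endpoint that matches the role's partition."""
    partition = parse_arn(role_arn).partition
    return f"https://sts.{region}.{DNS_SUFFIX[partition]}"
```

For example, `regional_sts_endpoint("arn:aws-cn:iam::123456789012:role/polaris-reader", "cn-north-1")` yields an endpoint under `amazonaws.com.cn`. Code that hard-codes the `arn:aws:` prefix would reject or misroute such a role.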
🛠️ Tooling and Dev Experience
- Python CLI Packaging: Polaris CLI being prepped for PyPI and nightly releases.
- Release Automation for Polaris Tools: Scripts and GitHub Actions added to streamline CLI and support package releases.
- Early UI Proposal: Community exploring a user-friendly UI for catalog introspection and onboarding.
👨‍👩‍👧‍👦 Community and Governance
- Community Sprint: Virtual collaboration event scheduled for Dec 16 to tackle bugs, docs, and onboarding.
- NoSQL Sync Meeting: Held Dec 2, focusing on extending Polaris capabilities to non-relational workloads.
- Incubator Progress: Polaris shared updates for its Apache Incubator status and roadmap alignment.
Sprint announcement
Incubator report thread
Apache Arrow
📦 Format Evolution: TimestampWithOffset
The Arrow community voted to add a new canonical type: TimestampWithOffset. This enhancement allows better timezone handling by encoding the UTC offset directly with each timestamp value.
- Why it matters: Avoids ambiguity in interpreting timestamps across systems with different local times or daylight saving settings.
- Vote passed unanimously, signaling strong consensus.
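The idea of carrying the offset with each value can be sketched with the standard library alone. The pair layout below (UTC microseconds plus offset in minutes) is an assumption chosen for illustration; the actual Arrow storage layout is defined by the canonical extension type and may differ. `encode_with_offset` and `decode_with_offset` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_with_offset(ts: datetime) -> tuple[int, int]:
    """Split an offset-aware datetime into (UTC microseconds, offset minutes)."""
    offset = ts.utcoffset()
    if offset is None:
        raise ValueError("timestamp must be offset-aware")
    utc_micros = (ts - EPOCH) // timedelta(microseconds=1)
    return utc_micros, int(offset.total_seconds() // 60)


def decode_with_offset(utc_micros: int, offset_minutes: int) -> datetime:
    """Rebuild the original wall-clock time from the stored pair."""
    tz = timezone(timedelta(minutes=offset_minutes))
    return (EPOCH + timedelta(microseconds=utc_micros)).astimezone(tz)
```

Two events recorded at 18:00 +01:00 and 12:00 -05:00 encode to the same UTC instant but keep different offsets, so a reader can recover both the absolute time and the local wall-clock time without a timezone database.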
🧪 Experimental: 128-bit Timestamps
An earlier thread explored adding support for 128-bit picosecond-level timestamps, aimed at use cases requiring extreme time resolution (e.g., scientific or financial data).
- This is still under discussion and not yet planned for inclusion.
🚀 Language Releases
- Go: Arrow Go 18.5.0 release candidate (RC0) published and under vote.
- Rust: Arrow Rust 57.1.0 was recently released, with improvements to bitwise performance being considered.
- Java: Discussion started on Arrow Java 20.0.0, possibly decoupling its versioning for more agile releases.
Arrow Go RC0 vote
Arrow Rust changelog
Arrow Java release planning
🔄 Governance & Meetings
- New PMC Chair: Antoine Pitrou, one of Arrow’s co-creators, named new project chair.
- Community Meetings: Active meetings continued across Arrow working groups, including Arrow-R and general syncs.
- New Proposal (DACP): Early concept to introduce a “Data Access and Collaboration Protocol” to the Arrow ecosystem.
PMC chair announcement
DACP intro thread
Apache Parquet
🧵 String Column Layout Optimization
Micah Kornfield started a design thread to optimize string/byte array page layouts:
- Proposed sharing compressed dictionaries across multiple pages using FSST (Fast Static Symbol Table) encoding.
- Goal: Improve scan speed and reduce CPU overhead for large string columns.
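For readers unfamiliar with FSST, the core idea is a small static table that maps frequent substrings to one-byte codes, with an escape code for bytes that match no symbol. The toy below shows only that encode/decode step with a hand-written table. It is not the real FSST algorithm (which learns the table from a sample of the data) and not the proposed Parquet layout. It assumes non-empty symbols and fewer than 255 of them, and all names are hypothetical.

```python
ESCAPE = 255  # code reserved to mark a literal byte


def fsst_like_encode(data: bytes, symbols: list[bytes]) -> bytes:
    """Greedy longest-match encoding against a fixed symbol table."""
    # Try longer symbols first so the greedy match picks the longest one
    by_length = sorted(range(len(symbols)), key=lambda i: -len(symbols[i]))
    out = bytearray()
    pos = 0
    while pos < len(data):
        for code in by_length:
            if data.startswith(symbols[code], pos):
                out.append(code)
                pos += len(symbols[code])
                break
        else:
            # No symbol matches: emit the escape code followed by the raw byte
            out += bytes([ESCAPE, data[pos]])
            pos += 1
    return bytes(out)


def fsst_like_decode(encoded: bytes, symbols: list[bytes]) -> bytes:
    """Expand codes back to symbols; an escape code is followed by a literal."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            out.append(encoded[i + 1])
            i += 2
        else:
            out += symbols[encoded[i]]
            i += 1
    return bytes(out)
```

Because the table is static, the same `symbols` list can be reused to encode and decode values from many pages, which is the property the sharing proposal relies on. Decoding is a table lookup per code, so individual values can be decoded without decompressing a whole block.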
☕ Parquet Java 1.17.0 Release Planning
Planning is underway for Parquet Java 1.17.0, which includes:
- Dropping Java 8, moving to Java 11 minimum
- Accumulated improvements and minor bug/security patches
🔍 Metadata Cleanup: Deprecating file_path
A proposal was made to deprecate the file_path field in column chunk metadata:
- Considered obsolete in modern Parquet workflows
- Will remain for backward compatibility but no longer actively used
📐 Toward Parquet Format V3?
- Developers expressed intent to finalize all outstanding v2 features (e.g., bloom filters, page checksums).
- Early hints suggest the community may begin scoping a Parquet V3 format in 2026.
📆 Community Syncs
- Despite the U.S. holiday, the weekly sync on Nov 26 went ahead as planned.
- Reflects the globally distributed and consistent engagement of Parquet contributors.
📌 Final Thoughts
This period marked steady evolution across the lakehouse projects:
- Iceberg is refining its API and planning for V4 features like native indexing.
- Polaris is progressing toward graduation with mature features and API resiliency.
- Arrow continues to invest in format flexibility and multi-language consistency.
- Parquet is optimizing performance while laying groundwork for future format innovations.
Stay tuned for more updates in the next dev digest!