Apache Iceberg Dev List Digest August 25-29



Release planning and infrastructure

Java version issues for the 1.10.0 release – When building the Iceberg 1.10.0 release, some developers used Java 21 and encountered build failures because the Gradle build and Spark modules were pinned to Java 11. The discussion resolved that the project should target JDK 17 instead. Fokko Driesprong opened a PR to bump the build to 17 and Steven Wu merged it. Ryan Blue later confirmed he would test the release on JDK 11 to ensure backward compatibility.
Enabling a merge queue for Iceberg repositories – Renjie Liu proposed using GitHub’s merge queue so that pull requests are merged sequentially, with CI validating each one against the latest target branch plus any changes queued ahead of it before it lands. Several contributors (Jean‑Baptiste Onofré, Eduard Tudenhöfner, Fokko Driesprong, Péter Váry) supported the idea; Russell Spitzer and Kevin Liu noted that merge queues are easy to disable if problems arise. With no objections, Renjie planned to file a ticket with ASF Infra to enable the merge queue.

API proposals and deprecations

FileFormat API and position delete deprecation – Péter Váry presented an update on the FileFormat API proposal. He suggested dropping support for position‑delete files that include row data (since they are unused in the Java codebase) and deprecating native format‑specific readers/writers in favor of the new InternalData API. He outlined two options for handling deletes in the new API: implicitly defining delete file types via FileContent or allowing engines to convert PositionDelete objects. The discussion emphasised that removing row‑data position deletes simplifies the API and aligns with the planned deprecation (targeted for the 1.11.0 release), while still supporting readers for legacy files.
Deprecating position deletes with row data – In a related discussion, the community largely agreed to deprecate position‑delete files carrying row data. Contributors noted there are no producers of such files, and v3 features like deletion vectors make them obsolete. Ryan Blue recommended removing write support immediately and optionally keeping read support for backward compatibility. Péter Váry planned to start a vote to remove this feature in Iceberg 2.0 and to document the change in the 1.11.0 release notes.
Extending standardized statistics – Gábor Kaszab proposed expanding standardized table statistics (like file‑level metrics) to help query engines make better decisions. Jacky Lee responded that her team had extended column statistics internally and saw more than 30% performance gains. She encouraged adopting the v4 format and offered to collaborate on a public proposal for richer statistics.
Clarifying type promotion in schema evolution – Nicolae Vartolomei asked what “type promotion” means when evolving schemas. Contributors explained that writers should use the latest table schema; readers can promote values (e.g., read an int as a long) but writing an int into a column defined as long is discouraged. Ryan Blue summarised that type promotion is well‑defined: writers should always conform to the current schema. Micah Kornfield opened a PR to document the guidance.
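As a rough illustration of the read-side promotion described above, here is a minimal Python sketch. It is not Iceberg or PyIceberg code: the names DecimalType, can_promote and read_value are hypothetical, types are modeled as plain strings, and only the long-standing primitive promotions (int to long, float to double, and widening a decimal's precision at the same scale) are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    """Hypothetical stand-in for a decimal(precision, scale) type."""
    precision: int
    scale: int


# Primitive promotions assumed for this sketch: int -> long, float -> double.
_PRIMITIVE_PROMOTIONS = {("int", "long"), ("float", "double")}


def can_promote(file_type, table_type):
    """Return True if a value written as file_type may be read as table_type."""
    if file_type == table_type:
        return True
    if isinstance(file_type, DecimalType) and isinstance(table_type, DecimalType):
        # Decimals may only widen precision; the scale must stay the same.
        return (table_type.scale == file_type.scale
                and table_type.precision > file_type.precision)
    return (file_type, table_type) in _PRIMITIVE_PROMOTIONS


def read_value(value, file_type, table_type):
    """Reader-side promotion: widen a value from an older file to the current column type."""
    if not can_promote(file_type, table_type):
        raise TypeError(f"cannot read {file_type} as {table_type}")
    if table_type == "double":
        return float(value)
    # Python ints are unbounded, so int -> long needs no conversion here.
    return value
```

In this sketch an old data file written when the column was an int can still be read after the column becomes a long, while a new writer would be expected to produce long values directly rather than rely on readers to promote them.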

Features and tooling

Analytics Accelerator Library for Amazon S3 – AWS engineers introduced the Analytics Accelerator Library (AAL) for Amazon S3 and proposed making it the default input stream for Iceberg’s S3FileIO. A community sync was scheduled for 27 Aug; afterwards, Michael Stubbs summarised action items: create an epic in Iceberg’s JIRA, investigate vectored reads, extend customer testing, decide when to use the async client, analyse heap usage with and without AAL, produce a public benchmark document, and invite 3rd‑party storage vendors to test the library.
Increasing the REST spec max table format version to 3 – Amogh Jahagirdar proposed a PR to update the REST catalog specification to allow table format version 3. Because v3 had already been ratified and would ship with Iceberg 1.10.0, Ryan Blue, Kevin Liu and Russell Spitzer all responded with +1 votes and no objections.
Adding a columns_written metric – Manikandan R suggested adding a metric to file metadata listing which columns were actually written. Knowing which fields appear in each file could help skip irrelevant files during query planning. The idea did not receive responses during this week, but it highlights interest in richer file‑level metadata.
GitHub Action to lint Markdown – Manu Zhang proposed adding a GitHub action to lint Markdown files. Eduard Tudenhöfner suggested using Spotless, which the project already uses for Java code and which also supports Markdown. Fokko Driesprong and Russell Spitzer backed the idea of consistent formatting and preferred reusing existing tools rather than introducing new dependencies. Manu experimented with Spotless and discovered it only worked on Markdown files within the Gradle project; he updated the configuration to include docs and site directories and continued investigating alternatives for directories outside the build.
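To make the columns_written idea from earlier in this section more concrete, here is a hedged Python sketch of how a planner might use such a metric. Everything here is hypothetical: the proposal had no agreed design, DataFileEntry and files_to_scan are invented names, and the sketch assumes a column missing from a file reads as null (it ignores default values), so a filter that needs a non-null value in that column cannot match any row of the file.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class DataFileEntry:
    path: str
    # Hypothetical metric: names of columns actually written to this file.
    # None means the metric is absent, for example on older files.
    columns_written: Optional[FrozenSet[str]] = None


def files_to_scan(files: List[DataFileEntry],
                  non_null_filter_columns: Set[str]) -> List[DataFileEntry]:
    """Keep files that could contain rows matching a filter which requires
    non-null values in every column of non_null_filter_columns."""
    keep = []
    for f in files:
        if f.columns_written is None:
            # No metric recorded, so the file cannot be pruned safely.
            keep.append(f)
        elif non_null_filter_columns <= f.columns_written:
            # Every filtered column was written, so rows may match.
            keep.append(f)
    return keep


files = [
    DataFileEntry("a.parquet", frozenset({"id", "email"})),
    DataFileEntry("b.parquet", frozenset({"id"})),
    DataFileEntry("c.parquet"),  # metric missing
]
selected = [f.path for f in files_to_scan(files, {"email"})]
```

With a filter such as email = 'x', the sketch keeps a.parquet and c.parquet and skips b.parquet, because b.parquet never wrote the email column.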

PyIceberg and ecosystem

PyIceberg brainstorming session (Sept 30) – Kevin Liu announced a PyIceberg community brainstorming session scheduled for 30 Sept and solicited ideas for improving the Python library. On Aug 29 he outlined potential focus areas: feature‑parity for table maintenance with the Spark implementation, performance improvements using Iceberg‑rust to eliminate custom Cython/Avro readers, stabilising the public API for a 1.0 release, adding new functionality (v3 format support, Avro, merge‑on‑read deletes, deletion vectors), and improving documentation. He encouraged contributors to start a shared document to collect ideas and proposed forming workstreams to coordinate efforts.

Community logistics

Analytics accelerator follow‑up and meeting notes – After the Aug 27 community sync on the S3 Analytics Accelerator, Michael Stubbs posted action items and next steps (see above), inviting participants to help create the epic and benchmarks.
Reminder: podling report – The Incubator PMC reminded Iceberg that its quarterly report is due by early September, outlining the expected contents and stressing that candidate names should not be included before formal election.

Overall, the Iceberg community discussions during Aug 25–29, 2025 revolved around release preparation (Java versions and merge queues), refining APIs (FileFormat proposals, statistics, type promotion), tooling improvements, and planning for future features like the Analytics Accelerator Library and PyIceberg enhancements.
