Alex Merced

Posted on Nov 24

Apache Dev List Digest: Iceberg, Polaris, Arrow & Parquet (Nov 18–24, 2025)

#dataengineering #resources #data #opensource

For engineers building on open data platforms, there’s no better window into project roadmaps than the Apache dev mailing lists. These threads are where specs are debated, edge cases get dissected, and future releases start to take shape.

This week’s digest (Nov 18–24, 2025) covers notable updates from four key projects in the lakehouse ecosystem:

Apache Iceberg: table format for analytic datasets
Apache Polaris: REST-based Iceberg catalog and governance layer
Apache Arrow: cross-language in-memory columnar format
Apache Parquet: columnar file format widely used for storage and exchange

Apache Iceberg

A Quiet Week for Proposals, Focus Remains on Execution

The Iceberg dev@ list had no new proposals or design threads during this period. While the mailing list was quiet, this reflects a continuation of work already in flight rather than a lull in project activity.

Recent threads from earlier weeks (e.g. metadata caching policies, server-side scan planning enhancements, and multi-table transaction coordination) continue to move forward in implementation.

Some signs of ongoing focus:

The Iceberg-Rust 0.8.0 release is being scoped, with contributors working on features that improve DataFusion integration and Avro decoding.
Ongoing improvements to REST Catalog APIs, credential handling, and snapshot planning—topics raised in early November—remain in development.

In short, while there were no fresh discussion threads, the project remains active under the hood as teams advance existing designs toward release readiness.

Apache Polaris

Internal Architecture and Event Semantics Take Center Stage

Although fewer threads appeared this week, the Polaris developer list continues to reflect the project’s maturing architecture—especially in areas related to catalog events, testing infrastructure, and release automation.

Event Listener Design: Notification vs. Interceptor

The community continued a key design thread on how Polaris should support multiple event listeners:

Contributors proposed making event listeners notification-only and non-blocking.
A separate SPI (Service Provider Interface) may be introduced for interceptors that can modify or halt processing.
The goal is to avoid confusion and unintentional coupling by cleanly separating audit-style logging from policy enforcement.
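
To make the split concrete, here is a minimal sketch in Python of the two roles. It is not Polaris code: EventBus, CatalogEvent, VetoError, and run_operation are hypothetical names invented for illustration, and the listeners run inline for simplicity, whereas a truly non-blocking design would likely dispatch them asynchronously.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CatalogEvent:
    # Hypothetical event shape, not the Polaris event model.
    kind: str
    table: str


class VetoError(Exception):
    """Raised by an interceptor to halt processing."""


class EventBus:
    def __init__(self) -> None:
        self._listeners: List[Callable[[CatalogEvent], None]] = []
        self._interceptors: List[Callable[[CatalogEvent], None]] = []
        self.listener_errors: List[Exception] = []

    def add_listener(self, fn: Callable[[CatalogEvent], None]) -> None:
        self._listeners.append(fn)

    def add_interceptor(self, fn: Callable[[CatalogEvent], None]) -> None:
        self._interceptors.append(fn)

    def check(self, event: CatalogEvent) -> None:
        # Interceptors run before the operation; a VetoError propagates
        # to the caller and stops the request.
        for interceptor in self._interceptors:
            interceptor(event)

    def notify(self, event: CatalogEvent) -> None:
        # Listeners are notification-only: a failure is recorded and
        # never affects the operation or the other listeners.
        for listener in self._listeners:
            try:
                listener(event)
            except Exception as exc:
                self.listener_errors.append(exc)


def run_operation(bus: EventBus, event: CatalogEvent, operation: Callable[[], object]) -> object:
    bus.check(event)      # policy enforcement may halt here
    result = operation()  # the actual catalog change
    bus.notify(event)     # audit-style notification, cannot fail the call
    return result
```

Keeping the two registration paths separate means an audit logger cannot accidentally block a commit, while anything that is allowed to block has to opt in through the interceptor path.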

Multi-Table Commit Events: Clarifying Lifecycle Semantics

Ongoing discussions refined how Polaris emits events for multi-table transactions:

Consensus is forming around emitting after-commit events only once all involved tables have been committed.
There's openness to later introducing a staged-commit event type for advanced use cases, but the default will stay simple to ensure clarity for downstream consumers.
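
The ordering rule can be sketched in a few lines of Python. This is an illustration of the idea only, under the assumption that per-table events are buffered until the whole transaction succeeds; commit_all, commit_one, and emit are hypothetical names, and rollback of already-applied changes is out of scope here.

```python
from typing import Callable, Iterable, List, Tuple


def commit_all(
    tables: Iterable[str],
    commit_one: Callable[[str], None],
    emit: Callable[[Tuple[str, str]], None],
) -> int:
    """Commit every table, then emit one after-commit event per table.

    If any commit raises, the exception propagates and the buffered
    events are discarded, so consumers never see a partial transaction.
    """
    pending: List[Tuple[str, str]] = []
    for table in tables:
        commit_one(table)  # may raise; nothing has been emitted yet
        pending.append(("after_commit", table))
    for event in pending:
        emit(event)
    return len(pending)
```

A staged-commit event type, if it were added later, would correspond to emitting from inside the first loop instead of after it.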

Generic Table Feature Graduates from Beta

Polaris will officially graduate the “Generic Table” capability from beta in the upcoming 1.3.0-incubating release:

This feature allows non-Iceberg table formats (such as Hudi or custom formats) to be registered as table definitions in Polaris.
After multiple iterations and enhancements (including S3 credential support and format-specific extensions), it’s now considered stable and production-ready.

AWS Testing and CI Strategy

With AWS credits now available, the team discussed how to run end-to-end integration tests against real cloud infrastructure:

Real AWS tests will be limited to key features (e.g., IAM AssumeRole flows) not easily simulated locally.
Most testing will still rely on in-container S3 mocks to ensure quick and reliable CI runs.
Dedicated suites like RestCatalogS3IT are being introduced to house cloud-specific validations.
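
One way to picture the split is a small planner that routes each test to a backend. The sketch below is purely illustrative: the POLARIS_REAL_AWS_TESTS variable, the IntegrationTest type, and the test names are assumptions made up for this example, not part of the Polaris build.

```python
from typing import Dict, Iterable, Mapping, NamedTuple


class IntegrationTest(NamedTuple):
    name: str
    needs_real_aws: bool  # True only for flows a local mock cannot simulate


def plan_run(tests: Iterable[IntegrationTest], env: Mapping[str, str]) -> Dict[str, str]:
    """Decide where each test runs: S3 mock, real AWS, or skipped."""
    # Hypothetical opt-in switch; CI would set it only where credits apply.
    real_aws_enabled = env.get("POLARIS_REAL_AWS_TESTS") == "true"
    plan: Dict[str, str] = {}
    for test in tests:
        if not test.needs_real_aws:
            plan[test.name] = "s3-mock"
        elif real_aws_enabled:
            plan[test.name] = "real-aws"
        else:
            plan[test.name] = "skipped"
    return plan
```

With the switch unset, a default CI run touches only the mock, which keeps pull-request builds fast and free of cloud credentials.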

Release Automation Levels Up

Polaris's release tooling now includes:

GitHub Actions workflows to automate release branch creation, version bumping, and RC artifact publishing.
Scripts to verify GPG signatures and reproducibility of artifacts.
Plans to fold the CLI and Python SDK into the same pipeline.
Future integration with Apache Trusted Releases for added supply-chain confidence.
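
As a rough idea of what a reproducibility check involves, the sketch below compares the SHA-512 digest of a published artifact with that of an independent rebuild. It is not the Polaris script: the function names are hypothetical, and GPG signature verification is not covered here.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproducible(published_path: str, rebuilt_path: str) -> bool:
    """True when a local rebuild is byte-identical to the published artifact."""
    return sha512_of(published_path) == sha512_of(rebuilt_path)
```

A byte-for-byte match is a strict test: any embedded timestamp or nondeterministic file ordering in the build would make it fail, which is exactly what a reproducibility check is meant to surface.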

Together, these changes mark a major investment in release quality and maintainability as Polaris moves closer to graduation.

Apache Arrow

A Pause in Public Threads, But Work Continues

The Apache Arrow dev@ list had no new discussions during the November 18–24 period. This marks the second quiet week in a row, following the recent Arrow 22.0.0 and ADBC 21.0.0 releases earlier in November.

Despite the lack of new dev-list threads, Arrow contributors remain active in several key areas:

Maintenance of language-specific packages (e.g., Java, Rust, Go) is ongoing through GitHub and offline discussions.
The community continues to coordinate minor version planning and post-release support.
Early-November proposals like aligning Arrow’s release calendar with CPython’s are likely progressing off-list.

For those tracking Arrow, this kind of brief quiet period is not unusual—it often follows major releases as implementation and patching move forward behind the scenes.

Apache Parquet

String Layout Design and Sync Coordination

Apache Parquet was the most active of the four projects this week, with thoughtful design discussion and community organization taking place on the dev list.

[DISCUSS] String/Byte-Array Layout Rework

Micah Kornfield opened a design thread exploring alternative page layouts for string and byte-array columns:

The proposal focuses on improving random access and memory locality.
One concept is to reuse FSST dictionaries across multiple pages to reduce decoding overhead.
Arnav Balyan and others responded with feedback, noting that FSST symbol tables are small (~2KB) and could be leveraged effectively across page boundaries.
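
To see why a shared table is attractive, consider a toy version of FSST-style coding in Python. FSST maps up to 255 symbols of 1 to 8 bytes each onto single-byte codes and reserves code 255 as an escape for literal bytes, so a full table is at most 255 × 8 = 2,040 bytes. The sketch below hand-picks its symbols and matches greedily; real FSST learns the table from sample data, and nothing here reflects the layout actually proposed in the thread.

```python
from typing import List

ESCAPE = 255  # reserved code: the next byte is a literal


def encode(data: bytes, symbols: List[bytes]) -> bytes:
    """Greedy longest-match encoding against a fixed symbol table."""
    # Try longer symbols first so the longest match wins.
    order = sorted(range(len(symbols)), key=lambda code: -len(symbols[code]))
    out = bytearray()
    i = 0
    while i < len(data):
        for code in order:
            if data.startswith(symbols[code], i):
                out.append(code)
                i += len(symbols[code])
                break
        else:
            # No symbol matches here: emit the escape code plus the raw byte.
            out += bytes([ESCAPE, data[i]])
            i += 1
    return bytes(out)


def decode(encoded: bytes, symbols: List[bytes]) -> bytes:
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == ESCAPE:
            out.append(encoded[i + 1])
            i += 2
        else:
            out += symbols[code]
            i += 1
    return bytes(out)


# One table, stored once, reused by every page of the column chunk.
shared_symbols = [b"http://", b"www.", b".com"]
pages = [[b"http://www.a.com", b"http://b.com"], [b"www.c.com"]]
encoded_pages = [[encode(v, shared_symbols) for v in page] for page in pages]
```

Because each value decodes independently given only the table, a reader can jump to a single string on any page without decoding its neighbours, and the table cost is paid once rather than per page.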

This discussion is early-stage but signals a push to improve performance in real-world analytical workloads where random-access string decoding is a bottleneck.

Thanksgiving Sync Schedule

Julien Le Dem checked in with the community about the status of the weekly Parquet sync scheduled for Nov 26 (Thanksgiving week in the U.S.):

After polling contributors, he confirmed the meeting would proceed.
Julien provided instructions for alternative hosts to facilitate if needed.

While lightweight, these discussions reflect the project’s healthy operational cadence and strong contributor engagement.

Wrapping Up

Though quieter for some projects, the Nov 18–24 window still showed meaningful forward motion in the Apache data ecosystem:

Iceberg focused on execution and implementation behind previously proposed features.
Polaris solidified event listener semantics, improved release tooling, and set Generic Table to graduate from beta in the upcoming 1.3.0-incubating release.
Arrow maintained momentum off-list following a major release cycle.
Parquet initiated low-level design work to enhance performance in string-heavy datasets.

Stay tuned for next week’s digest as we continue to track the evolution of the open lakehouse stack from the inside out.