Alex Merced

Posted on May 20

Apache Data Lakehouse Weekly: May 13-20, 2026

#data #dataengineering #news #opensource

This was a release-heavy week across the lakehouse projects. Iceberg pushed 1.11.0 through a fourth release candidate while shipping a 1.10.2 patch in parallel. The V4 spec discussion accelerated with two simultaneous votes on content stats and relative paths. The Iceberg Go client kicked off a 0.6.0 vote. Parquet held a vote on format changes for IEEE 754 total order semantics and stood up a brand-new Footer Working Group. Arrow released ADBC 23 and Arrow 24.0.0, and the community spent serious time debating donation of an Erlang implementation and dictionary-encoded extension types. Polaris was relatively quiet on the dev list after its 1.4.1 patch went out on May 1, but the community continued discussions around catalog federation and the path to 1.5.0.

The connecting theme this week was format evolution under pressure. Iceberg V4 is no longer a far-off design exercise. Specific votes are landing. Parquet is restructuring how the format committee handles its most fundamental data structure, the footer. Arrow is grappling with whether to expand its extension type system to handle increasingly complex use cases coming from the engines that depend on it. These projects are all making decisions now that engines and storage providers will have to live with for years.

Apache Iceberg

The big story on the Iceberg dev list this week was the simultaneous push on three different release tracks and two V4 spec votes that ran in parallel. The community is moving fast.

Aihua Xu opened the Iceberg 1.11.0 RC4 vote after several prior release candidates surfaced issues that needed fixes. Binding +1 votes came in from Russell Spitzer, Eduard Tudenhöfner, Amogh Jahagirdar, and Kevin Liu, along with non-binding +1s from Steven Wu, huaxin gao, Yufei Gu, Jean-Baptiste Onofré, Yuya Ebihara, Neelesh Salian, Andrei Tserakhau, and Bryan Keller. The discussion threaded through environment-specific test failures and a few late patches that had to be cherry-picked to the release branch. Aihua handled multiple status updates and the eventual successful close. Going from RC1 in late April to RC4 in mid-May suggests this release had more friction than the team would have liked, but the spread of binding voters across companies is the right signal for a healthy feature release.

While 1.11.0 was working through its final candidate, Amogh Jahagirdar shepherded the 1.10.2 patch release. He kicked off the Apache Iceberg 1.10.2 RC1 vote and collected +1s from Russell Spitzer, Kevin Liu, Yufei Gu, huaxin gao, Yuya Ebihara, Aihua Xu, Neelesh Salian, roryqi, and Eduard Tudenhöfner before announcing the release of 1.10.2 midweek. Kevin Liu confirmed the announcement. The fact that the project can run a 1.10.x patch release and a 1.11.0 feature release in parallel reflects the maturity of the release infrastructure that the community has built over the past year. Two years ago, a parallel release would have been hard to coordinate. Now it is routine.

The V4 spec is where the more consequential work happened. Daniel Weeks opened the vote on relative paths in v4, drawing immediate engagement from Amogh Jahagirdar, Anoop Johnson, Anurag Mantripragada, Alex Stephen, Gang Wu, roryqi, Fokko Driesprong, Andrei Tserakhau, and Junwang Zhao. Relative paths are one of those structural changes that look small in the spec but have major implications for portability. Tables written with relative paths can be moved between buckets, regions, or storage backends without rewriting every manifest. For organizations running disaster recovery or migrating clouds, this is a real operational improvement. The volume of votes in the first few days of the thread suggests broad community support.
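To make the portability argument concrete, here is a minimal sketch of how a reader might resolve relative paths against a table location. The function name, the bucket names, and the rule that anything containing a URI scheme is already absolute are all assumptions for illustration. The actual resolution rules are whatever the v4 spec ends up defining.

```python
def resolve(table_location: str, path: str) -> str:
    """Return an absolute location for a path stored in table metadata.

    Hypothetical helper: the real v4 resolution rules are defined by the spec.
    """
    # Assumption: a path that carries a URI scheme is already absolute.
    if "://" in path:
        return path
    # Otherwise join it onto the table's current base location.
    return table_location.rstrip("/") + "/" + path.lstrip("/")


# The same stored entries resolve correctly before and after a move,
# so no manifest has to be rewritten when the table changes buckets.
stored_entries = ["data/part-00000.parquet", "data/part-00001.parquet"]

before_move = [resolve("s3://bucket-us-east/warehouse/db/events", p) for p in stored_entries]
after_move = [resolve("gs://bucket-eu/lake/db/events/", p) for p in stored_entries]
```

With absolute paths, the second list would require rewriting every stored entry. With relative paths, only the table location changes.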

Running in parallel, Eduard Tudenhöfner opened the vote on Content Stats representation in v4. Russell Spitzer, Talat Uyarer, Wing Yew Poon, Steven Wu, Amogh Jahagirdar, and Daniel Weeks all participated, and Eduard followed up with two additional status messages. Content stats matter because they directly affect query planning. Better stats let engines prune more aggressively, which means lower scan costs and faster queries. Getting the representation right in the format means every engine that supports v4 gets the same benefit.
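As a rough illustration of why stats drive pruning, the sketch below keeps only the files whose lower and upper bounds could overlap a query range. The FileStats class and its fields are hypothetical and far simpler than whatever representation the v4 vote settles on.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file bounds for a single integer column."""
    path: str
    lower: int
    upper: int


def prune(files, query_lo, query_hi):
    """Keep files whose [lower, upper] range may overlap [query_lo, query_hi]."""
    return [f for f in files if f.upper >= query_lo and f.lower <= query_hi]


files = [
    FileStats("a.parquet", 0, 99),
    FileStats("b.parquet", 100, 199),
    FileStats("c.parquet", 200, 299),
]

# A filter like "col BETWEEN 120 AND 180" only needs to scan one file.
survivors = prune(files, 120, 180)
```

The tighter and more reliable the stored bounds, the more files a planner can skip without opening them.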

Ryan Blue stepped back into the dev list with a proposal to add an unregister table endpoint to the REST spec. The thread drew responses from Russell Spitzer, Ajantha Bhat, Péter Váry, Jean-Baptiste Onofré, Yufei Gu, Amogh Jahagirdar, Steven Wu, Neelesh Salian, Daniel Weeks, and Steve Loughran. Today, removing a table from a catalog without dropping the data is awkward. Some catalogs treat it as drop with delete-data-false. Others have custom endpoints. Adding a standard unregister operation to the REST spec means every Iceberg client gets the same behavior regardless of catalog implementation. That is the kind of cleanup work that does not generate headlines but pays off for every operator who has ever had to migrate tables between catalogs.

Prashant Singh proposed another REST spec change: adding an X-Iceberg-Client-Capabilities header. Sung Yun and Daniel Weeks weighed in. Client capability negotiation is the kind of protocol detail that becomes critical as the ecosystem grows. A REST catalog needs to know what features a calling client supports before deciding which response shape to return. Without a header like this, the catalog ends up either returning the lowest common denominator or breaking older clients. The Iceberg REST spec is starting to look a lot like the HTTP spec did in its early years, with each iteration filling in the practical details that come up only in production deployments.
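Here is a small sketch of what server-side negotiation could look like. Only the header name comes from the proposal. The comma-separated value format, the capability token, and the two response shapes are assumptions made up for this example.

```python
HEADER = "x-iceberg-client-capabilities"


def parse_capabilities(headers: dict) -> set:
    """Extract capability tokens from request headers.

    Assumption: the value is a comma-separated token list. The proposal
    thread, not this sketch, defines the real format.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    raw = normalized.get(HEADER, "")
    return {tok.strip().lower() for tok in raw.split(",") if tok.strip()}


def choose_response_shape(headers: dict) -> str:
    """Pick a response shape the calling client can handle."""
    caps = parse_capabilities(headers)
    # "hypothetical-feature" is a placeholder, not a real capability name.
    return "extended" if "hypothetical-feature" in caps else "baseline"


old_client = {"Accept": "application/json"}
new_client = {"X-Iceberg-Client-Capabilities": "hypothetical-feature, other"}
```

A client that sends no header falls back to the baseline shape, which is how older clients keep working.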

On the language client side, Matt Topol opened the Iceberg Go 0.6.0 RC0 vote. Andrei Tserakhau, Neelesh Salian, Kevin Liu (three messages), Tanmay Rauth, and Gang Wu joined the thread. Iceberg Go has been quietly building momentum as a serious alternative for tools that do not want a JVM dependency. The release cadence is consistent. Each version closes more of the gap with the Java reference implementation. The number of new contributors voting on releases is one of the better signals of a healthy language client community.

Anurag Mantripragada continued his work on column-level updates with a new proposal for column update file representation. Gang Wu replied with feedback. Efficient column updates are one of the few remaining gaps in Iceberg's update story. The V4 design work is reportedly opening up new possibilities for handling updates without rewriting entire data files. This thread is one to watch closely as V4 work continues.

A few other discussions worth flagging. Kevin Liu raised a spec ambiguity question on the Avro schema for the day partition transform fields in manifests. These ambiguity threads tend to come from someone trying to implement Iceberg in a new language and running into edge cases the original Java implementation glossed over. Resolving them in the spec means every future implementation gets the same answer.

Steven Wu's vote on adding the CatalogObjectIdentifier schema to OpenAPI drew responses from Yufei Gu, Russell Spitzer, huaxin gao, Christian Thiel, Alexandre Dutra, Jean-Baptiste Onofré, Steve, Ajantha Bhat, Daniel Weeks, Renjie Liu, and Péter Váry. Long thread, many committers engaged. The Iceberg REST OpenAPI specification is where a lot of the standardization work between catalog implementations actually happens. When Polaris, Lakekeeper, and other REST catalogs all agree on the same OpenAPI shape, multi-engine interoperability gets easier for everyone.

Henry Haiying Cai posted a Kafka Connect discussion thread on worker coordinator progress detection. Kafka Connect remains an under-discussed but heavily used piece of the Iceberg ecosystem, and ongoing reliability work there matters for streaming ingestion pipelines.

The bottom line on Iceberg this week: the project shipped a patch release, drove a feature release through a final candidate, opened two V4 spec votes, ran a Go client release vote, and discussed standards-level REST changes. That is a sustained pace of work that very few open source data projects can match.

Apache Polaris

Polaris had a quieter week on the dev list than usual, which is the expected pattern after a release. Apache Polaris 1.4.1 shipped on May 1 as a patch release on the 1.4.0 line that came out April 21. The 1.4.1 patch addressed storage URI handling and a few smaller fixes. The release cadence for the Polaris community has settled into a roughly monthly minor or patch release with quarterly major versions, which is a sustainable pace for a project that graduated from the Apache Incubator only in February.

The graduation context is worth keeping in mind when looking at this week's activity. Polaris is now an Apache Top-Level Project rather than a podling. That means the project's governance and infrastructure operate under the full Apache umbrella, with its own PMC and direct relationship with the Apache board. The transition has not changed the day-to-day cadence on the dev list, but it has changed what kinds of discussions happen. Release voting no longer needs IPMC approval. PMC member additions and committer promotions happen through the project's own process. The community has fewer process gates to work through. That tends to translate into faster feature iteration over time.

The 1.4.0 release that landed in late April was substantive. It introduced granular control over how Polaris interacts with cloud storage, enabling multi-tenancy with support for separate encryption keys and IAM identities across catalogs. Iceberg metrics persistence to the Polaris database moved from preview to stable, giving operators a way to capture ScanReports and CommitReports directly without piping them through an external metrics system. The Helm chart picked up a JSON schema for validation of values files, which is the kind of detail that turns out to matter a lot when you are trying to operationalize the catalog at scale.

Catalog federation continues to be one of the most-discussed areas of the Polaris codebase even when there is not a specific dev list thread driving it. Federation lets a single Polaris instance act as a routing layer for tables that live in other Iceberg REST catalogs, Hive Metastores, or other systems. For organizations consolidating multiple catalog sources without forcing a big-bang migration, federation is the answer. The 1.4 release built on the federation work from 1.1 and 1.2, adding extension points that make it easier to add new catalog types without rewriting the core federation logic.

The path to Polaris 1.5.0 is starting to take shape based on signals from the broader community. Generic tables, which let Polaris manage non-Iceberg formats like Delta Lake and Hudi through a unified catalog interface, are likely to see additional polish in the next release. Open Policy Agent integration for external authorization will continue to mature. And the work on persisting events to multiple backends, which arrived in preview form in 1.2 and was expanded in 1.4, is one of those features that turns the catalog into a real source of operational truth for governance teams.

Worth noting for Polaris watchers. The Snowflake engineering blog published a detailed analysis of the 1.4 release that covers the security and metrics work in more depth. The Dremio blog also covered Polaris's positioning as the catalog standard for Iceberg lakehouses and agentic analytics, framing how the catalog layer connects to AI agents that need to query and act on lakehouse data.

The pattern across both pieces is that Polaris is being positioned not just as an Iceberg catalog but as the governance and discovery layer for the lakehouse as a whole. That framing has implications for what gets prioritized in the next release. Features that help AI agents discover tables, understand schemas, and reason about data quality become first-class citizens. Features that help human governance teams manage policy, audit access, and trace data lineage become equally important. The dev list has not yet had the formal proposal threads for these features, but the trajectory is clear from the community discussions and from how vendors building on Polaris are talking about the project.

Apache Arrow

The Arrow dev list was as busy as it usually is. Arrow handles releases on a different cadence than Iceberg or Polaris, with multiple sub-projects releasing on their own schedules under the same project umbrella. This week saw release votes on multiple Rust crates, .NET, Go, and the main monorepo.

Andrew Lamb opened the Arrow Rust 58.3.0 RC1 vote, the Arrow Rust 57.3.1 RC1 vote, and the Arrow Rust 56.2.1 RC1 vote within a few days of each other. Three concurrent patch and minor releases across three different version lines is unusual even for arrow-rs, which has a famously fast release cadence. Ed Seidl, Bryce Mecum, Raúl Cumplido, and L. C. Hsieh voted on several of the threads. All three votes passed and the releases shipped. Arrow Rust is now on a release cycle measured in days rather than weeks, which means downstream projects that depend on it have to think harder about how aggressively they pin versions.

The DataFusion Python Bindings 31.0.0 vote also closed earlier in the period, continuing the steady release pace on the DataFusion side of the Arrow ecosystem. DataFusion, which is now its own Apache top-level project, is increasingly the engine of choice for new analytical tools built in Rust, and its release cadence reflects that. Every release pulls more contributors in and adds more SQL surface area.

The donation discussions on the dev list this week were the most interesting strategic signals. Sutou Kouhei opened the vote to donate Apache Arrow Erlang, drawing responses from Benjamin Philip, Curt Hagenlocher, Matt Topol, and David Li. Erlang is a language used heavily in telecom, distributed systems, and high-availability backend services. Adding an official Arrow implementation for Erlang opens up zero-copy data exchange for systems built on the BEAM virtual machine. The discussion was constructive, with Benjamin Philip following up on grant document specifics over several messages.

In parallel, Rok Mihevc opened the vote to donate pyarrow-stubs to Apache Arrow. Antoine Pitrou, Dewey Dunnington, Matt Topol, Raúl Cumplido, Nic Crane, Adam Reeve, Ian Cook, David Li, Sutou Kouhei, Alenka Frim, Jacob Wujciak, Joris Van den Bossche, wish maple, and L. C. Hsieh all participated. PyArrow stubs are the Python type annotations that IDEs and type checkers use to give developers autocomplete and error checking when they write code against pyarrow. The fact that this was a community project being formally donated to Arrow shows how much the broader Python ecosystem depends on these stubs, and how the community has stepped up to maintain them.

Dewey Dunnington opened an IPC representation discussion for dictionary-encoded extension types. Antoine Pitrou and Matt Topol joined. Extension types are one of those features that look niche on the surface but matter enormously in practice. They let Arrow carry domain-specific type information like geographic coordinates, UUIDs, or JSON documents without breaking interoperability with consumers that do not understand the extension. Combining extension types with dictionary encoding is technically tricky, and getting the IPC representation right means downstream engines can handle these types correctly when they cross network or process boundaries.
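The interoperability point is easier to see with a sketch. Arrow marks an extension type by attaching the ARROW:extension:name and ARROW:extension:metadata keys to a field's custom metadata, while the data itself stays in an ordinary storage type. The dict-based field below is a hypothetical simplification, not the pyarrow API.

```python
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical, simplified stand-in for an Arrow field. The canonical
# arrow.uuid extension is stored as 16-byte fixed-size binary.
uuid_field = {
    "name": "user_id",
    "storage_type": "fixed_size_binary[16]",
    "metadata": {EXT_NAME_KEY: "arrow.uuid", EXT_META_KEY: ""},
}


def logical_type(field: dict, known_extensions: set) -> str:
    """Return the extension name if this consumer understands it,
    otherwise fall back to the plain storage type."""
    ext = field["metadata"].get(EXT_NAME_KEY)
    if ext in known_extensions:
        return ext
    return field["storage_type"]


aware = logical_type(uuid_field, {"arrow.uuid"})
unaware = logical_type(uuid_field, set())
```

A consumer that does not recognize the extension still reads valid 16-byte values. Dictionary encoding adds a layer of indirection on top of this, which is where the IPC representation question comes from.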

Antoine Pitrou opened a similarly foundational discussion with the question of whether to restrict field, schema, and custom metadata to UTF-8. Rusty Conover, Raphael Taylor-Davies, and Dewey Dunnington responded. This is the kind of spec question that has to be answered very carefully because the answer affects every existing Arrow implementation. Most production code already treats this metadata as UTF-8 in practice, but the format does not explicitly require it. Tightening the spec to match practice is good cleanup work.
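For a sense of what a tightened spec would mean for implementations, here is a minimal sketch of a validator that flags metadata entries whose key or value is not valid UTF-8. The function name and the bytes-to-bytes dict are assumptions for illustration and do not reflect any particular Arrow library.

```python
def non_utf8_keys(metadata: dict) -> list:
    """Return the keys of entries whose key or value is not valid UTF-8.

    Hypothetical helper. Metadata is modeled as a bytes-to-bytes mapping.
    """
    bad = []
    for key, value in metadata.items():
        for part in (key, value):
            try:
                part.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(key)
                break  # one bad part is enough to flag the entry
    return bad


metadata = {
    b"unit": b"m/s",
    b"label": "température".encode("utf-8"),
    b"blob": b"\xff\xfe\x00\x01",  # arbitrary binary, not valid UTF-8
}

offenders = non_utf8_keys(metadata)
```

Writers that stash binary payloads in metadata today would be the ones affected by a stricter rule.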

Raúl Cumplido announced Apache Arrow 24.0.0 after the vote passed. The release includes work across the C++, Python, R, and Java implementations. The Arrow monorepo has slowed its release cadence to roughly once per quarter while the language-specific implementations like Rust, Go, and .NET release on their own schedules. The split-cadence approach lets the fast-moving language implementations ship features without being held back by the rest of the project, while the monorepo releases provide stable points for the C++ ecosystem to build against.

David Li announced Apache Arrow ADBC 23. ADBC, the Arrow Database Connectivity standard, continues to gain traction as an Arrow-native alternative to ODBC and JDBC. The release notes were not detailed in the announcement but the steady cadence speaks for itself. ADBC drivers exist for an increasing number of databases, and the API is stabilizing.

A few smaller items rounded out the week. Curt Hagenlocher announced Apache Arrow .NET 23.0.0 after the RC0 vote passed. Matt Topol announced Apache Arrow Go 18.6.0. Tornike Gurgenidze opened a thoughtful discussion thread on an ADBC Partitioned Bulk Ingest API with a response from Curt Hagenlocher. Dan Mattheiss proposed AVX2 optimization work for the Parquet bloom filter with responses from Antoine Pitrou and wish maple. Antoine Pitrou announced an in-person Arrow and Parquet meetup in Paris. And Jarek Potiuk opened a discussion about Arrow involvement in the Community Over Code Glasgow 2026 hackathon, with positive responses from Nic Crane, Antoine Pitrou, Raúl Cumplido, and Jean-Baptiste Onofré.

The state of Arrow remains strong. The project has the broadest contributor base of any of the four projects covered here, with regular contributions across roughly a dozen languages and multiple sub-projects. The release management work that Raúl Cumplido and Sutou Kouhei handle quarter after quarter is the unsung infrastructure that makes the whole ecosystem function.

Apache Parquet

Parquet had the most consequential governance moment of the week. Jiayi Wang announced the kickoff of the Parquet Footer Working Group. Raúl Cumplido joined the discussion, and Jiayi followed up. The Parquet footer is the central metadata structure in every Parquet file. It tells readers what columns exist, what their types are, where the row groups live, what statistics are available, and how the data is encoded. Anything that changes the footer changes Parquet itself.
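For readers who have never looked at the raw layout, the sketch below locates the footer in an unencrypted Parquet file. The file ends with the Thrift-encoded metadata, a 4-byte little-endian metadata length, and the 4-byte magic PAR1. The function name is hypothetical, and the sample bytes are a synthetic layout rather than a valid Parquet file.

```python
import struct

MAGIC = b"PAR1"


def footer_span(data: bytes) -> tuple:
    """Return (start, end) byte offsets of the footer metadata."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Synthetic layout: magic, fake column data, fake footer, length, magic.
fake_footer = b"META"
sample = MAGIC + b"x" * 10 + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

start, end = footer_span(sample)
footer_bytes = sample[start:end]
```

Every reader has to perform this lookup before it can touch any data, which is why footer size and decode cost come up so often in the threads mentioned here.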

Standing up a formal working group for the footer signals that the community sees a need for structured discussion on what the footer should evolve into. Recent threads on the dev list reinforce this. Divjot Arora opened a discussion on Parquet Footer Options. Daniel Weeks proposed support for non-contiguous pages in Parquet, drawing extensive responses from Andrew Bell, Adrian Garcia Badaracco, Micah Kornfield, Will Edwards, and Andrew Lamb. Steve Loughran raised concerns about hardening variant readers, and Antoine Pitrou responded. Each of these threads touches the footer in some way. A working group with a defined scope and charter is a reasonable way to handle a body of work that big.

Gang Wu opened the vote on PARQUET-2249, the format change for IEEE 754 total order and NaN counts. Micah Kornfield, Steve Loughran, and Ed Seidl voted. Total order and NaN handling are exactly the kind of numerical correctness issues that lurk in floating-point data for years until they cause a query to silently produce wrong answers. Resolving them at the format level means engines do not have to invent their own conventions, and means data that crosses engine boundaries behaves predictably.
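To show what total order means in practice, here is a minimal sketch of a sort key that follows the IEEE 754 totalOrder predicate for 64-bit floats. It uses the common bit-twiddling approach of reinterpreting the float as a signed integer and flipping the low 63 bits of negative values. This illustrates the predicate itself, not the exact wording of the Parquet format change.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a float to an int whose ordering matches IEEE 754 totalOrder."""
    # Reinterpret the 64-bit float as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set. Flipping the other 63 bits
    # makes larger magnitudes sort lower, as they should.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


values = [1.0, float("inf"), -0.0, 0.0, -1.0, float("-inf")]
ordered = sorted(values, key=total_order_key)
```

Under this ordering, -0.0 sorts strictly before 0.0 and a NaN with a clear sign bit sorts after positive infinity, so min and max statistics have one well-defined answer even when NaNs are present.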

Russell Spitzer opened an automated release discussion for Parquet. Arnav Balyan, Gang Wu, and Fokko Driesprong all joined. Release automation is the kind of internal project that pays off every release cycle. Polaris went through a similar automation push last year and now releases on a much more predictable cadence as a result. The Parquet project handling its own automation thread suggests the community is ready to move from the artisanal release model to something more repeatable.

Gang Wu announced the release of Apache Parquet Java 1.17.1 after the RC0 vote passed. The release passed binding votes from Steve Loughran, Fokko Driesprong, Russell Spitzer, Daniel Weeks, and Xinli shang. Manu Zhang had earlier kicked off the discussion on the next parquet-java release, with input from Steve Loughran, Aaron Niskode-Dossett, Fokko Driesprong, Julien Le Dem, Gang Wu, and Rahil C. That kind of thread, where multiple committers weigh in on scope before the actual release work begins, is exactly the right governance pattern for a format-defining library.

Micah Kornfield announced Ed Seidl as a new Parquet committer. Andrew Lamb, Gang Wu, and Raúl Cumplido offered congratulations. Ed has been a steady contributor across both Parquet and Arrow for some time. The promotion to committer reflects the work he has put in across a number of release votes and discussion threads.

Arnav Balyan opened two discussion threads that point to where the Parquet community is thinking about AI-assisted contribution. The first was a discussion on AI tooling policy for Parquet, drawing a response from Fokko Driesprong. The second was a discussion on adding AGENTS.md to parquet-java, with input from Aaron Niskode-Dossett, Andrew Lamb, and Micah Kornfield. AGENTS.md is the emerging convention for documenting how AI coding agents should interact with a codebase. The fact that an Apache format project is having this conversation openly on its dev list, rather than letting individual contributors quietly use AI tools without disclosure, is the right approach. It is also a topic the Iceberg community has been working through for several months.

Micah Kornfield opened a discussion on remaining open spec-level questions for ALP. Andrew Lamb engaged. ALP is the adaptive lossless floating-point compression scheme that has been making its way through Parquet for the past year. It promises significant compression improvements on floating-point data versus the existing compression schemes. Getting the spec right matters because once ALP is in the format, every reader has to be able to decode it.

Other threads of note. Andrew Lamb opened a discussion on where the VariantJsonParser should live. Ed Seidl opened a discussion on making path_in_schema optional. Ismaël Mejía requested code reviews on Java performance optimization work, with engagement from Steve Loughran and Fokko Driesprong. Andrew Bell opened a discussion on wide schemas in Parquet. Each of these is the kind of practical improvement work that compounds over time.

Parquet is at an interesting point in its life cycle. The format is mature enough that nobody questions whether it should exist. The community is now wrestling with how to handle format evolution carefully, how to bring in new contributors at scale, and how to balance the needs of the established Java reference implementation against the growing number of native implementations in Rust, Go, and other languages. The Footer Working Group is the most visible sign of this maturation. The AI tooling policy discussion is another.

Cross-Project Themes

Three patterns connect the work across all four projects this week.

The first is format evolution under load. Iceberg V4 votes ran this week on relative paths and content stats. Parquet stood up a Footer Working Group and held a vote on IEEE 754 total order. Arrow discussed restricting metadata to UTF-8 and resolving IPC representation for dictionary-encoded extension types. These are all spec-level changes that ripple through the ecosystem. The fact that all three formats are evolving at the same time is not coincidental. The lakehouse architecture has matured to the point where the gaps between the formats are now visible in production. Each project is addressing the issues it can address, and the changes need to compose cleanly with each other.

The second is the rise of formal working groups and process discipline. Parquet now has a Footer Working Group. Iceberg has been running working groups on Kafka Connect, partition stats, and Python bindings for months. Polaris has a clear schedule-driven release cadence and graduated to TLP. Arrow has multiple parallel release tracks managed by different committers. These projects are not just shipping features. They are building the institutional infrastructure that lets large communities collaborate without stepping on each other. That kind of process work is invisible from the outside but essential to long-term project health.

The third is the open conversation about AI tooling and disclosure. Parquet's discussion this week on AI tooling policy and AGENTS.md echoes similar discussions in Iceberg earlier this year. The community is figuring out how to handle AI-assisted contributions in a way that is transparent, that preserves the chain of provenance for the code, and that does not slow down legitimate contributors. The fact that these discussions are happening openly on dev lists rather than being handled through informal back channels is the right approach. The decisions made over the next few months will set the pattern for how Apache projects handle AI-assisted development for years.

Looking Ahead

Next week should bring the close of the Iceberg 1.11.0 release if RC4 holds up, which would be a meaningful milestone for the project's 2026 release cadence. The Iceberg V4 votes on relative paths and content stats are likely to close and move into implementation. The Parquet Footer Working Group will likely publish a charter and meeting schedule. Polaris watchers should expect early signals on what is heading into 1.5.0. Arrow has multiple smaller release votes likely to land in the next seven days, and the Erlang donation vote will close.

The bigger picture is that the lakehouse stack is maturing in step with the AI demand pulling on it. Every one of these format and protocol decisions affects how AI agents will be able to query, update, and reason about lakehouse data in the years ahead. The community is doing the unglamorous work now so the platforms can do the headline work later.

Resources and Further Learning

Get Started with Dremio

Try Dremio Free - Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris and Arrow - Learn how Dremio brings the open lakehouse stack together

Free Downloads