Alex Merced

Posted on Jun 4

Apache Data Lakehouse Weekly: May 28 - June 4, 2026

#database #dataengineering #news #opensource

The lakehouse projects spent this week doing two things at once. They lined up a remarkable stack of releases, with DataFusion 54.0.0 in active vote, Polaris 1.6.0 scheduled, Parquet weighing both a format release and a Java release, Arrow preparing Arrow Java 20.0.0, and Iceberg's C++ and Rust implementations both planning their next versions. At the same time, the foundation's own infrastructure pushed its way onto two dev lists, as the ASF warned both Iceberg and DataFusion that the shared pool of GitHub-hosted CI runners is running out of headroom. Add a wave of post-1.11 spec design in Iceberg, a major proposal landing in Polaris, and AI agents quietly showing up inside project workflows, and you get one of the busiest weeks of the quarter.

This issue also marks a milestone for the newsletter itself: Apache DataFusion joins the regular rotation alongside Iceberg, Polaris, Arrow, and Parquet. The query engine layer has become too central to the lakehouse story to cover only in passing, and as this week makes clear, the DataFusion dev list moves fast enough to earn its own section every week. As always, every claim below links to the source thread on lists.apache.org, so you can follow any discussion straight to the people having it.

Apache Iceberg

The defining story on the Iceberg list right now is what happens after a major release ships. With 1.11.0 out the door in mid-May, the community pivoted hard into spec design, and the column updates work is the center of gravity. Anurag Mantripragada's thread on the column update file representation drew eight participants and eighteen replies, making it the most active design discussion of the week. The work follows the column updates design document and the original discussion thread, and it tackles the question of how engines write a change to one column without rewriting everything around it. Gábor Kaszab then opened a focused companion thread on the column update metadata representation, deliberately splitting the metadata question from the file question so each can move at its own pace. Together the two threads show the community doing spec work the right way, with separate surfaces for separate decisions and named owners for each.

For practitioners, the stakes of this work are easy to state. Today, updating a single column value in Iceberg means writing deletes plus new data files, and engines pay for that in write amplification, especially on wide tables where one changed column drags hundreds of untouched ones through a rewrite. A native column update representation changes the cost model for slowly changing dimensions, privacy corrections, and the feature-backfill patterns machine learning teams run constantly. Getting the file format and the metadata format right, separately and deliberately, is how a change this deep lands without breaking the ecosystem of readers.
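To put rough numbers on that write amplification, here is a back-of-the-envelope sketch. The table shape (one million changed rows, 300 columns, 8 bytes per value) is a made-up workload, and the model ignores compression, encodings, and delete files, so treat the output as an order-of-magnitude illustration rather than a benchmark.

```python
def rewrite_bytes(rows: int, columns: int, avg_value_bytes: int) -> int:
    """Bytes written when every changed row is rewritten with all of its columns."""
    return rows * columns * avg_value_bytes


def column_update_bytes(rows: int, changed_columns: int, avg_value_bytes: int) -> int:
    """Bytes written if only the changed columns are written for those rows."""
    return rows * changed_columns * avg_value_bytes


def write_amplification(columns: int, changed_columns: int) -> float:
    """Ratio of full-row bytes to column-only bytes for the same logical change."""
    return columns / changed_columns


# Hypothetical workload: numbers chosen for illustration only.
rows, columns, value_bytes = 1_000_000, 300, 8
full = rewrite_bytes(rows, columns, value_bytes)
narrow = column_update_bytes(rows, 1, value_bytes)
print(f"full rewrite: {full / 1e9:.1f} GB, column-only: {narrow / 1e6:.1f} MB")
print(f"amplification: {write_amplification(columns, 1):.0f}x")
```

Under these assumptions the full rewrite moves 2.4 GB to change 8 MB of values, which is the gap a native column update representation is meant to close.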

The design wave did not stop there. Xiening Dai opened a discussion on global snapshot consistency for Iceberg tables, starting from the isolation levels the spec defines today through the write.delete, write.update, and write.merge isolation-level properties, which accept either snapshot or serializable. His thread pushes the conversation beyond a single table's guarantees, and it is worth watching as engines lean harder on Iceberg for transactional workloads. Ankit Kumar opened a thread on efficient CDC upserts, linking back to the existing CDC design document and the prior discussion thread. Change data capture keeps resurfacing on this list because it sits at the junction of everything else, touching row-level deletes, snapshots, and engine integration all at once. And Shekhar Rajak raised a precise spec question on Avro encoding for non-zone timestamp types, tied to PR #16577 and issue #12751, about how timestamp and timestamp_ns values carry the adjust-to-utc=false property in Avro.
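For readers who have not set the isolation-level properties mentioned above, this small sketch shows their full names and the two values they accept. The validation helper is hypothetical and not part of any Iceberg library, and the fallback to serializable is an assumption based on the documented table property defaults.

```python
# The three per-operation isolation properties and their accepted values.
ISOLATION_PROPERTIES = (
    "write.delete.isolation-level",
    "write.update.isolation-level",
    "write.merge.isolation-level",
)
ALLOWED_LEVELS = {"snapshot", "serializable"}


def resolve_isolation(properties: dict) -> dict:
    """Return the effective isolation level per operation (hypothetical helper).

    Assumes an unset property falls back to "serializable".
    """
    resolved = {}
    for key in ISOLATION_PROPERTIES:
        level = properties.get(key, "serializable")
        if level not in ALLOWED_LEVELS:
            raise ValueError(f"{key} must be one of {sorted(ALLOWED_LEVELS)}, got {level!r}")
        resolved[key] = level
    return resolved


# Example: relax MERGE to snapshot isolation, leave the others at the default.
print(resolve_isolation({"write.merge.isolation-level": "snapshot"}))
```

Note that all three properties scope to a single table, which is exactly the boundary the global snapshot consistency thread is trying to move past.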

The REST catalog spec kept moving too. Huaxin Gao bumped the vote on adding list and load function endpoints to the REST spec, reporting that the FunctionIdentifier versus CatalogObjectIdentifier debate is now resolved. PR #16144 merged CatalogObjectIdentifier, clearing the naming question that had stalled the vote. Functions in the REST catalog continue the pattern we saw through May, where the REST surface keeps absorbing capabilities that used to live engine-side.

A second theme this week was engine version policy, and it shows the cost of living downstream of fast-moving compute engines. Anurag Mantripragada opened a discussion on Spark versioning strategy with accelerated Spark releases. Spark 3.4 support is now removed following the 1.11 release, and the Spark community is proposing a faster release cadence, which forces Iceberg to decide how many Spark versions it can carry at once. Steven Wu opened the matching conversation for Flink version support after Iceberg 1.11.0, anchored on PR #16517, and it drew eight participants and ten replies. The shape of both threads is the same. Every engine version Iceberg supports costs CI time, reviewer attention, and release testing, and the budget for all three is finite. The tradeoff is the classic one. Support fewer engine versions and the project ships faster, tests less, and strands users on older clusters. Support more and the matrix swallows CI minutes and reviewer hours. Both threads are converging on explicit written policy rather than case-by-case calls, which is the right instinct, because a published support window lets platform teams plan upgrades instead of discovering them in release notes.

That budget question turned literal in the most active community thread of the week. Robert Thomson wrote to the Iceberg PMC about the project's consumption of ASF shared GitHub-hosted runners, noting that the foundation introduced its GitHub Actions usage policy in 2024 and that the shared runner pool has been at or near its limit. Twelve participants and fourteen replies later, this is a real planning problem, not a courtesy notice. Iceberg runs one of the largest CI matrices in the data space, across Java, Python, Rust, C++, and Go, and the foundation's capacity ceiling now sits underneath all of it. The likely outcomes are the ones other large projects have already reached for: trimming redundant jobs, gating expensive suites behind labels, leaning on self-hosted or donated runners, and being more deliberate about which platforms get tested on every commit. None of those choices is free, and the fourteen replies show contributors weighing each against the project's reliability bar.

The language subprojects had a productive week of their own. Junwang Zhao started the release discussion for Apache Iceberg C++ 0.3.0, working from the roadmap in iceberg-cpp issue #523 and noting that not every roadmap item needs to block the release. The thread gathered four participants and eight replies, and it follows directly from the 0.3.0 conversation that started in late May. On the Rust side, Danny Jones opened the tracking issue for iceberg-rust v0.10.0 as an action item from the Rust community sync, pointing contributors at issue #2527 as an open invitation. Jordan Epstein raised the harder structural question in a thread on reviewer bandwidth in iceberg-rust, which drew six participants. The Rust implementation has more contributor energy than reviewer capacity right now, and the thread is an honest attempt to fix that imbalance before it becomes a bottleneck.

Three more threads round out the week. Noritaka Sekiyama proposed adding an OpenTelemetry MetricsReporter to iceberg-core, which exports ScanReport and CommitReport data to any OTLP-compatible backend. The proposal drew seven participants and eleven replies, a strong signal that observability is an underserved need. Iceberg ships built-in reporters today, but OTLP export plugs table metrics into the monitoring stacks teams already run. The idea also fits the broader pattern of the post-1.11 cycle, where the project is investing in operational maturity alongside spec features. Scan and commit metrics that flow to Prometheus, Datadog, or any OTLP backend turn table health from a quarterly audit into a live dashboard, and that is the kind of capability that makes platform teams comfortable betting on Iceberg for their most critical workloads.

Samuel Pacheco Cantu asked about relative paths and location resolution for multi-region replication, where data files live in different storage locations depending on the region, and got six replies of practical guidance.

And Kevin Liu, who topped the month's activity with 43 messages, raised a flag on Iceberg Summit 2027. His event coordinator reports that most San Francisco conference venues are already booked for 2027, so the community needs to decide on timing and location now. Ten participants jumped in, which says something about how central the summit has become to the project's annual rhythm. Anyone who watched the 2026 summit's session recordings roll out over the past month knows the event now functions as the community's design checkpoint as much as its showcase, so a venue decision made this summer shapes the project calendar a full eighteen months out.

Apache Polaris

Polaris had the most concentrated burst of activity of any project this week, with 71 messages in the first four days of June alone. The headline is a proposal that has been months in the making. Jean-Baptiste Onofré published the Polaris Directories proposal, drafted as PR #4613, after a long arc of discussion that ran through earlier ideas like Table Sources. Directories give Polaris a way to reason about locations and the things that live in them, and the proposal arriving as a reviewable pull request rather than an external document is itself notable. JB explained why in a parallel thread on moving the proposal docs to markdown. Given recent advances in AI tools, he is experimenting with writing the Directories proposal as markdown in the repository, where both humans and AI tooling can read, diff, and review it. The process change and the proposal are shipping together as one experiment.

It is worth dwelling on why the venue matters. A proposal in an external document lives outside the project's history, with comments that vanish and versions nobody can diff. A proposal as markdown in the repository gets pull request review, line-level comments, a permanent record, and now an audience of AI tools that read repositories natively. If the Directories review goes well, expect this to become the default for how Polaris designs in the open.

The Polaris Console generated the longest thread of the week. JB is preparing the first release of the Console and asked the community whether the Console belongs in the main Polaris repository. Nineteen replies from seven participants worked through the tradeoffs, which mirror every monorepo debate you have ever seen, with release coupling and shared CI on one side and contributor focus on the other. The Console is clearly past the toy stage, because users are already filing real operational reports. Yong Zheng described how the Console's single-page-app architecture makes the Kubernetes port-forward workflow fail silently when the server and console run in the same namespace behind nginx, and the thread collected four replies of diagnosis. Bug reports like this one are a healthy sign. People only find port-forward edge cases when they are actually deploying the thing.

Release planning settled quickly. EJ Wang volunteered as release manager for Apache Polaris 1.6.0 and proposed targeting Friday, June 26. With 1.5.0 having shipped on May 18, that keeps Polaris on the steady monthly-ish cadence it has held all year, and nobody on the thread pushed back on the date. EJ also posted the notes from the metrics architecture sync in the long-running thread on REST endpoints for table metrics and events, keeping that design moving between calls.

The API design work this week clustered around correctness and interoperability. Huaxin Gao, active on both the Iceberg and Polaris lists this week, asked for wider review of the Idempotency-Key design, converging on Model B. The contract is simple to state and hard to implement: a retry with the same key must not produce additional side effects. Fifteen replies from five participants dug into the simplified design, and this is exactly the kind of plumbing that makes a catalog trustworthy under flaky networks and aggressive client retries. To see why it matters, picture a create-table call that times out at the client after succeeding on the server. Without idempotency keys, the retry fails with an already-exists error or, worse, triggers duplicate side effects downstream. With them, the catalog recognizes the repeated key and returns the original result. Agents make this urgent, because automated clients retry far more aggressively than humans do, and the word cloud on the Polaris list this month, where agentic sits right next to idempotency, suggests the community sees the same connection.

Dennis Huo opened a discussion on adding support for new Open Sharing APIs in Polaris, motivated by the data sharing use case that keeps coming up as enterprises consolidate lakehouses and catalogs and need to grant access across organizational boundaries.

And Adam Szita's thread on Iceberg table encryption support stayed active into this week, now at six participants and seven replies. Iceberg 1.11 shipped the base table encryption implementation with KMS-based key wrapping, and this thread is working out the catalog's half of that story, since encrypted tables only work end to end when the catalog can manage keys. The thread deserves a close read from anyone running regulated workloads. The split of responsibilities is taking shape, with Iceberg defining how data, delete, manifest, and manifest-list files are encrypted and the catalog deciding how keys are issued, rotated, and scoped to principals. When this lands, an encrypted lakehouse stops being a design exercise and becomes a configuration choice, and Polaris is positioning itself as the place where that choice gets made.
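The idempotency contract described earlier in this section (a retry with the same key must not produce additional side effects) is easy to sketch. The toy catalog below is an in-memory illustration only: the class, method, and exception names are invented for this example, and it is not the Model B design under review in Polaris.

```python
class AlreadyExists(Exception):
    """Raised when a table name is taken and no matching idempotency key is known."""


class InMemoryCatalog:
    """Toy catalog (hypothetical) that remembers the result of each keyed request."""

    def __init__(self):
        self.tables = {}
        self._results_by_key = {}

    def create_table(self, name, schema, idempotency_key=None):
        # A repeated key replays the stored result and performs no new side effects.
        if idempotency_key is not None and idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        if name in self.tables:
            raise AlreadyExists(name)
        self.tables[name] = schema
        result = {"table": name, "schema": schema}
        if idempotency_key is not None:
            self._results_by_key[idempotency_key] = result
        return result


catalog = InMemoryCatalog()
first = catalog.create_table("db.events", {"id": "long"}, idempotency_key="req-123")
# The client timed out and retries with the same key: same result, one table.
retry = catalog.create_table("db.events", {"id": "long"}, idempotency_key="req-123")
print(first == retry, len(catalog.tables))
```

A production design has to answer questions this sketch skips, such as what happens when a key is reused with a different payload, how long keys are retained, and how concurrent requests with the same key are serialized. Those are the hard parts the thread is working through.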

Under the hood, the SPI work continued. Tornike Gurgenidze opened a focused discussion on storage credential-vending SPI changes attached to PR #3699, following the broader SPI-surface thread, and Dmitri Bourlatchkov approved the PR, confirming it matches the direction the earlier discussions set. Credential vending is the mechanism that lets Polaris hand engines short-lived, scoped storage credentials, so getting this SPI right matters to every deployment. Robert Stupp restarted the object-storage mock testing discussion to settle the test-infrastructure question explicitly after the PR review went in several directions at once. Dmitri, who led the month's activity with 52 messages, opened two more maintainability threads out of the community sync: one on the future of the regtests code and one on code organization for Spark 3.x and 4.x, after the Spark 4 support work in PR #4535 produced a substantial amount of copied code. Adnan Hemani also resurfaced the OpenLineage proposal in its own thread so reviewers can find it, and eleven replies suggest the lineage integration has real momentum.

Step back and the Polaris picture is striking. In four days the project advanced a flagship proposal, debated its UI strategy, scheduled a release, designed idempotency semantics, opened a data sharing track, progressed encryption, and refactored its credential SPI. This is what a catalog community looks like when it is racing to match the pace of the format underneath it. It is also a reminder of how far Polaris has traveled in two years, from incubating project to the coordination point for encryption, lineage, sharing, idempotency, and a console, all moving in parallel under a monthly release rhythm. The catalog used to be the boring part of the stack. The threads above argue it has become the most interesting one.

Apache Arrow

Arrow closed out a donation and opened a debate about its own protocol surface. Sutou Kouhei, the month's most active poster with 16 messages, announced the result of the vote to donate Apache Arrow Erlang. The vote carried with four binding +1s from Sutou Kouhei, Curt Hagenlocher, Matt Topol, and David Li, with no zeros and no vetoes. The Erlang implementation is built on bindings to the Rust implementation, and the next step is formal IP clearance through a vote on the incubator general list. Once that completes, Arrow adds another language community to a roster that already spans most of the ecosystem, and the BEAM world gets first-class columnar data. The donation also says something about how Arrow grows now. New language communities arrive by wrapping the Rust implementation rather than reimplementing the columnar format from scratch, which keeps behavior consistent across languages and concentrates performance work in one codebase. Erlang and Elixir shops run some of the most demanding soft-real-time systems in production, and giving that platform zero-copy columnar data opens analytics patterns it has never had natively.

The most interesting technical cluster of the week was Flight SQL, where three protocol changes moved in parallel. Tornike Gurgenidze, the same contributor driving the Polaris credential vending work, proposed adding four dialect-related SqlInfo codes to FlightSql.proto, including SQL_SUPPORTED_LIMIT_OFFSET at code 577. The motivation is practical. Clients that compile SQL for many different backends need dialect metadata the protocol does not expose today, and four small codes close real gaps. Meanwhile, the vote on adding an is_update field to ActionCreatePreparedStatementResult collected feedback from four participants, with Jean-Baptiste Onofré adding a non-binding +1 and suggesting the vote run one extra week to give more reviewers time. And Richie Black opened a vote on adding column default value support to JDBC connections through Arrow Flight, implemented in arrow-java PR #1139, which also touches the FlightSql contract. Three concurrent protocol refinements tell one story: Flight SQL is carrying enough production traffic that the gaps between it and traditional database connectivity are getting filled one field at a time.

For readers newer to this corner of Arrow, SqlInfo is the mechanism a Flight SQL server uses to describe itself to clients, covering everything from supported SQL features to type behavior. A JDBC driver or a query tool reads those codes before it compiles SQL, so every missing code forces client-side guesswork. Dialect metadata is the difference between a tool that generates correct pagination syntax for each backend and one that ships per-database hacks. Small protocol changes like these are unglamorous, and they are exactly what turns a wire protocol into a platform.
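Here is a small sketch of what reading dialect metadata before compiling SQL could look like on the client side. Only the code number 577 comes from the proposal thread as reported above. The value constants and the `paginate` helper are hypothetical, since the proposal's actual value encoding is not covered here.

```python
SQL_SUPPORTED_LIMIT_OFFSET = 577  # code number as reported in the proposal thread

# Hypothetical values a server might advertise; the real encoding may differ.
LIMIT_OFFSET = "limit_offset"
OFFSET_FETCH = "offset_fetch"


def paginate(base_query: str, sql_info: dict, limit: int, offset: int) -> str:
    """Pick pagination syntax from server-advertised metadata instead of guessing."""
    style = sql_info.get(SQL_SUPPORTED_LIMIT_OFFSET)
    if style == LIMIT_OFFSET:
        return f"{base_query} LIMIT {limit} OFFSET {offset}"
    if style == OFFSET_FETCH:
        return f"{base_query} OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY"
    raise ValueError("server did not advertise a pagination style")


# Two imaginary backends that describe themselves differently.
print(paginate("SELECT * FROM t ORDER BY id", {577: LIMIT_OFFSET}, 10, 20))
print(paginate("SELECT * FROM t ORDER BY id", {577: OFFSET_FETCH}, 10, 20))
```

The point of the sketch is the control flow: the client branches on what the server says about itself, not on a hardcoded list of database names.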

The format side picked up a fresh proposal as well. Florian R. Hölzlwimmer, following a suggestion from Rok Mihevc on GitHub, opened a discussion on adding an arrow.range canonical extension type for bounded ranges. Arrow has no canonical representation for ranges today, and the thread drew five participants and six replies working through the design space. Canonical extension types are how Arrow grows its type system without touching the core spec, and ranges are a frequent request from the scientific and genomics communities where bounded intervals are everywhere.

On the release front, Jean-Baptiste Onofré posted a heads up that Arrow Java 20.0.0 preparation is underway and that he is triaging GitHub issues, inviting anyone with changes they want included in the release to speak up now. The word cloud on the list this month also shows the steady drumbeat of Rust patch releases, with 56.2.1, 57.3.1, and 58.3.0 all in circulation.

Then there is the thread that best captures where open source is heading. Wes McKinney, the project's co-creator, wrote in about the status of Arrow Conbench data and the Conbench OSS project. He noticed that conbench.ursa.dev has been down, wants continuous project benchmarks running again, and is interested in picking up Conbench development, with his AI agents doing the work. Conbench is the continuous benchmarking framework the Arrow community built to catch performance regressions commit by commit, and a hosted instance going dark means the project loses one of its early-warning systems. For a library whose entire value proposition is speed, continuous benchmarks are not a nice-to-have, so reviving the tooling matters beyond nostalgia.

Read the agent mention twice, though. The person who started Arrow is now describing agent-driven contribution as a casual aside in an infrastructure email. Combined with the auto Copilot review discussion Sutou Kouhei opened in late May, Arrow is becoming the clearest case study of an Apache project absorbing AI into its daily workflow.

Community logistics filled out the week. Ian Cook announced ADBC Office Hours on June 11, hosted with Columnar and featuring David Li, Curt Hagenlocher, and Felipe Oliveira Carvalho, and reminded everyone of the biweekly community meeting on June 3. Rich Bowen shared next steps for the Community over Code Glasgow 2026 hackathon, confirming Arrow's participation in the October event.

Apache Parquet

Parquet's week was about governance in the deepest sense, with the community asking how the format itself should version and evolve. Daniel Weeks opened the big one, a discussion on the future of Parquet versioning that he promised at a recent community sync. Twelve replies from seven participants worked through how the format signals capability to readers and writers, a question that has circled the project for years and gains urgency every time a new feature like geometry types or variant lands. Versioning sounds dry until you remember that every engine, every language implementation, and every stored file on earth has to agree on what a version number means. The hard part is that Parquet's installed base is effectively permanent. Files written a decade ago still get read every day, and no version scheme can assume writers and readers upgrade together. The discussion has to balance a reader's need to know whether it can safely consume a file against a writer's need to adopt new encodings without waiting years for the ecosystem to catch up. How the community answers will shape how fast recent additions like new types and the footer work reach production deployments.

The path_in_schema work crossed a threshold this week. Ed Seidl posted an update on making ColumnMetaData.path_in_schema optional, reporting that a third proof-of-concept implementation now exists in arrow-cpp and that a test file written without the field, created with arrow-rs, has been submitted to parquet-testing. With three implementations proving the change works, he then opened the formal vote on GH-563, drawing seven participants and six replies. The field repeats schema information in every column chunk's metadata, so making it optional trims fat from footers in wide tables, and the careful PoC-first process is a model for how format changes should land. The mechanics explain the payoff. path_in_schema carries the full column path inside each chunk's metadata, information the footer schema already holds once. In a table with thousands of columns across many row groups, that duplication adds real bytes to every footer and real time to every metadata parse. Letting writers drop it is a format-level change, and three independent implementations plus a shared test file in parquet-testing are the diligence that makes it safe.
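A quick estimate shows how the duplication scales. The sketch below counts only the UTF-8 bytes of each path element plus an assumed one-byte length prefix per element, and ignores list headers and every other metadata field. The table shape (5,000 nested columns, 200 row groups) is invented for illustration.

```python
def path_in_schema_bytes(column_paths, row_groups, length_prefix_bytes=1):
    """Rough bytes spent repeating column paths across all column chunks in a footer.

    Assumes one column chunk per column per row group, and a fixed length prefix
    per path element (an approximation, not an exact Thrift size calculation).
    """
    per_row_group = sum(
        len(part.encode("utf-8")) + length_prefix_bytes
        for path in column_paths
        for part in path
    )
    return per_row_group * row_groups


# Hypothetical wide table: 5,000 leaf columns under one struct, 200 row groups.
paths = [("features", f"f_{i:04d}") for i in range(5000)]
total = path_in_schema_bytes(paths, row_groups=200)
print(f"{total / 1e6:.1f} MB of repeated column paths in one footer")
```

With these made-up numbers the repeated paths alone come to about 16 MB, all of which a reader must parse before it touches a single data page.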

Release energy built on two fronts at once. Gang Wu opened a discussion on releasing parquet-format 2.13.0, noting that about nine months have passed since 2.12.0 shipped on August 28, 2025, and that meaningful updates have accumulated since. Five participants weighed in across six replies. The same day, on the Java side, Fokko Driesprong proposed releasing Apache Parquet Java 1.18.0, pointing out that the project is well past its quarterly release rhythm with a lot of accumulated work. And Ismaël Mejía proposed bumping the minimum Java version to 17 for Parquet Java, since Java 17 has been the baseline LTS since September 2021, nearly five years ago. Iceberg made the same move in its 1.11 cycle, so the lakehouse Java stack is converging on 17 as its floor.

Two long-running technical threads advanced. Rahul Sharma revived the INT96 stats discussion with a concrete plan to land Option 1 from Micah Kornfield's earlier summary, keeping INT96 ordering undefined in the format while letting readers opt in through an allow-list, and he has an open parquet-java PR to do it. INT96 timestamps are deprecated but far from dead in stored data, so pragmatism beats purity here. And Russell Spitzer nudged the discussion on a new File logical type, asking whether the proposal mentioned at the last sync exists yet, in the thread Burak Yavuz started in April. A File type gives Parquet a first-class way to reference external content, which matters more as multimodal and AI workloads push files-about-files into analytics tables.
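The opt-in idea behind the INT96 plan can be sketched in a few lines. The assumption here, and it is only an assumption, is that the allow-list is keyed on the writer string recorded in the file (`created_by`). The function names are hypothetical and do not mirror the parquet-java PR.

```python
def int96_stats_usable(created_by: str, allow_list: tuple) -> bool:
    """Trust INT96 min/max only for writers the reader has explicitly allowed."""
    return any(created_by.startswith(prefix) for prefix in allow_list)


def can_skip_row_group(created_by, stats_max, lower_bound, allow_list) -> bool:
    """Decide whether a row group can be skipped for the predicate ts >= lower_bound.

    Without trusted statistics the safe answer is always "do not skip".
    """
    if not int96_stats_usable(created_by, allow_list):
        return False
    return stats_max < lower_bound


# Hypothetical writer names; timestamps shown as plain integers for simplicity.
allow = ("trusted-writer",)
print(can_skip_row_group("trusted-writer version 1.0", 100, 500, allow))  # stats trusted
print(can_skip_row_group("unknown-writer version 0.1", 100, 500, allow))  # stats ignored
```

The design choice worth noticing is the default: a reader that does not recognize the writer simply reads the row group, so correctness never depends on an ordering the format leaves undefined.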

Community mechanics stayed healthy. Julien Le Dem ran the June 3 sync and then asked for a volunteer facilitator for the June 17 sync while he is on vacation, with three replies already sorting out coverage. Andrew Bell asked the evergreen newcomer question, where to find test files to validate a reader, and got pointed at the project's testing resources. Micah Kornfield led the month's activity with 16 messages, with Gang Wu close behind at 14.

Apache DataFusion

New to this newsletter, Apache DataFusion is the Rust-native query engine that increasingly powers the execution layer of the lakehouse, and its dev list runs at a pace that fits its roughly monthly major release cadence.

A quick orientation for readers meeting the project here for the first time. DataFusion is an embeddable query engine written in Rust and built on Apache Arrow's columnar memory model. Where Iceberg defines tables, Polaris catalogs them, and Parquet stores them, DataFusion reads, plans, and executes queries over all of it, and a long list of commercial and open source systems embed it as their execution core. The project started life inside Arrow before graduating to its own top-level Apache project, and its subprojects extend the engine in different directions. Comet accelerates Spark by translating Spark physical plans into DataFusion execution. Ballista distributes execution across nodes. A growing set of language bindings carries the engine beyond Rust, and this week showed exactly why that structure earns DataFusion a permanent slot in this newsletter.

The week proved the point on cadence too. Andrew Lamb cut the release-54 branch on May 21 and, by June 4, had the vote open on DataFusion 54.0.0 RC1, based on commit 45d943df, with six participants already verifying the candidate. From branch cut to release candidate in two weeks is normal operating speed for this project, and it is worth pausing on how unusual that is for a foundation project with this many downstream consumers.

The bigger strategic story is the JVM. Andy Grove, who drove the month with 18 messages, announced that the vote on Apache DataFusion Java 0.1.0 RC1 passed with five +1 votes, three of them binding, making it the first release of the new DataFusion Java subproject. The bindings, which Andy seeded in mid-May as a minimal JNI bridge that registers Parquet tables and executes queries from the JVM, plan and run everything in native Rust and hand results back to Java. He owned a timing mistake in the vote process with characteristic transparency, and the release stands. The significance is hard to overstate for this audience. The data ecosystem's center of mass still runs on the JVM, and DataFusion Java gives every Java shop a path to Rust-speed query execution without leaving their stack. The 0.1.0 scope is deliberately small, enough to register Parquet tables and run SQL and DataFrame queries end to end, which is the right way to start a binding. Ship the thin slice, prove the JNI boundary holds, and grow the API with real users instead of guessing at one in advance. Anyone who watched PyIceberg or the Iceberg Go client grow from similar seeds knows how quickly a minimal binding becomes load-bearing infrastructure.

Ballista, the distributed execution subproject, had a full week of its own. Marko Milenković announced that the Ballista 53.0.0 release vote passed with five votes, four binding. Andy Grove published a test version of Ballista to test.pypi.org and asked for help verifying the first Ballista PyPI release, which brings distributed DataFusion within pip-install reach. And Martin Grigorov relayed the team's proposal to drop the Windows CI workflows for Ballista, citing two reasons: the foundation-wide discussion about runner consumption and the lack of demonstrated interest in better Windows support for Ballista.

That first reason connects to the same letter Iceberg received. Robert Thomson wrote to the DataFusion PMC about the project's consumption of ASF shared GitHub-hosted runners, with the shared pool at or very close to its limit under the foundation's 2024 GitHub Actions policy. Five replies in, DataFusion is already acting, and the Ballista Windows decision shows what the response looks like in practice. Projects are starting to treat CI minutes as a budget line and cut the platforms and matrices that do not earn their cost.

Design work continued underneath the release activity. Gene Bordegaray proposed introducing a Range partitioning variant to the engine. DataFusion currently models partitioning as Hash, RoundRobinBatch, or UnknownPartitioning, and that vocabulary cannot accurately represent some real partitioning schemes, which limits what the optimizer can prove about data layout. A Range variant lets the engine reason about ordered, range-partitioned data, which is exactly the shape of most lakehouse tables. Andy Grove also surfaced a discussion about adding geospatial support in Comet, capturing a conversation that started in a now-closed PR so the wider community can weigh in. Comet, the Spark accelerator that translates Spark physical plans into DataFusion execution, is heading toward a 1.0.0 release targeted for July or August, with a proposed versioning policy and a plan to drop Spark 3.4 support under discussion since mid-May. Geospatial functions in Comet promise accelerated spatial analytics inside existing Spark deployments, no migration required.
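To see what range partitioning lets an optimizer prove, consider this minimal sketch. It is plain Python, not DataFusion's API, and the function names and split points are invented. Partitions are defined by sorted split points, so a range predicate maps to a contiguous subset of partitions, something a hash layout cannot offer.

```python
from bisect import bisect_right


def partition_for(value, split_points):
    """Index of the partition holding `value`.

    Partition 0 holds values below split_points[0], partition i holds
    split_points[i-1] <= value < split_points[i], and the last partition
    holds everything at or above the final split point.
    """
    return bisect_right(split_points, value)


def partitions_for_range(low, high, split_points):
    """Partitions that can contain values in the inclusive range [low, high]."""
    return list(range(partition_for(low, split_points),
                      partition_for(high, split_points) + 1))


# Hypothetical layout: four partitions split at 100, 200, and 300.
splits = [100, 200, 300]
print(partitions_for_range(150, 250, splits))
```

Here a filter on values between 150 and 250 touches only partitions 1 and 2. With hash partitioning the same filter would have to consider all four, because hashing scatters neighboring values.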

For readers new to the project, the week is a fair sample of why DataFusion now belongs in this newsletter. One engine shipped a release candidate, a new language binding cut its first release, a distributed runtime shipped and reached for PyPI, a Spark accelerator marched toward 1.0, and the core team debated partitioning semantics, all in seven days.

Cross-Project Themes

The clearest cross-project signal this week came from outside the projects entirely. Robert Thomson delivered the same message to both the Iceberg and DataFusion PMCs in nearly identical letters: the ASF's shared pool of GitHub-hosted runners has been at or near its limit, and the 2024 GitHub Actions policy is now a constraint these communities have to plan around. The responses are already visible. Ballista is proposing to drop Windows CI, and Iceberg's fourteen-reply thread reads like the start of a real CI budget process. Put this next to the engine-version threads, with Iceberg debating Spark and Flink support windows, Polaris untangling Spark 3 and 4 code organization, Parquet proposing to raise its Java floor to 17, and Comet planning to drop Spark 3.4, and a single picture emerges. The lakehouse stack's support matrix has grown faster than the infrastructure that tests it, and 2026 is the year the bill arrived. Expect narrower version windows and leaner CI matrices across all five projects by year end.

The second theme is the release train. Counting this week alone, the five projects had a release in active vote (DataFusion 54.0.0), a first-ever release completed (DataFusion Java 0.1.0), a release passed (Ballista 53.0.0), a release scheduled (Polaris 1.6.0 for June 26), two release discussions opened (parquet-format 2.13.0 and Parquet Java 1.18.0), a release in preparation (Arrow Java 20.0.0), a release plan forming (Iceberg C++ 0.3.0), and a release tracking issue opened (iceberg-rust 0.10.0). The post-1.11 lull some expected never happened. The ecosystem ships continuously now, and the projects that built lightweight release machinery, DataFusion above all, set the pace the others are converging toward.

The third theme is quieter but more consequential. AI is moving inside the projects' own workflows. Wes McKinney mentioned, almost in passing, that his agents will do the development work on Conbench. Jean-Baptiste Onofré is restructuring how Polaris writes proposals so AI tools can participate in authoring and review, and the Directories proposal is the first test. Arrow spent late May debating automated Copilot review of pull requests. None of these communities is debating whether to use AI anymore. They are debating where it fits in governance, and that is a different and more mature conversation. The interesting question for the rest of 2026 is whether the foundation develops shared norms here, the way it did for release votes and IP clearance, or whether each project keeps writing its own rules. The volume of AI-assisted contributions is only going up, and the projects that decide their policies now, calmly and in public, will handle that volume better than the ones that wait for an incident to force the question.

A final, smaller observation: watch the people who span projects. Tornike Gurgenidze drove a Flight SQL protocol proposal in Arrow and a credential-vending SPI refactor in Polaris in the same week, and Huaxin Gao moved a REST spec vote in Iceberg while converging the idempotency design in Polaris. The lakehouse is one stack, and its most effective contributors increasingly work it that way.

For practitioners, the week's takeaway is about timing. The next ninety days bring DataFusion 54, Polaris 1.6.0, likely Parquet releases on both the format and Java sides, Arrow Java 20.0.0, and the first wave of post-1.11 Iceberg subproject releases. Teams planning platform upgrades get a rare window where the whole stack refreshes together, and the version-support threads above are advance notice of which older engine combinations are about to fall off the supported list. Reading the dev lists this week was cheaper than reading the release notes next quarter.

Looking Ahead

The DataFusion 54.0.0 vote should resolve within days, and the Parquet path_in_schema vote on GH-563 is the format change to watch. The Arrow is_update vote runs an extra week per JB's suggestion, and the Erlang donation moves to the incubator general list for IP clearance. Polaris has a packed June, with the Console repository decision, the Directories proposal review on PR #4613, and the 1.6.0 release targeted for June 26. The Parquet sync on June 17 still needs a facilitator, and ADBC Office Hours land on June 11. On the Iceberg side, the column update threads are the design work that will define the next spec cycle, and the ASF runner discussions on both the Iceberg and DataFusion lists deserve attention from anyone whose CI depends on foundation infrastructure. Next week also brings the Arrow community's next checkpoints on the Conbench revival and the arrow.range extension type discussion, plus whatever follow-up the Polaris Directories proposal draws once reviewers digest PR #4613. Above all, keep an eye on the Parquet versioning thread, because its outcome touches every project in this newsletter.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free and build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow and learn how Dremio brings the open lakehouse stack together

Free Downloads