Alex Merced

Apache Data Lakehouse Weekly: June 24 to July 1, 2026

The open lakehouse runs on a small stack of Apache projects, and this was a week where those projects spent most of their energy on the boring work that makes software trustworthy. Iceberg voted to lock down the meaning of expressions and named identities for functions, then turned around and asked a harder question: when five different codebases all claim to follow the same spec, how do we prove they agree? Polaris worked through the plumbing of running one catalog on many databases without a rebuild, welcomed a new committer, and failed a release vote for the right reasons. Parquet dug into what a version number even means once features ship faster than releases. Arrow rebuilt its benchmarking service, partly with an AI agent doing the typing. DataFusion cut a clean release of its Python bindings. Below is what the community built and argued about, and why each thread matters for anyone running data on open formats.

If you are new to this stack, here is the quick map. Parquet is the file format that stores your data as columns on cheap object storage. Arrow is the in-memory format that moves those columns between tools fast. Iceberg is the table format that turns a pile of Parquet files into a real table with schema changes, time travel, and safe concurrent writes. Polaris is the catalog that keeps track of which tables exist and who can read them. DataFusion is a query engine that runs SQL over all of it. These five projects fit together into what people call the open lakehouse, a way to run analytics and AI on open standards instead of a single vendor's closed system. When these projects agree on a spec, your data stays portable. When they drift, you get locked in by accident. That is why the correctness work below matters as much as any flashy feature.

Apache Iceberg

The headline this week was a vote, and it was a big one. Ryan Blue opened a vote to adopt the new expressions spec, a document that defines the minimal structure and behavior of expressions in Iceberg. An expression is the part of a query that filters or transforms data, the piece that says "where the date is after January first" or "where the region equals west." For years each Iceberg implementation carried its own idea of how expressions should behave. Writing that behavior down in a shared spec sounds dull until you realize what it unblocks. Once every engine agrees on exactly what an expression means, features that depend on precise filtering can move forward without each team guessing at the details. The vote drew strong support fast. Steven Wu and Szehon Ho gave binding plus-ones, with Szehon calling the definition elegant after reading it through. Anoop Johnson, Manu Zhang, and Andrei Tserakhau added non-binding support, and Manu said plainly that he was excited about the use cases the spec opens up. Thirty-two messages later, the thread stood as one of the most active of the week across every project on this list.
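To make the idea of an expression concrete, here is a minimal sketch of a filter represented as a small tree and evaluated against one row. The class names (`Ref`, `Lit`, `Cmp`, `And`) and the `evaluate` function are hypothetical illustrations, not the structure defined by the Iceberg expressions spec, which the thread summary above does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Mapping


@dataclass(frozen=True)
class Ref:
    """Reference to a column by name (hypothetical node type)."""
    name: str


@dataclass(frozen=True)
class Lit:
    """A literal value."""
    value: Any


@dataclass(frozen=True)
class Cmp:
    """Binary comparison; op is one of "eq", "gt", "lt"."""
    op: str
    left: Any
    right: Any


@dataclass(frozen=True)
class And:
    """Logical conjunction of two expressions."""
    left: Any
    right: Any


def evaluate(expr: Any, row: Mapping[str, Any]) -> Any:
    """Evaluate an expression tree against a single row."""
    if isinstance(expr, Ref):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, Cmp):
        left, right = evaluate(expr.left, row), evaluate(expr.right, row)
        if expr.op == "eq":
            return left == right
        if expr.op == "gt":
            return left > right
        if expr.op == "lt":
            return left < right
        raise ValueError(f"unknown operator: {expr.op}")
    if isinstance(expr, And):
        return evaluate(expr.left, row) and evaluate(expr.right, row)
    raise TypeError(f"not an expression: {expr!r}")


# "where the date is after January first" and "where the region equals west"
row_filter = And(
    Cmp("gt", Ref("date"), Lit(date(2026, 1, 1))),
    Cmp("eq", Ref("region"), Lit("west")),
)
```

Even in a toy like this, the open questions a spec has to settle are visible: what a comparison against a missing value returns, how mismatched types are handled, and whether conjunction short-circuits. Those are the details each implementation previously decided for itself.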

Right behind it, Szehon Ho started a vote to add a specific-name field to the UDF spec. A UDF is a user-defined function, a bit of custom logic a user writes and then calls by name inside SQL. The problem Szehon set out to fix is a familiar one from the SQL standard. A single function name can have several versions that each take different inputs, and a catalog needs a way to point at one exact version rather than the whole family. The specific-name field gives Iceberg that pointer, matching the concept from the SQL spec directly. Yufei Gu, huaxin gao, Yuya Ebihara, Ryan Blue, Manish Malhotra, Prashant Singh, and Russell Spitzer all weighed in. Taken together with the expressions vote, the week showed a project methodically pinning down the small pieces of behavior that a mature standard needs before the fancy features on top of it can be built safely.
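The overload problem can be sketched in a few lines. Everything here is an assumption made for illustration: the `Overload` class, the `OVERLOADS` registry, and the function names are hypothetical and do not reflect the actual layout of the Iceberg UDF spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Overload:
    specific_name: str             # unique handle for exactly one version
    param_types: Tuple[type, ...]  # the inputs this version accepts
    impl: Callable[..., object]    # the function body


# One function name, two versions (all names are hypothetical).
OVERLOADS: Dict[str, Tuple[Overload, ...]] = {
    "add_one": (
        Overload("add_one_int", (int,), lambda x: x + 1),
        Overload("add_one_str", (str,), lambda s: s + "1"),
    ),
}


def by_specific_name(specific_name: str) -> Overload:
    """Return the single overload identified by its specific name."""
    for family in OVERLOADS.values():
        for overload in family:
            if overload.specific_name == specific_name:
                return overload
    raise KeyError(specific_name)
```

Looking up `"add_one"` returns the whole family, while `by_specific_name("add_one_str")` points at exactly one version. That second kind of pointer is what a catalog needs when it has to reference or drop a single overload.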

The most interesting design conversation was not a vote at all. Neelesh Salian opened a discussion on cross-implementation conformance testing, and it struck a nerve. Iceberg now has five separate codebases in five languages: Java, Python, Rust, Go, and C++. Each one ships its own tests. What none of them share is a way to check that a table written by one gets read the same way by another. Neelesh framed the gap clearly, and Matt Topol jumped in to say this is something he had wanted for a long time, citing case after case where implementations quietly disagreed. Tanmay Rauth put the real risk into words that stuck: the hardest problems are not the outright bugs, they are the cases where two implementations both look correct and still produce different results. Danny Jones said his team had already been building similar test sets and welcomed a shared reference. The value of a physical reference artifact, a real table that everyone tests against, came up again and again.

That thread did not stand alone. Sung Yun proposed a shared cross-language test fixtures repository called iceberg-testing in the same window, driving at the same problem from a slightly different angle. He named the same five language implementations and the same slow drift in how each one reads the spec. Anurag Mantripragada connected the two threads directly and asked whether they were really one effort. Sung agreed the proposals overlapped and said the two of them had synced offline to converge. This is healthy community behavior worth pointing out. Two contributors saw the same gap, wrote it up independently, noticed the overlap, and started merging their work rather than competing. The result should be a single shared test suite and a set of reference tables that every Iceberg implementation checks itself against. For teams that run mixed engines, a Rust reader here and a Java writer there, that guarantee is the difference between trusting your data and hoping it lines up.
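A rough sketch of what a shared reference check might look like, under heavy assumptions: the fixture is a plain in-memory list rather than a real Iceberg table, and `reader_a` and `reader_b` are hypothetical stand-ins for two language implementations. Both readers look reasonable on their own, which is the failure mode Tanmay described.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[int]]
Reader = Callable[[List[dict]], List[Row]]

# Stand-in for a shared reference table: raw records, one missing a field.
FIXTURE: List[dict] = [{"id": 1, "qty": 5}, {"id": 2}]

# The rows every implementation is expected to produce from FIXTURE.
EXPECTED: List[Row] = [{"id": 1, "qty": 5}, {"id": 2, "qty": None}]


def reader_a(records: List[dict]) -> List[Row]:
    """Hypothetical reader that fills a missing field with null."""
    return [{"id": r["id"], "qty": r.get("qty")} for r in records]


def reader_b(records: List[dict]) -> List[Row]:
    """Hypothetical reader that fills a missing field with zero."""
    return [{"id": r["id"], "qty": r.get("qty", 0)} for r in records]


def check_conformance(readers: Dict[str, Reader]) -> Dict[str, bool]:
    """Run every reader on the same fixture and compare to the reference."""
    return {name: read(FIXTURE) == EXPECTED for name, read in readers.items()}


results = check_conformance({"a": reader_a, "b": reader_b})
```

Here `results` marks reader `a` as conforming and reader `b` as not, even though each reader would pass a test suite written by its own authors. The shared fixture and the agreed expected output are what expose the disagreement.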

A meatier spec debate ran through the column update file representation thread. The question was how to store updates to individual columns, and the choice came down to dense versus sparse layouts. Steven Wu argued that supporting both options forces every engine to implement the more complex sparse read path, which raises the cost for everyone. Gábor Kaszab agreed that the case against a dense-only layout was not strong and that dense is more straightforward to implement across languages. Andrei Tserakhau made the sharpest point in favor of picking one and mandating it. Dense, he noted, is just a special case of sparse, so the two are not symmetric. Allowing both means every reader carries the heavier code even when the data never needs it. The thread leaned toward mandating a single dense representation now, with room to add column families later for teams that want separate files per group of columns. This is the kind of decision that never makes a headline and shapes performance for years.
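Andrei's asymmetry argument is easy to see in a conceptual model. The sketch below treats a column as a Python list and is not the file representation under discussion, which the thread summary does not describe; `apply_sparse` and `apply_dense` are hypothetical names.

```python
from typing import List, Sequence


def apply_sparse(base: Sequence, positions: Sequence[int], values: Sequence) -> List:
    """Sparse update: overwrite only the listed row positions."""
    if len(positions) != len(values):
        raise ValueError("positions and values must have equal length")
    out = list(base)
    for pos, val in zip(positions, values):
        out[pos] = val
    return out


def apply_dense(base: Sequence, values: Sequence) -> List:
    """Dense update: one new value for every row, no positions needed."""
    if len(values) != len(base):
        raise ValueError("a dense update must cover every row")
    return list(values)


base = ["a", "b", "c", "d"]
new = ["A", "B", "C", "D"]

sparse_result = apply_sparse(base, [1, 3], ["B", "D"])
dense_result = apply_dense(base, new)

# Dense is the sparse case where the positions are simply 0..n-1.
dense_via_sparse = apply_sparse(base, range(len(base)), new)
```

The dense path needs no position bookkeeping at all, while a reader that must accept sparse input has to carry the position-matching logic whether or not any file ever uses it.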

Housekeeping got real attention too. Kevin Liu proposed cutting continuous integration time by running the JDK 21 test suite only on the main and nightly branches, keeping JDK 17 on every pull request. Continuous integration is the automated system that runs the full test suite on each proposed change. Running two Java versions on every pull request burns a lot of shared compute, and Iceberg has been watching its usage of the ASF pool of GitHub-hosted runners. Russell Spitzer asked the practical question of who gets alerted when a nightly build breaks on Java 21, and Kevin pointed to the GitHub interface and the continuous integration notification list. Danny Jones floated GitHub merge queues as an alternative, and Russell said turning off the extra run is simply cheaper, since a change that breaks only Java 21 and not Java 17 is rare. Ajantha Bhat tied it back to the broader Iceberg consumption of ASF shared runners thread and confirmed a first merged step toward using one Java version for pull request checks. Small change, real savings, and a sign of a project that has grown large enough to care about its compute footprint.

On the release front, Danny Jones called a vote to release Apache Iceberg Rust 0.10.0 RC1, with Manu Zhang, Rich Bowen, and L. C. Hsieh among those checking the candidate. The Rust implementation keeps shipping at a steady clip, and its progress is part of why teams that want Iceberg without a Java runtime now have a real path. Kevin Liu and Amogh Jahagirdar also sorted out whether to cut 1.11.1 and 1.10.3 patch releases on the Java side, working through which fixes belong in which milestone so the production branches stay clean while newer work continues on the main line.

Several forward-looking proposals landed that point at where Iceberg is headed. Talat Uyarer opened a discussion on a FileRef type for unstructured objects, aimed at letting tables reference images, video, and machine learning artifacts through catalog-brokered access rather than raw paths. Jean-Baptiste Onofré welcomed the idea and pointed to a parallel discussion already running in Polaris, a reminder that the two projects share a lot of surface. On the transactional side, Matt Butrovich pointed a new contributor toward the ongoing work on first-class primary key tables, part of a wider push to add constraint support, including primary keys, to a format that started life as an append-friendly analytics store. huaxin gao summarized the latest index support sync and shared a decision that reads like a small philosophy statement: an index is not a table, it is its own kind of object that reuses table machinery under the hood. William Hyun proposed file-level access delegation in the REST catalog spec, since delegated access today is scoped to whole tables and some workloads need finer control during scan planning. Walaa Eldin Moustafa cross-posted to the Iceberg and Spark lists to gather input on how Spark should route queries against Iceberg materialized views. Andrei Tserakhau moved the collation support discussion from talk to code with a spec-change pull request and reference implementations in both Go and Java. Sunmin Lee raised a geospatial design question about declaring row-level bounding-box covering columns that mirrors the GeoParquet bbox pattern. And Tomohiro Tanaka asked for feedback on a table_properties_log metadata table that exposes the history of table property changes. Read as a group, these threads show a format stretching in three directions at once: toward unstructured data, toward transactional guarantees, and toward richer types, all while the votes above keep the core precise.

Apache Polaris

Polaris had the busiest mailing list of the week by raw volume, and the through-line was a project learning to say no to complexity. Dmitri Bourlatchkov opened a discussion on modular design for new features that set the tone. Polaris has been drawing a flood of interesting proposals, which is good news for a young project. The flip side is that every new feature bolted into the core makes the whole system harder to keep stable and simple to run. Dmitri asked the community to think about how to add capabilities without turning the codebase into a tangle. Russell Spitzer agreed that features should not be coupled into core and runtime in ways that are hard to unwind. Dmitri then pushed back on his own framing, saying he was not convinced every new proposal needs its own isolated Gradle module and staged rollout, a nice example of a maintainer arguing against overcorrection. Anand Kumar Sankaran brought a real-world angle from Workday, which consumes Polaris as a set of Maven dependencies and layers custom authentication and listeners on top. Yufei Gu and Robert Stupp joined the debate over when a feature earns its own module and when that is just proliferation for its own sake. The conclusion trended toward judgment over blanket rules, which is the right answer even if it is harder to enforce.

That governance thread was not abstract. It was the backdrop for the semantic layer support discussion, one of the week's richest at eleven messages. The plan is to store Open Semantic Interchange data as Polaris entities, giving the catalog a home for the business definitions that sit above raw tables. Dmitri worried about hard dependencies running from the runtime and service layers into the new semantic API implementation. Yufei Gu asked why an empty HTTP layer for a disabled feature is a problem, suggesting a plain 404 or 501 response when the feature is off. Alexandre Dutra said he is not a fan of gating an entire API behind a feature flag, and raised a security angle: if a vulnerability shows up in code that ships unconditionally to every user, everyone is exposed even if they never turn the feature on. Romain Manni-Bucau split the question into code modularity, where he saw little debate, and the harder question of what the default Docker image should contain. Yufei countered that a security fix generally lands against the whole project regardless of where an API lives, so the vulnerability argument does not cleanly favor modules. The debate stayed civil and specific, and it fed straight into a vote. Yufei Gu called a vote to accept the OSI Semantic Model API specification, which introduces the initial scaffolding for that semantic model work.

Two threads dug into how Polaris talks to catalogs and where its boundaries sit. Alexandre Dutra led a discussion on non-IRC endpoints in IRC config responses, asking whether the Iceberg REST config endpoint should double as a universal capability discovery tool for Polaris. He and Dmitri agreed that repurposing the config endpoint for everything looks like a misuse of its intent. Dmitri added a technical caveat about the endpoints logically sitting under the catalog base URI at /api/catalog, and the two worked through how policy endpoints and generic table endpoints fit that structure. Yufei Gu proposed extracting the config implementation out of the Iceberg catalog handler so it can serve broader needs. This is the sort of boundary work that keeps a catalog from turning into a junk drawer of unrelated APIs.

On the operational side, Alexandre Dutra opened a discussion on supporting multiple datasources with runtime activation. The goal is simple to state and useful in practice: let an operator switch the backing database, say from PostgreSQL to MySQL, without rebuilding Polaris from source. Yufei Gu asked whether Polaris should manage its own connection pools with a tool like HikariCP. Dmitri framed Alexandre's pull request as an incremental technical improvement to the Quarkus server that does not force any redesign but opens the door to more flexible deployments later. Russell Spitzer asked for clarity on exactly what the change delivers, and Alexandre confirmed the assessment and separated his work from a parallel MySQL effort so the two do not collide. This connects to a longer cleanup arc. In a related thread, Alexandre also moved forward on deprecating the TreeMap-based metastore and its companions for eventual removal, part of a push toward a cleaner default persistence story. Robert Stupp separately revived the discussion on replacing MinIO for S3-compatible storage on the test side, weighing which backend best serves getting-started examples versus test suites.

Polaris also spent time on a real infrastructure question that touches every Java project in the ecosystem. Robert Stupp opened a discussion on readiness for Jackson 3, the new major version of Jackson, the widely used Java library for reading and writing JSON. Robert was careful to say this is not about jumping to Quarkus 4 right now, since Polaris still runs on Quarkus 3. The aim is to prepare for Jackson 3 gradually so the project avoids one giant risky upgrade later. Alexandre Dutra liked the incremental path and asked to understand the impact better. Romain Manni-Bucau suggested a different direction: lean on the JSON-P and JSON-B standards as the API so any vendor can supply the implementation, rather than binding deeply to one JSON library. Jean-Baptiste Onofré agreed the two are related efforts and saw value in the standards approach. Robert wanted to nail down the Polaris-specific impact first, and noted the Iceberg side will need the same conversation eventually. Boring on the surface, load-bearing underneath: choices like this decide how painful the next five years of upgrades will be.

The release story taught a small lesson in doing things right. Jean-Baptiste Onofré voted minus one, binding, on the Apache Polaris 1.6.0 rc0 candidate after finding a missing Spark bundle artifact and some LICENSE issues in the source distribution. EJ Wang had cut the candidate, and the vote did not pass. A failed release vote is not a failure of the project. It is the process working. The checks caught real problems before they reached users. With EJ heading out on vacation, Jean-Baptiste volunteered to prepare the 1.6.0 rc1 candidate himself, and the release target sat around late June. That kind of hand-off, one contributor picking up another's work without ceremony, is what keeps a community project moving when any single person steps away.

There was good news for the people behind the code too. Jean-Baptiste Onofré announced Nandor Kollar as a new Polaris committer, and the congratulations poured in from Robert Stupp, Alexandre Dutra, Ajantha Bhat, Adam Christian, Dmitri Bourlatchkov, Yufei Gu, and Kevin Liu. New committers matter more than they seem to. Each one widens the group of people trusted to review and merge code, which spreads the load and speeds up the whole project. In the same warm register, Jean-Baptiste let the list know he was back after several weeks of travel and planned to return to his usual pace, drawing friendly replies from Kevin Liu, Danica Fine, and Keith Chapman.

A cluster of smaller design threads rounded out the week and showed the catalog maturing at the edges. Yufei Gu and Dmitri worked through a proposal for REST endpoints exposing table metrics and events, with Yufei flagging that the current query API shape is too tied to the example metrics in the Iceberg REST spec. Grace Chen requested reviews for the first phase of entity-level filtering for list operations, part of a visibility filtering proposal that decides which users see which entities. Dmitri argued for not exposing authorization denial details in 403 messages, preferring a random reference ID over leaking why access was denied, a sound security instinct. huaxin gao and Dmitri converged on Model B in the idempotency-key design for the Iceberg REST catalog, which stamps a key into the entity to make repeated requests safe. Two community threads also stood out: Rich Bowen reached out about recording a PlusOne interview, the ASF's short conversation series about project communities, and Kevin Liu flagged that the Slack invite link had expired again and needed a refresh, a tiny recurring friction that every growing community knows well. Underneath the STS token and vended credentials questions raised by evaluators testing federated catalogs, the pattern was consistent: real users are kicking the tires on Polaris federation and reporting back what is unclear.
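The 403 idea Dmitri argued for can be sketched as follows. This is an illustration of the general pattern, not Polaris code: the `deny` function, the response shape, and the logger name are all hypothetical.

```python
import logging
import uuid
from typing import Dict, Tuple

logger = logging.getLogger("authz")  # hypothetical logger name


def deny(principal: str, reason: str) -> Tuple[int, Dict[str, str]]:
    """Build a 403 response that hides the reason behind a random reference."""
    reference_id = uuid.uuid4().hex
    # The full reason stays server-side, where operators can look it up.
    logger.warning(
        "access denied ref=%s principal=%s reason=%s",
        reference_id, principal, reason,
    )
    # The caller sees only a generic message and the opaque reference.
    return 403, {"error": "Forbidden", "reference": reference_id}


status, body = deny("alice", "missing read privilege on ns1.orders")
```

A user who hits the error can quote the reference to an administrator, who finds the real cause in the logs. Nothing in the response tells a probing caller which privilege or which entity caused the denial.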

Apache Arrow

Arrow's list was quieter this week, but two of its threads carry outsized significance for where data tooling is going. Wes McKinney posted an update on the Arrow conbench data and the conbench open source project, pointing the community to a rebuilt version at conbench-v2.arrow-dev.org. Conbench is the service that tracks Arrow's performance benchmarks over time and flags when a change makes something slower. The detail that makes this thread notable is how the rebuild happened. Wes said the summary of the work was written by Codex, the AI coding agent, and that much of the development ran unattended against a mandate to rebuild the backend. Rok Mihevc gave the kind of feedback that keeps a rebuild honest. He appreciated the darker color palette for late-night viewing but preferred the old interactive graph, which classified data points and drew a trendline with confidence bands. Wes agreed to bring the chart back in line with the old one and admitted he had not spent much effort on the interface yet. Antoine Pitrou listed what he cares about most in conbench: regression detection, which he called quite solid after a lot of tuning to the algorithm, and readable benchmark results. This thread is a small window into a larger shift. An AI agent rebuilt a core piece of an Apache project's infrastructure, and the humans reviewed it, pushed back on the parts that lost value, and merged the parts that worked. That is the collaboration model taking shape across the ecosystem, and it is worth watching closely.
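For readers who have not seen benchmark regression detection before, here is one common approach, sketched under stated assumptions. It is a plain z-score check against recent history and is not a description of Conbench's actual algorithm, which the thread does not detail; the function name and threshold are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_regression(history: Sequence[float], new_value: float,
                  threshold: float = 5.0) -> bool:
    """Flag a timing that sits far above the recent baseline.

    Assumes larger values are worse (for example, seconds per run).
    """
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs >= 2 points
    if spread == 0:
        return new_value > baseline
    return (new_value - baseline) / spread > threshold


history = [100.0, 101.0, 99.0, 100.0, 102.0]
slow_run = is_regression(history, 120.0)
normal_run = is_regression(history, 101.0)
```

The tuning Antoine mentioned is where the real effort goes in any scheme like this: how much history to look at, how to handle noisy machines, and how high to set the threshold so that real slowdowns surface without drowning maintainers in false alarms.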

It is worth sitting with why benchmarking infrastructure is the kind of thing a project guards so carefully. Arrow is the in-memory format that a huge slice of the data world uses to move columns of data between tools without copying and reformatting them. When a change makes Arrow even a few percent slower, that cost multiplies across every system that depends on it. Conbench is the early-warning system that catches those slowdowns before they ship. So when Wes handed much of the rebuild to Codex, he was handing an agent responsibility for a piece of infrastructure that protects the performance promises of the whole project. The fact that it worked, and that the review caught the places where the new interface lost useful detail like the classified points and confidence bands Rok wanted back, is a real data point about where agent-assisted maintenance stands today. The agent did the heavy lifting on an unglamorous backend rebuild, and the experienced maintainers decided what was good enough to keep. Neither replaced the other.

The second standout came from outside the usual contributor group. Sam Arch, a PhD student at Carnegie Mellon co-advised by Andy Pavlo and Jignesh Patel, announced an ADBC extension for DuckDB. ADBC is the Arrow Database Connectivity standard, a way for tools to move Arrow data in and out of databases without slow row-by-row conversion. Getting ADBC into DuckDB, the popular in-process analytics database, connects two fast-moving corners of the data world. Rusty Conover congratulated Sam on the release, praised the connection-profile integration, and said he plans to add support for it to his adbc_scanner project. Aldrin weighed in on the framing of a hand-coded claim in the announcement. The thread is short, but it signals healthy cross-pollination. Academic database research and the Arrow standards are meeting in a widely used tool, and the maintainers are already talking about how their projects connect.

The rest of Arrow's week was community maintenance, which matters more than it sounds. Robert Thomson reported on the project's use of ASF shared GitHub-hosted runners, noting that Arrow had dropped to twelfth in minutes consumed over the prior seven days, a real improvement, while framing runner usage as an ongoing discipline rather than a one-time fix. That echoes the same continuous-integration cost conversation happening in Iceberg, a shared pressure across the whole foundation. Nic Crane asked for extra help closing out old issues, describing automation that flags stale bug reports and the manual work she, Alenka, and Rok have been doing to clear the backlog. Ian Cook announced the biweekly Arrow community meeting for July 1 at 16:00 UTC. And Zehua Zou raised a cross-format question about allowing the VARIANT value field to be omitted, noting that parquet-format does not allow it while Arrow's documentation does. That last thread is a good bridge into Parquet, where the VARIANT type drew heavy attention this week.

Apache Parquet

Parquet spent the week wrestling with a question that sounds simple and is not: what should a version number mean? Daniel Weeks drove a long discussion on the future of Parquet versioning, one of the busiest threads of the week at seventeen messages. It grew out of a proposed change to how paths in the schema get handled, and it quickly became a debate about process. Daniel argued that the community should coordinate and agree on what belongs in a major version bump rather than forcing a new version every time an incompatible change lands. Micah Kornfield and Ed Seidl worked through the specifics, with Ed noting that his path-in-schema change was partly a test of the documented process for handling forward-incompatible changes. Andrew Lamb captured the core tension in a way worth repeating in plain terms. One option is to keep using version numbers, which people understand and which the rest of the industry uses, but which are a blunt instrument, since touching any single feature of a new version can seem to require the whole version. The other option leans on per-feature signaling, which is precise but less familiar. Antoine Pitrou and Ryan Blue added their perspectives, and the thread did not fully resolve, which is fine. Deciding what a version promises is the kind of question a format needs to answer carefully once, because everyone downstream lives with the answer.

That versioning debate had two direct offshoots. Micah Kornfield proposed moving parquet-format releases to semantic versioning, with the concrete idea of a major version bump every time a release includes an incompatible change. And Andrew Lamb reported progress on documenting which features live in which versions of Parquet, announcing that a merged pull request put a new explanatory page live on the website and that a second pull request renders the feature table automatically. This pairing is the practical answer to the abstract debate. If you cannot easily tell which features a version contains, the version number carries less meaning, so writing that mapping down and keeping it current is real progress.
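The semantic versioning rule Micah proposed reduces to a small decision, sketched here with a hypothetical helper. The version strings are made up for illustration and are not real parquet-format releases.

```python
def next_version(current: str, incompatible: bool, new_feature: bool) -> str:
    """Pick the next MAJOR.MINOR.PATCH version under semantic versioning."""
    major, minor, patch = (int(part) for part in current.split("."))
    if incompatible:
        # Any incompatible change forces a major bump and resets the rest.
        return f"{major + 1}.0.0"
    if new_feature:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


breaking = next_version("1.2.3", incompatible=True, new_feature=True)
additive = next_version("1.2.3", incompatible=False, new_feature=True)
bugfix = next_version("1.2.3", incompatible=False, new_feature=False)
```

The rule is mechanical once a change is classified. The hard part, and the subject of the longer thread, is agreeing on what counts as incompatible and how often the community wants to pay for a major bump.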

The release process moved forward on a specific spec change. The vote on GH-583 to define ordering for INT96 timestamps, started by Divjot Arora, gathered binding plus-ones from Ryan Blue, Micah Kornfield, Gang Wu, and Daniel Weeks, with Andrew Lamb adding support. INT96 is an old ninety-six-bit timestamp type that Parquet inherited from its early days. It has been deprecated for years, but real files in the wild still use it, so pinning down exactly how those timestamps should sort matters for anyone reading legacy data. Micah noted that Ryan's support implied comfort with the Java implementation, whose review was still in progress, and that Ed Seidl was handling the Rust side. Andrew added that older versions of arrow-rs that panic on this data can get patched releases if the problem shows up in practice. Ryan was careful to separate his vote on the spec direction from the still-open Java code review. This is a good example of how Apache projects split the question of what to build from the question of whether a specific implementation is ready.
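A short sketch shows why INT96 ordering needs to be written down at all. It assumes the conventional layout of these values, eight little-endian bytes of nanoseconds within the day followed by four little-endian bytes of Julian day number. The helper names are hypothetical, and the sketch does not claim to reproduce the exact ordering rule the vote adopted, which the thread summary does not state.

```python
import struct
from typing import Tuple


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp as 8 bytes of nanos-of-day, then 4 bytes of Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def chronological_key(raw: bytes) -> Tuple[int, int]:
    """Sort key that compares the day first, then the time within the day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


earlier = encode_int96(2440588, 1)  # one nanosecond into Julian day 2440588
later = encode_int96(2440589, 0)    # midnight at the start of the next day

by_bytes = sorted([earlier, later])                        # raw byte order
by_time = sorted([later, earlier], key=chronological_key)  # time order
```

Sorting the raw bytes puts `later` first, because the time-of-day bytes come before the day bytes in the encoding. Any min/max statistics computed by naive byte comparison would therefore be wrong, which is why a spec-level definition of ordering matters for readers of legacy files.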

The most exciting announcement came from Gunnar Morling, who shared the release of Hardwood 1.0, a new Parquet reader for the JVM. Hardwood is built from the ground up to keep external dependencies to a minimum, with a writer planned to follow. Fewer dependencies means fewer transitive security vulnerabilities to chase, a point Steve Loughran made right away in his congratulations. Steve, who has an open pull request to harden variant parsing, asked how Hardwood handles the VARIANT type. Gunnar ran Steve's test fixtures through Hardwood and reported that it rejects the malformed cases except for one depth case that lacks a guard so far. Pritam Pan asked whether Hardwood integrates with Apache Spark down the road, and Gunnar said he was not aware of any such discussion and did not want to speak for the Spark side. A fresh, lean reader for one of the most widely used file formats in data is good for the whole ecosystem, since it gives teams another well-built option and keeps the incumbents honest.

VARIANT hardening was a running theme. Steve Loughran opened a discussion on how deep a realistic variant depth is, tied to his pull request that adds shallow validation of variant inputs in parquet-java. The VARIANT type stores flexible, semi-structured data, which is powerful and also a place where malformed input can cause trouble if a reader trusts it blindly. Kurtis Wright asked a sharp question: is Parquet the right layer to build reader guardrails that writers can choose to ignore, or does that belong somewhere else? Gunnar Morling argued that a parser should be able to reject malformed payloads on its own, since not all Parquet use runs through Iceberg, so building the guardrails into the parser makes sure they apply everywhere. This connects to Kevin Liu and Micah Kornfield's separate thread on how older parquet-java readers should behave when they hit VARIANT columns, where version 1.15.x fails on a newer logical type. Together these threads show a format working out how to add powerful new types without breaking the readers already deployed across thousands of systems.
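The kind of guard being debated can be illustrated with a depth limit on nested data. This sketch uses ordinary Python lists and dicts as a stand-in for a decoded variant value; it does not parse the VARIANT binary encoding, and `exceeds_depth` is a hypothetical name, not a parquet-java or Hardwood API.

```python
from typing import Any


def exceeds_depth(value: Any, max_depth: int) -> bool:
    """Return True if containers nest deeper than max_depth.

    A top-level list or dict counts as depth 1. The walk is iterative, so a
    hostile, deeply nested input cannot exhaust the interpreter's call stack.
    """
    stack = [(value, 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars add no depth
        if depth > max_depth:
            return True
        stack.extend((child, depth + 1) for child in children)
    return False


sample = {"a": [1, {"b": 2}]}  # dict -> list -> dict: three levels

# Build a pathologically deep value without using recursion.
deep: Any = 0
for _ in range(10_000):
    deep = [deep]
```

With a limit of 3 the sample passes, with a limit of 2 it is rejected, and the 10,000-level value is rejected without a stack overflow. Steve's question about how deep a realistic variant is amounts to choosing `max_depth` so that legitimate data never trips the guard.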

Several more threads filled out a productive week. Rok Mihevc moved to introduce a FIXED_SIZE_LIST logical type based on benchmarks and design-doc feedback, giving Parquet a native way to describe lists with a fixed number of elements. Alkis Evlogimenos reported that the FILE proposal is in good shape, with path, size, offset, and etag in place and an active thread on adding content_type. Micah Kornfield nudged forward an AI tooling policy for Parquet, suggesting the community open a pull request with the current draft, a sign that Parquet, like others, is writing down how it wants AI-generated contributions handled. Russell Spitzer supported adopting AssertJ for test assertions as a gradual improvement for consistency. Jiayi Wang shared written recaps from the Parquet Footer Working Group's third session. And the community looked for a facilitator for the July 1 Parquet sync when Julien Le Dem flagged he will be out. For a format that many people think of as finished, Parquet is very much still evolving.

Apache DataFusion

DataFusion kept it focused this week with a clean release effort. Tim Saucer called a vote to release the DataFusion Python bindings, version 54.0.0, on release candidate two. DataFusion is a fast query engine written in Rust, and its Python bindings let data scientists and engineers drive that engine from Python without touching Rust. The vote passed with binding support from Matt Butrovich, who verified on macOS, Andrew Lamb, who ran the checks on an M3 Mac and reviewed the changelog, L. C. Hsieh, who verified on an M4 Mac, and Adrian Garcia Badaracco, who tested on an M4 MacBook Pro. Renato Marroquín Mogrovejo added non-binding support. Andrew also thanked a contributor named Nuno for helping review many of the pull requests in the release.

A short thread, but a healthy one, and it is worth explaining what the ceremony is doing for readers new to Apache. A release candidate is a proposed final build that has not shipped yet. Before it becomes official, members of the project download it, verify the cryptographic signatures and checksums, check the license files, and run the build on their own hardware. A binding vote comes from a member of the project's management committee (the PMC), whose plus-one counts toward the official tally, while a non-binding vote is a welcome check from anyone else in the community. The reason four people ran the same checks on four different Macs is that a release only earns trust when independent people confirm it builds and passes on machines the release manager does not control. This is the same discipline that caused Polaris to reject its 1.6.0 rc0 candidate over a missing artifact. The vote is not a rubber stamp. It is the community putting its name on the build.
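One of the verification steps, the checksum comparison, can be sketched with the standard library alone. This is a minimal illustration and not the DataFusion project's verification script; the function names are hypothetical, and the signature check is left out because it depends on an external tool such as GnuPG.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's digest to the published value, ignoring case and whitespace."""
    return sha512_of(path) == expected_hex.strip().lower()
```

A voter would run `matches_checksum` against the downloaded archive using the digest published alongside the candidate. A match shows the bytes arrived intact, and the separate signature check shows who produced them.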

The steady cadence of DataFusion releases is part of why the Rust data stack keeps gaining ground, and the Python bindings are the on-ramp that brings that speed to the analysts who never leave their notebooks. It also lines up with a quiet cross-project pattern this week. Iceberg cut a Rust release candidate, DataFusion cut a Rust-backed Python release, and the Arrow ecosystem moved ADBC into DuckDB. Rust keeps showing up as the language teams reach for when they want the speed of native code without a Java runtime, and Python keeps showing up as the surface those teams expose to their users. For anyone deciding what to build a data platform on, the message is that the fast open engines and the friendly high-level interface are no longer a trade-off. You can have both, and the release votes this week are the receipts that the pieces are production-ready.

Cross-Project Themes

Three patterns ran across the lists this week, and each one says something about where the open lakehouse is heading.

The first is a shift from building features to proving correctness. Iceberg's expressions and UDF votes, its conformance testing and shared test fixtures threads, and Parquet's fight over what a version number means all point the same direction. These projects have enough implementations and enough production users that the community can no longer trust everyone to interpret the spec the same way. So the work turns toward writing behavior down precisely, then building shared tests that prove the implementations agree. This is what maturity looks like in open standards. The exciting phase of adding capabilities gives way to the harder phase of guaranteeing that a table written anywhere reads correctly everywhere. For anyone betting a business on open formats, that guarantee is the whole point.
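The idea of shared tests that prove implementations agree can be sketched in a few lines. Everything here is hypothetical: the tuple encoding of expressions, the fixture list, and the two toy "engines" are invented for illustration and are not the Iceberg expressions spec or any project's actual test suite. The rule that a comparison against a null value does not match is likewise just an assumption baked into these fixtures.

```python
import operator

# Hypothetical comparison operators for the toy expression encoding.
OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

# Shared fixtures: (expression, input row, expected result).
# Every implementation is checked against this same list.
FIXTURES = [
    (("eq", "region", "west"), {"region": "west"}, True),
    (("eq", "region", "west"), {"region": "east"}, False),
    (("gt", "year", 2025), {"year": 2026}, True),
    (("lt", "year", 2025), {"year": None}, False),  # assumed null rule
]


def engine_a(expr, row):
    """Toy engine that follows the assumed rule: null never matches."""
    op, column, literal = expr
    value = row[column]
    if value is None:
        return False
    return OPS[op](value, literal)


def engine_b(expr, row):
    """Toy engine with a deliberate drift: it treats null as zero."""
    op, column, literal = expr
    value = row[column]
    if value is None:
        value = 0
    return OPS[op](value, literal)


def run_conformance(evaluate):
    """Return every fixture where the engine disagrees with the expected result."""
    failures = []
    for expr, row, expected in FIXTURES:
        actual = evaluate(expr, row)
        if actual != expected:
            failures.append((expr, row, expected, actual))
    return failures
```

Running `run_conformance(engine_a)` returns an empty list, while `run_conformance(engine_b)` reports the null-handling fixture. Neither toy engine looks wrong on its own; the disagreement only surfaces because both are held to the same written-down cases, which is the point of a shared suite.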

The second theme is the quiet arrival of AI agents inside the projects themselves. Wes McKinney rebuilt Arrow's conbench service with heavy help from Codex and said so openly, then let human reviewers push back on the parts that lost value. Parquet started drafting an AI tooling policy. These are not press releases about AI. They are working engineers folding agents into real maintenance and being transparent about it, while keeping human judgment in the loop. The lakehouse community is also the community building the data layer that agents run on, so it makes sense that these projects are among the first to work out the norms for agent-assisted contribution. Expect more projects to write down how they want AI-generated code reviewed and merged.

The third theme is shared infrastructure discipline. Iceberg and Arrow both spent real energy this week on trimming their use of the ASF pool of GitHub-hosted continuous-integration runners. Iceberg moved to run fewer Java versions on pull requests. Arrow celebrated dropping to twelfth in minutes consumed. This is a foundation-wide pressure, since every Apache project draws from the same limited pool, and it shows these communities acting like responsible neighbors rather than maximizing their own convenience. The VARIANT type also crossed project lines all week, showing up in Arrow, in Parquet's hardening and compatibility threads, and in the format-level question of whether a value field can be omitted. When a single type sparks parallel discussions on three lists, it is a sign that semi-structured data is becoming a first-class citizen of the lakehouse, and the community is working out the rules together.

Looking Ahead

Watch the Iceberg conformance testing and iceberg-testing threads to see whether the two converged proposals produce a real shared test suite, since that is a meaningful step toward guaranteed cross-engine correctness. Keep an eye on Polaris 1.6.0, where a fresh release candidate should follow the failed rc0 vote, and on the semantic layer work now that the OSI Semantic Model API specification is up for a vote. Parquet's versioning debate is unlikely to end quickly, so the semantic-versioning proposal and the feature-documentation page are the concrete pieces to track. And the VARIANT hardening work across Parquet and Arrow is worth following for any team that stores semi-structured data, since the guardrails being designed now will shape how safely that data moves between tools. The unifying story is a set of projects growing up together, trading raw feature velocity for the correctness and stability that production workloads demand.

