<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Merced</title>
    <description>The latest articles on DEV Community by Alex Merced (@alexmercedcoder).</description>
    <link>https://dev.to/alexmercedcoder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F288069%2Fb20116a9-b178-4ab1-bcb0-8aa28ed732b0.png</url>
      <title>DEV Community: Alex Merced</title>
      <link>https://dev.to/alexmercedcoder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexmercedcoder"/>
    <language>en</language>
    <item>
      <title>Apache Data Lakehouse Weekly: June 9 to 16, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Tue, 16 Jun 2026 21:05:51 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-june-9-to-16-2026-5dj1</link>
      <guid>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-june-9-to-16-2026-5dj1</guid>
      <description>&lt;p&gt;This was a week of votes that passed and arguments that opened. Iceberg shipped three formal decisions while debating how to cut read latency for V4. Polaris worked through the unglamorous plumbing of error codes, persistence, and retention. Parquet reopened the oldest question in its history, what a version number should mean, while shipping a new release at the same time. Arrow advanced variant support and welcomed a new language binding. DataFusion grew its leadership and set a roadmap. Underneath all of it ran one shared headache that touched four projects at once: who pays for the CI compute.&lt;/p&gt;

&lt;p&gt;Read the five lists together and a theme jumps out. This was a maturation week, not a launch week. Almost nothing shipped that a marketing team would put on a banner. What shipped instead was the deep, careful work that decides whether these projects can hold up under production workloads for the next several years. Error codes, retention policies, field-id semantics, version numbering, test backends, CI budgets. None of it is exciting. All of it is the difference between software you experiment with and software you bet a business on. If you run a lakehouse, weeks like this one are the weeks that earn your trust, even though they make for quiet headlines.&lt;/p&gt;

&lt;h2&gt;Apache Iceberg&lt;/h2&gt;

&lt;p&gt;Iceberg spent the week closing votes. Three separate decisions reached a result, each one a small commitment that shapes the format and its implementations for years.&lt;/p&gt;

&lt;p&gt;The C++ implementation crossed a milestone. After a bumpy run through release candidates, &lt;a href="https://lists.apache.org/thread/sho8dfg1bc868fvtkp0fd13cbjvcgg1j" rel="noopener noreferrer"&gt;Apache Iceberg C++ 0.3.0 RC3 passed its vote&lt;/a&gt;, with Junwang Zhao driving the release and Gang Wu thanking the voters once it cleared. Binding and non-binding plus-ones came in from Kevin Liu, Renjie Liu, Neelesh Salian, Alex Stephen, and others, several of whom ran the verify script on macOS before signing off. The road there ran through an earlier &lt;a href="https://lists.apache.org/thread/xxh9mxn8jpc1js6ycy86orvkn3xfg6b3" rel="noopener noreferrer"&gt;RC2 attempt&lt;/a&gt; that did not make it, which is the normal rhythm of a young codebase finding its footing. A native C++ Iceberg matters because it removes the JVM from the picture for engines and tools written in C++ and Rust, and it feeds directly into work like iceberg-cpp reaching V3 feature completeness.&lt;/p&gt;

&lt;p&gt;The spec gained a small but real clarification. Kevin Liu ran a &lt;a href="https://lists.apache.org/thread/tslzhzyszxvnd16j6c53w650mw0mzvk1" rel="noopener noreferrer"&gt;vote to clarify the day partition transform result type as date&lt;/a&gt;, and it passed with six binding and seven non-binding plus-ones. Fokko Driesprong, Szehon Ho, Amogh Jahagirdar, Gang Wu, and many others backed it. The change reads like a footnote, but ambiguity in a spec is where implementations drift apart. Pinning the day transform to produce a date type keeps every engine reading and writing the same thing. The same week, a &lt;a href="https://lists.apache.org/thread/hldq6hcy6x2ygqg6vtmo0lcxw58krldh" rel="noopener noreferrer"&gt;parallel discussion&lt;/a&gt; between Andrei Tserakhau and Kevin Liu picked at the related Avro schema question for the day partition field, the kind of follow-on detail that surfaces once you settle the type.&lt;/p&gt;

&lt;p&gt;The most forward-looking vote set up deletion vectors for the next era. Ryan Blue's &lt;a href="https://lists.apache.org/thread/d4s147sd8r0jzp5o9pndjbkr4y5r74xz" rel="noopener noreferrer"&gt;vote to add the draft bitmap spec to git&lt;/a&gt; passed with thirteen plus-ones, nine of them binding, from Daniel Weeks, Anoop Johnson, Szehon Ho, Fokko Driesprong, and more. Getting a draft into the repository does not finalize anything. It gives the community a shared artifact to argue over instead of scattered ideas. Bitmaps underpin the deletion vector approach that makes row-level deletes fast, and a written draft is how that work moves from concept to spec.&lt;/p&gt;

&lt;p&gt;While those votes closed, the V4 performance conversation got concrete. Varun Lakhyani opened a &lt;a href="https://lists.apache.org/thread/xc2yslqh90ygf4n2hj2nn5bynk5f8s6v" rel="noopener noreferrer"&gt;discussion about combining three GET calls for Parquet reads&lt;/a&gt;, the serial requests for root manifest, data file, and metadata that add latency on small files. Russell Spitzer called this critical for the Root Manifest and indexing work, noting that serial GETs on a small file are a latency killer when the whole point of V4 is to cut read latency. Daniel Weeks weighed in that the fix belongs at the FileIO layer rather than buried in Parquet-specific code, since encryption and metrics handling live there too. Spitzer added that an Iceberg-first solution is appealing, though Parquet Java itself could grow APIs to support the pattern the way the Rust and C++ implementations already do. This is the read-path work that decides whether V4 feels fast in practice.&lt;/p&gt;

&lt;p&gt;Schema evolution raised a thornier design question. Sung Yun opened a &lt;a href="https://lists.apache.org/thread/r3hvf2o8qo1zt45b81y0p25c06lxxv54" rel="noopener noreferrer"&gt;discussion about a write-path gap for field-id-bound policy during schema evolution&lt;/a&gt;, and Prashant Singh laid out the crux. Engines and catalogs resolve column names to field IDs before they persist a policy-to-table mapping, so attaching a policy needs the column to exist. Singh drew the parallel to how Iceberg assigns field IDs at table creation and treats them as the source of truth across renames. The group had modeled this exact scenario when designing ReadRestrictions, choosing to return a name and leave the metadata representation to the catalog. The debate is about where governance metadata lives and whether the spec should say more, a question that grows louder as policies and labels move into the catalog.&lt;/p&gt;

&lt;p&gt;Two REST catalog threads pushed in the same direction. Sung Yun and Alexandre Dutra discussed a &lt;a href="https://lists.apache.org/thread/wd79t4ocstqy4855ds9pqy764trlm8hv" rel="noopener noreferrer"&gt;REST spec change for passing arbitrary information to a request signer&lt;/a&gt;, and EJ Wang opened a &lt;a href="https://lists.apache.org/thread/shd3bmm15sry0gzy3xjycto063k8mbv7" rel="noopener noreferrer"&gt;thread on table and column label metadata in the REST catalog&lt;/a&gt;. Both reflect a catalog that keeps taking on more responsibility for security and governance, not just table location.&lt;/p&gt;

&lt;p&gt;The Spark connector story got a planning thread. Anurag Mantripragada, Cheng Pan, and Szehon Ho worked through a &lt;a href="https://lists.apache.org/thread/o27vy0wbojp0yorns09kng8nmks8xzzj" rel="noopener noreferrer"&gt;Spark versioning strategy for accelerated Spark releases&lt;/a&gt;. The group leaned toward supporting the last Spark LTS plus the two latest minors, with a plan to merge Spark 4.2 support first and start a separate vote to drop Spark 4.0 after a community sync. Cheng Pan tied the cadence question to Iceberg's own release rhythm, since a roughly three-month cycle keeps the supported version range manageable. Keeping pace with Spark without carrying every old version forever is a steady maintenance tax, and the community is choosing how to pay it.&lt;/p&gt;

&lt;p&gt;Maintenance and patches rounded out the week. Amogh Jahagirdar and others discussed &lt;a href="https://lists.apache.org/thread/326gz77wzgcw3gb325214fgq8nmfgs7q" rel="noopener noreferrer"&gt;1.11.1 and 1.10.3 patch releases&lt;/a&gt;, with the 1.11.x branch created the prior week and several correctness fixes lined up as backport candidates. Matt Butrovich pointed to a green PR fixing manifest delete file size after a table rewrite, and Amogh added a fix for default value handling against Parquet metrics. These are the unglamorous correctness fixes that keep production tables trustworthy. The community also looked ahead socially, with a &lt;a href="https://lists.apache.org/thread/xrozhvso1p5fp586scbhd4b310ofzw5q" rel="noopener noreferrer"&gt;discussion about Iceberg Summit 2027&lt;/a&gt; drawing input from Danica Fine, Jean-Baptiste Onofré, Bill Zhang, and Kevin Liu.&lt;/p&gt;

&lt;p&gt;Step back and the Iceberg week tells a clear story about where the project sits. The Java implementation is in steady maintenance mode, shipping patch releases and trimming CI cost, while the energy moves to two frontiers. One frontier is V4, where read latency, the root manifest, deletion vectors, and indexing all aim to make large tables fast at query time. The other is the language frontier, where C++ and Rust implementations grow toward feature parity so engines outside the JVM can read and write Iceberg natively. The day-partition clarification, the bitmap draft, and the field-id policy debate all feed those frontiers. None of them grab headlines on their own. Together they decide whether Iceberg stays the default open table format as the workload mix shifts toward low-latency and AI-driven reads.&lt;/p&gt;

&lt;p&gt;The catalog threads deserve a second look because they point at a bigger shift. The &lt;a href="https://lists.apache.org/thread/wd79t4ocstqy4855ds9pqy764trlm8hv" rel="noopener noreferrer"&gt;REST signer change&lt;/a&gt; and the &lt;a href="https://lists.apache.org/thread/shd3bmm15sry0gzy3xjycto063k8mbv7" rel="noopener noreferrer"&gt;label metadata thread&lt;/a&gt; both move responsibility into the catalog rather than the table files. A few years ago the table format held almost everything and the catalog was a thin pointer. The center of gravity is moving. Security policy, labels, request signing, and governance increasingly live at the catalog layer, which is exactly why Polaris had such a busy week. The two projects are growing into each other.&lt;/p&gt;

&lt;p&gt;It helps to say plainly what V4 is chasing, since several threads orbit it. The goal is low-latency reads on large tables. Today an Iceberg read can mean several serial round trips: fetch the root metadata, fetch a manifest, fetch the data file, and so on. On a small file over object storage, those serial GETs dominate the time, because the network round trip costs more than the actual read. The call-combining discussion, the root manifest work, deletion vectors, and indexing all attack pieces of that problem. Combine the calls and you cut round trips. Get the root manifest right and you find the data you need faster. Use deletion vectors and applying row-level deletes at read time becomes a cheap bitmap check instead of a slow merge. The reason this matters now is that the workload mix is shifting. More queries are interactive, more are driven by AI agents firing many small reads, and more expect sub-second answers. Iceberg won the batch analytics world. V4 is the bet that it can win the low-latency world too, and the threads this week are the early, unglamorous moves in that game.&lt;/p&gt;

&lt;h2&gt;Apache Polaris&lt;/h2&gt;

&lt;p&gt;Polaris had the busiest week of the five projects, and almost all of it was the careful, detailed work of turning a young catalog into a dependable one. The threads were less about big new features and more about getting the hard parts right.&lt;/p&gt;

&lt;p&gt;The most active debate was also the smallest in scope: what status code to return when a table or view rename conflicts. The &lt;a href="https://lists.apache.org/thread/x0lt68jkw5prm8jgv5cbs5jjoo46wh1o" rel="noopener noreferrer"&gt;thread on rename conflict status codes&lt;/a&gt; ran long because the choice carries real weight for clients. Dmitri Bourlatchkov pushed to start with a simple 503, and Yufei Gu leaned the same way after reading the RFCs. Nándor Kollár argued that both 429 and 503 are imperfect, since 429 signals a client sending too many requests while the rename conflict is more of a server-side condition. The group settled toward 503 plus retry as the least-bad option, with server-side retries left as a later addition if users complain. A status code looks trivial until you remember that every client in the ecosystem has to handle whatever you pick.&lt;/p&gt;

&lt;p&gt;A larger architecture thread tackled forwarding for Iceberg scan and commit operations. Alexandre Dutra and Romain Manni-Bucau worked through the &lt;a href="https://lists.apache.org/thread/v4snn41n5d60q0h9wc1z606zqbkys2m5" rel="noopener noreferrer"&gt;design for forwarding use cases&lt;/a&gt;, including a side debate about GraalJS, which entered the picture because of Ranger and carries an 85-megabyte cost against an already large Polaris image. Dutra noted the long-term plan removes GraalJS once Ranger moves to a sidecar-style deployment. The detail matters because catalog image size and startup cost shape how cheaply Polaris runs in a container, and every dependency is a tradeoff between capability and weight.&lt;/p&gt;

&lt;p&gt;Persistence drew two connected threads. Alexandre Dutra opened a &lt;a href="https://lists.apache.org/thread/5wokvn7pbwtpqr5c8pq5vh7s54qnknp7" rel="noopener noreferrer"&gt;discussion on supporting H2 in persistence&lt;/a&gt; and a related one on &lt;a href="https://lists.apache.org/thread/75rmg8970l63o4ozj463v607mwqxhcl8" rel="noopener noreferrer"&gt;deprecating TreeMapMetaStore&lt;/a&gt;. Russell Spitzer gave the history: TreeMapMetaStore looked similar to FoundationDB from an API view, so it served as a test backend while the original backend was built. The group reached general agreement to deprecate it in favor of a JDBC plus H2 solution for tests, with Dutra flagging a few tricky spots in polaris-core where tests lean on TreeMapMetaStore as a convenience with no obvious replacement. Cleaning up test infrastructure is the kind of work that pays off invisibly, by making every future change easier to verify.&lt;/p&gt;

&lt;p&gt;Retention and observability got attention too. Yong Zheng and Adnan Hemani discussed a &lt;a href="https://lists.apache.org/thread/1l7w1xozoyyko23zbz6fz2brb0oxx1mw" rel="noopener noreferrer"&gt;mechanism to purge the events and metrics table&lt;/a&gt;, with Hemani stressing that retention boundaries are necessary for the events system to scale and that maintenance jobs must always respect pre-set limits so an admin does not accidentally delete data. Zheng planned to start with generic cronjob support in Helm so newer maintenance jobs plug in cleanly. These threads connect to a broader push around metrics reporting and REST endpoints for table metrics and events, where Dmitri Bourlatchkov, EJ Wang, and Yufei Gu have been shaping how Polaris exposes operational data.&lt;/p&gt;

&lt;p&gt;Two threads pointed at where Polaris wants to grow. Adam Christian and Adnan Hemani advanced a &lt;a href="https://lists.apache.org/thread/ykvw7zsmlvg7rfs22msfd5kbo0d2j0ot" rel="noopener noreferrer"&gt;proposal for semantic layer support&lt;/a&gt;, working through how a Dataset model nests under a Table or View and how its descriptions relate to Iceberg table property comments. A semantic layer in the catalog is a notable ambition, since it moves Polaris past pure metadata toward business meaning. Separately, EJ Wang and Adnan Hemani continued the &lt;a href="https://lists.apache.org/thread/21fvd8z3vgpbykpm0rrs6wn4notdmkbn" rel="noopener noreferrer"&gt;OpenLineage proposal&lt;/a&gt;, agreeing to preserve the endpoint shape OpenLineage clients expect while clarifying the provider boundary behind it. Wang framed the lineage work as one usable vertical slice rather than independently mergeable parts that do nothing on their own.&lt;/p&gt;

&lt;p&gt;Scale and security threads filled out the week. A GitHub-sourced &lt;a href="https://lists.apache.org/thread/v2sl0vbjp41hlpynvmjdc46vsfghr7c3" rel="noopener noreferrer"&gt;discussion on the feasibility of one realm per tenant at 10,000 tenants&lt;/a&gt; tested how Polaris multi-tenancy holds up at scale, and a &lt;a href="https://lists.apache.org/thread/lbq71qdgpkjnp5t4hccg7r11kh1s0wwf" rel="noopener noreferrer"&gt;thread on a GCP counterpart to AWS STS session tags&lt;/a&gt; worked through credential vending across clouds. The cloud-portability question keeps recurring because a catalog that only works well on one cloud is a catalog with a ceiling.&lt;/p&gt;

&lt;p&gt;The metrics and events work ran deeper than one thread. Alongside the purge discussion, Dmitri Bourlatchkov, EJ Wang, and Yufei Gu shaped a &lt;a href="https://lists.apache.org/thread/41dn9n7dzgsw0on9z954coywt8wp5g9y" rel="noopener noreferrer"&gt;proposal for REST endpoints for table metrics and events&lt;/a&gt; and a related &lt;a href="https://lists.apache.org/thread/23lnmvx71hqd20mcrxpmh7hh5py7h9pn" rel="noopener noreferrer"&gt;discussion on filters for Iceberg metrics reporting&lt;/a&gt;. Read these together and a picture forms. Polaris is building a full observability story: collect metrics and events, filter what gets reported, expose it through REST, and purge it on a retention schedule. That is the difference between a catalog you can run as a hobby and a catalog you can run as production infrastructure with audit and capacity planning built in.&lt;/p&gt;

&lt;p&gt;Governance of the project itself drew a long thread. The &lt;a href="https://lists.apache.org/thread/lgrmn61coqw6c17f1q86gwpfyhhx91go" rel="noopener noreferrer"&gt;discussion about actions on the merge button&lt;/a&gt; ran past twenty messages, with Adnan Hemani, Alexandre Dutra, Jean-Baptiste Onofré, and Robert Stupp working through how the project handles its GitHub merge workflow. A connected &lt;a href="https://lists.apache.org/thread/4loptlrl0pjmw4vmlcd24pwks89zctv2" rel="noopener noreferrer"&gt;thread on fine-grain branch and tag creation control&lt;/a&gt; sorted out who can do what in the repository. These process threads look like housekeeping, but a young top-level project has to write down its rules, and the time spent here saves friction later. The &lt;a href="https://lists.apache.org/thread/wcnlt7jz5flbo80nzkhl7hg3molr74d4" rel="noopener noreferrer"&gt;discussion on multiple StorageConfigurationInfos per catalog&lt;/a&gt; between Alexandre Dutra, Dmitri Bourlatchkov, and Robert Stupp rounded out the architecture work, tackling how one catalog handles more than one storage backend.&lt;/p&gt;

&lt;p&gt;The throughline for Polaris is maturation. Almost none of this week's work was a flashy new feature. It was error codes, persistence cleanup, retention policy, observability endpoints, multi-cloud credentials, and repository governance. That is exactly the work a catalog has to do to earn production trust. The community is choosing depth over breadth right now, and for anyone planning to run Polaris as their lakehouse catalog, that is the right order of operations.&lt;/p&gt;

&lt;p&gt;The semantic layer proposal is the one thread that points the other way, toward ambition rather than hardening, and it is worth watching for what it signals. A catalog that only tracks tables and their locations is a metadata store. A catalog that understands datasets, their descriptions, and how they nest under tables and views is starting to hold business meaning, not just physical layout. If Polaris follows that thread, it stops competing only with other catalogs and starts touching the territory of semantic layers and metrics stores. That is a big stretch for a young project, and the careful way Adam Christian and Adnan Hemani worked through how a Dataset model relates to Iceberg table property comments suggests the community knows it. Pair the semantic layer ambition with the lineage work, and you can see Polaris reaching to be the place teams ask not just where a table lives but what it means and where its data came from. Whether it gets there is a multi-quarter question. That it is trying tells you how much the catalog layer is heating up.&lt;/p&gt;

&lt;h2&gt;Apache Arrow&lt;/h2&gt;

&lt;p&gt;Arrow ran quieter than Iceberg or Polaris this week, but its threads carried weight that reaches across the ecosystem.&lt;/p&gt;

&lt;p&gt;Variant type support was the headline. Gang Wu reported that &lt;a href="https://lists.apache.org/thread/045vmh5lfy7to3otx71yq0l3mbzmb0q6" rel="noopener noreferrer"&gt;several efforts are working on the variant type in Arrow C++&lt;/a&gt;, with his colleague Zehua having worked on it for a while and iceberg-cpp depending on it to reach V3 feature completeness. Neelesh Salian welcomed getting variant into Arrow C++ so downstream projects benefit, and Micah Kornfield joined the discussion. Gang made a careful point about AI-generated code: variant is a complex feature that demands full spec compliance and native C++ performance, and it takes time to meet the Arrow C++ bar even when models produce decent code. The thread also surfaced a coordination problem, since duplicate efforts were underway, and the goal was to collaborate rather than build the same thing twice. Variant is the connective tissue here, a semi-structured type that Parquet, Iceberg, and Arrow all need to agree on, so getting the Arrow C++ implementation right unblocks the whole chain.&lt;/p&gt;

&lt;p&gt;Arrow also gained a new language binding. Sutou Kouhei confirmed that the &lt;a href="https://lists.apache.org/thread/p0jplofy2gtyp8l5t6z23c56qjnj10j2" rel="noopener noreferrer"&gt;Arrow Erlang repository transferred to apache/arrow-erlang&lt;/a&gt;, following the donation vote, with Benjamin Philip handling the transfer and Kou preparing the repository for its next steps. A new binding widens Arrow's reach into the Erlang and Elixir world, which has a strong community in telecom and distributed systems. Every binding turns Arrow from a library into a lingua franca that more languages can share without copying data.&lt;/p&gt;

&lt;p&gt;The third Arrow thread was an infrastructure one shared with the rest of the foundation, covering &lt;a href="https://lists.apache.org/thread/lxtxlpln9nnsdz52rgozoq48lx3ntllo" rel="noopener noreferrer"&gt;consumption of ASF shared GitHub-hosted runners&lt;/a&gt;. Antoine Pitrou, Robert Thomson, and Sutou Kouhei worked through how much CI compute Arrow draws. That theme appears again below, because it hit nearly every project at once.&lt;/p&gt;

&lt;p&gt;Arrow's quieter week still carries outsized weight for the stack. Arrow is the in-memory format that lets these projects pass data without serializing and copying at every boundary. When an Iceberg reader hands columns to a query engine, Arrow is often the shape those columns take. So an Arrow decision about variant is not an Arrow-only decision. It sets the in-memory representation that Parquet readers, Iceberg engines, and DataFusion query plans all inherit. Gang Wu's caution about meeting the C++ performance bar reflects that responsibility. A slow or incorrect variant in Arrow C++ would ripple into every tool that depends on it. The Erlang binding points the other way, outward, growing the set of languages that can speak Arrow natively and share data with the rest of the ecosystem without a translation tax.&lt;/p&gt;

&lt;p&gt;The point Gang Wu made about AI-generated code deserves a moment on its own, because it captures a real tension in open source right now. Models can produce code that looks correct and even passes a first read. But a feature like variant has to match a written spec exactly and run at native C++ speed, and clearing that bar takes human review, benchmarking, and iteration that a quick generation does not provide. Arrow holds that line because everything downstream inherits its mistakes. It is a useful reminder that the hardest part of adding a feature to foundational software is not writing the first version. It is making the version correct and fast enough that thousands of dependent projects can trust it without checking. That standard is why the variant work moves deliberately, and why moving deliberately is the right call.&lt;/p&gt;

&lt;h2&gt;Apache Parquet&lt;/h2&gt;

&lt;p&gt;Parquet had the single most active thread of any project this week, and it was a big one: the future of how Parquet versions itself. The format also shipped a release in the middle of the debate, which made for a fitting contrast between long-term design and near-term delivery.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://lists.apache.org/thread/8qjyqrbkwj14hrpk8f806bwy2k3m3ds1" rel="noopener noreferrer"&gt;discussion on the future of Parquet versioning&lt;/a&gt; ran past sixty messages and pulled in Russell Spitzer, Micah Kornfield, Andrew Lamb, Antoine Pitrou, Daniel Weeks, and many more. The core tension is old and real. Parquet has added features faster than its version story can describe them, and readers need a clear way to know which features a file uses. Micah Kornfield floated splitting the idea into two notions: a primary specification version that risks using features not yet widely adopted, and presets that give users a different way to configure feature bundles. Russell Spitzer argued the group is overcomplicating it, since everyone understands versions, and urged picking something simple and moving on rather than deliberating further. His point was practical: the worst case is making a different choice later, which beats sitting stuck and blocking progress on new encodings and footers. This debate decides how the ecosystem talks about Parquet capability for years, so the heat is earned.&lt;/p&gt;

&lt;p&gt;A connected thread tried to make the current state legible. Andrew Lamb and Antoine Pitrou worked on &lt;a href="https://lists.apache.org/thread/nnt5lv0cl9gl45kp86x315pky0ocvtxm" rel="noopener noreferrer"&gt;documenting which features land in which versions of Parquet&lt;/a&gt;, the kind of reference that turns tribal knowledge into something a new implementer can read. The &lt;a href="https://lists.apache.org/thread/jcr0qz58j4k6p5zx0ylys1t88bh2gjf3" rel="noopener noreferrer"&gt;Parquet Footer Working Group held its second session&lt;/a&gt;, with Antoine Pitrou and Jiayi Wang continuing work on footer design, which ties directly to the read-latency goals showing up over in Iceberg.&lt;/p&gt;

&lt;p&gt;While the versioning debate raged, the format shipped. Gang Wu drove the &lt;a href="https://lists.apache.org/thread/xg4qv54q6hf4j3no9q827mz0yc31vxnb" rel="noopener noreferrer"&gt;vote for Parquet Format 2.13.0 RC0&lt;/a&gt;, which passed with three binding plus-ones from Micah Kornfield, Andrew Lamb, and Gang, plus three non-binding from Neelesh Salian, Ed Seidl, and Russell Spitzer. Ed Seidl's note that the release brings usable float statistics, something the community waited on, captured the practical payoff. The release was &lt;a href="https://lists.apache.org/thread/7jgphmyf6hpyzbbgzd0bjq9kg502oswr" rel="noopener noreferrer"&gt;announced&lt;/a&gt; once it cleared. A release that lands mid-debate is a healthy sign, since it shows the project can ship incremental value while it argues about the larger structure.&lt;/p&gt;

&lt;p&gt;New logical types kept coming. Burak Yavuz moved the &lt;a href="https://lists.apache.org/thread/zc11dj76wx3x6dn55y73xzqdnt63whwo" rel="noopener noreferrer"&gt;File logical type forward&lt;/a&gt;, submitting reference implementation PRs against parquet-format, parquet-java, and arrow-rs after the design doc settled, with Daniel Weeks recapping the discussion around the metadata and content_type fields. Rok Mihevc opened a &lt;a href="https://lists.apache.org/thread/hckrr21z7t4db2dv76dvxwwyhpnknzrn" rel="noopener noreferrer"&gt;discussion to introduce a FIXED_SIZE_LIST logical type&lt;/a&gt;, useful for fixed-length vectors of the kind that show up everywhere in machine learning feature data. Will Edwards and Jiayi Wang also dug into a &lt;a href="https://lists.apache.org/thread/88cbpnbpvo4q83xgnnn3nrsr2dc9xord" rel="noopener noreferrer"&gt;clarification on row-group and column-chunk layout&lt;/a&gt;, and an &lt;a href="https://lists.apache.org/thread/xj21vdx7wgcz0k7bp43qmdndr6nkt3yx" rel="noopener noreferrer"&gt;INT96 statistics discussion&lt;/a&gt; drew Ryan Blue, Ed Seidl, and others. The steady stream of logical types shows Parquet adapting to AI-era data shapes, where fixed-size vectors and richer file references are common.&lt;/p&gt;

&lt;p&gt;Look closer at those logical types and you can read where the data world is heading. A File logical type lets a Parquet column point at an external file with a content type attached, which is how a table starts to hold images, audio, PDFs, and other unstructured payloads next to its scalar columns. A FIXED_SIZE_LIST type stores a vector of known length, which is exactly the shape of an embedding. Put the two together and Parquet is quietly growing the vocabulary it needs to store the inputs and outputs of machine learning, not just the rows of a sales report. The format that won the analytics world by being a fast columnar store is stretching to hold the messy, high-dimensional data that AI workloads run on. That is a deliberate direction, and the people doing the work, Burak Yavuz and Rok Mihevc among them, are building it one careful logical type at a time.&lt;/p&gt;

&lt;p&gt;The versioning debate deserves a last word because it is really a debate about trust. A Parquet file written today might be read five years from now by a tool nobody has built yet. The version story is the promise that file makes to its future readers about which features they need to understand it. Micah Kornfield's split between a spec version and presets tries to separate two questions that got tangled: what a file can contain versus what a given writer chooses to turn on. Russell Spitzer's counter is that perfect is the enemy of shipped, and a clear-enough answer now beats a perfect answer that arrives after another year of new encodings piling up undescribed. Both are right, which is why the thread ran past sixty messages. The resolution will shape how every engine in the ecosystem advertises and detects Parquet capability, and that is worth getting close to right even under deadline pressure.&lt;/p&gt;

&lt;p&gt;The contrast inside the Parquet week is the real lesson. The project argued for sixty-plus messages about a deep structural question while simultaneously shipping 2.13.0 with usable float statistics and advancing two new logical types, with the File type already in reference implementation PRs across three repositories. A less healthy project would let the big debate freeze the small deliveries. Parquet kept both moving. That ability to ship incremental value while wrestling with long-term design is what separates a format people depend on from one they merely tolerate.&lt;/p&gt;

&lt;h2&gt;Apache DataFusion&lt;/h2&gt;

&lt;p&gt;DataFusion's week was about people and direction. The project grew its leadership and set its sights on the back half of the year.&lt;/p&gt;

&lt;p&gt;The headline was leadership growth. Andrew Lamb announced that &lt;a href="https://lists.apache.org/thread/jxqsywho08h1qr2v1ctn0386xhlgjqbc" rel="noopener noreferrer"&gt;Matt Butrovich joined the DataFusion PMC&lt;/a&gt;, drawing congratulations from Bruce Ritchie, Andy Grove, and a long list of contributors. Lamb's playful aside, that this time he really did mean PMC, hinted at the usual good-natured confusion that comes with back-to-back committer and PMC member announcements. Neil Conway also drew recognition across &lt;a href="https://lists.apache.org/thread/q5jjms2d6x16367kq5tptrmrs267fsvh" rel="noopener noreferrer"&gt;committer&lt;/a&gt; and &lt;a href="https://lists.apache.org/thread/n3jz26fgyjbjm7xgosc7ob26bf0cmq9m" rel="noopener noreferrer"&gt;PMC&lt;/a&gt; threads the same week. A project that keeps promoting active contributors is a project with a healthy pipeline, and DataFusion's steady cadence of new committers and PMC members is a strong signal of that health.&lt;/p&gt;

&lt;p&gt;Direction came through two threads. Andrew Lamb opened a &lt;a href="https://lists.apache.org/thread/x2k66nv46289ofcnlntcrv0gy83w1g8g" rel="noopener noreferrer"&gt;discussion to coordinate the 2026 Q3 to Q4 roadmap&lt;/a&gt;, inviting the community to use a GitHub tracking issue to weigh in on where the project should go. Lamb also ran a &lt;a href="https://lists.apache.org/thread/hg0qpn348ow9x5xd6o86shqhxw4rbh7v" rel="noopener noreferrer"&gt;crowdsourcing thread for the ASF board report&lt;/a&gt;, the routine governance work that keeps a top-level project accountable. The community even fielded a &lt;a href="https://lists.apache.org/thread/639048t3w4xhyz3pwwwxxdcp4k8o7b0f" rel="noopener noreferrer"&gt;PlusOne.apache.org interview thread&lt;/a&gt; with Rich Bowen, the kind of outreach that tells the broader foundation what DataFusion is up to. For a query engine that more lakehouse tools build on every quarter, a clear roadmap is a gift to everyone downstream who plans around it.&lt;/p&gt;

&lt;p&gt;If you have not tracked DataFusion closely, here is why its quiet governance week still matters. DataFusion is a query engine written in Rust, built on Arrow, that other projects embed to run SQL and DataFrame workloads without writing an execution engine from scratch. It is the engine inside a growing list of databases and tools, which means its roadmap is not an internal matter. When DataFusion decides what to build in the back half of 2026, it sets the menu of features that every downstream product inherits. A team shipping a new analytics database on top of DataFusion plans its own year around that roadmap. So the Q3-to-Q4 thread, dull as a planning document sounds, is one of the more widely felt decisions in the Rust data world.&lt;/p&gt;

&lt;p&gt;The people story matters for the same reason. A query engine is only as healthy as the bench of maintainers who can review and merge the hard changes. Promoting Matt Butrovich to the PMC and recognizing Neil Conway across committer and member threads widens that bench. Each new maintainer is one more person who can shepherd a tricky optimizer change or a new operator without waiting on a single overloaded reviewer. For downstream projects betting their execution layer on DataFusion, the depth of that maintainer pool is a risk metric, and this week it got a little deeper. The unglamorous work of governance and promotion is how an embedded engine earns the trust to sit at the center of other people's products.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;Three patterns connected the lists this week, and each tells you something you cannot see by reading any single project.&lt;/p&gt;

&lt;p&gt;The first was a shared infrastructure squeeze. Iceberg, Arrow, DataFusion, and Polaris all ran threads about &lt;a href="https://lists.apache.org/thread/ggxbscdokpjmq28hkkmmro5tb86h8w15" rel="noopener noreferrer"&gt;consumption of ASF shared GitHub-hosted runners&lt;/a&gt; in the same window, with Bob Thomson and Robert Thomson surfacing the question across projects. When four busy projects independently confront their CI compute budget at once, it is not four problems. It is one foundation-level constraint reaching every active community at the same time. Iceberg even ran a &lt;a href="https://lists.apache.org/thread/f98h92ly30fv4djsqdr7ynkk8w7g6jnr" rel="noopener noreferrer"&gt;parallel thread on reducing CI runner time by running JDK 21 only on main and nightly&lt;/a&gt;, a direct response to the same pressure. The lesson for anyone running a large open source project is that CI cost is now a first-class governance topic, not an afterthought.&lt;/p&gt;

&lt;p&gt;The second pattern was the variant type and read performance moving in lockstep across the stack. Arrow worked on variant in C++, Parquet shipped logical types and debated versioning, and Iceberg pushed on combining read calls and finalizing its bitmap draft. These are not separate efforts. A variant value written in Parquet, described by Arrow, and read by Iceberg has to mean the same thing at every layer, and iceberg-cpp reaching V3 feature completeness depends on Arrow C++ getting variant right. The read-latency work threads through too, since the Iceberg call-combining discussion and the Parquet footer working group both chase the same goal of fewer, faster reads. The lakehouse is one system wearing four project names, and weeks like this make the seams visible.&lt;/p&gt;

&lt;p&gt;The third pattern was maturation showing up everywhere at once. Iceberg shipped patch releases and trimmed CI. Polaris wrote retention policies and cleaned up test backends. Parquet documented which features live in which versions. DataFusion promoted maintainers and crowdsourced a board report. Read in isolation, each looks like ordinary housekeeping. Read together, they show a whole ecosystem crossing the same threshold in the same quarter, from the phase where you add features to the phase where you harden them. That synchronization is not a coincidence. These projects share contributors, share a release cadence, and share the same production users pushing them toward reliability. When the lakehouse stack matures, it tends to mature all at once, because the pressure comes from the same place: real workloads that need it to not break.&lt;/p&gt;

&lt;p&gt;A human pattern ran under all of them. The same contributors show up across projects. Gang Wu drove an Iceberg C++ release, voted on a Parquet release, and weighed in on Arrow variant. Andrew Lamb led DataFusion governance and voted on the Parquet release. Russell Spitzer argued Parquet versioning, shaped Iceberg's read path, and recalled Polaris history. Matt Butrovich fixed Iceberg manifests and joined the DataFusion PMC. The lakehouse ecosystem is held together by people who treat all of it as one project, and that overlap is why the formats stay compatible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means If You Run a Lakehouse
&lt;/h2&gt;

&lt;p&gt;Mailing list threads can feel far from a production system, so here is the practical read for anyone whose job depends on this stack.&lt;/p&gt;

&lt;p&gt;If you run Iceberg in production, the patch-release work matters most to you this week. The 1.11.1 and 1.10.3 fixes target real correctness bugs around manifest delete file sizes after rewrites and default value handling against Parquet metrics. Those are the kinds of bugs that quietly produce wrong results or bloated metadata, so plan to pick up the patch releases when they land. The V4 read-path work is further out, but it tells you where performance gains will come from next year, which is worth knowing if you are sizing hardware or planning a migration.&lt;/p&gt;

&lt;p&gt;If you are evaluating Polaris as your catalog, this week is reassuring. The observability work, retention policies, persistence cleanup, and multi-cloud credential threads are exactly the boxes a platform team checks before trusting a catalog with production tables. A year ago Polaris was a promising young project. The work landing now is the work that turns promising into dependable. If you held off because it felt early, the maturation curve is bending in the right direction.&lt;/p&gt;

&lt;p&gt;If you build on Parquet, which is nearly everyone, the versioning debate is worth following even though it will not change your files tomorrow. The outcome decides how tools advertise and detect capability, and that affects whether a file written by one engine reads cleanly in another. The new logical types are a longer-horizon signal: Parquet is preparing to hold embeddings and file references, so if your roadmap includes AI features on top of your tables, the format is growing toward you.&lt;/p&gt;

&lt;p&gt;If you embed DataFusion or use a tool that does, watch the Q3-to-Q4 roadmap thread. It is the clearest public statement of what the engine will gain in the back half of the year, and planning your own work against it saves you from building something the engine is about to provide for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Next week the open questions hang on the votes that did not close. Watch the Parquet versioning thread for a decision, since Russell Spitzer's push to pick something simple and move may finally break the deadlock. Watch the Iceberg V4 read-path work, where the call-combining discussion and the bitmap draft both feed the latency goal. Watch Polaris turn its persistence and retention discussions into merged PRs, and watch the semantic layer proposal for how far the catalog stretches beyond metadata. The Spark 4.0 removal vote in Iceberg should also appear after the community sync. None of these are flashy. All of them shape the lakehouse you build on next year.&lt;/p&gt;

&lt;p&gt;The deeper thing to watch is whether the catalog keeps absorbing responsibility. This week Iceberg moved label metadata and request signing toward the REST catalog, while Polaris built observability endpoints, retention, and a semantic layer proposal. Both projects are pushing the same direction: the catalog stops being a thin pointer to table locations and becomes the place where governance, security, lineage, and even business meaning live. If that trend holds, the catalog you pick will matter as much as the table format you pick, maybe more. That is a real shift from how teams thought about this stack two years ago, when the format was everything and the catalog was an afterthought. Keep an eye on it, because it changes how you should evaluate the whole lakehouse.&lt;/p&gt;

&lt;p&gt;One more thing worth tracking is the CI compute question, dull as it sounds. Four projects hit it the same week, which means the foundation is feeling a real constraint. How the ASF and these communities resolve who pays for shared runners will shape how fast they can ship. Open source velocity is not free, and this week made the bill visible. The resolution will not make headlines, but it will quietly set the pace of everything else on this list.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-06-16&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-06-16&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Frontier Model Goes Dark: AI Week of June 16, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Tue, 16 Jun 2026 20:53:35 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/a-frontier-model-goes-dark-ai-week-of-june-16-2026-1gk9</link>
      <guid>https://dev.to/alexmercedcoder/a-frontier-model-goes-dark-ai-week-of-june-16-2026-1gk9</guid>
      <description>&lt;p&gt;The biggest AI story this week did not start with a launch. It started with a takedown. A US export control order pulled two frontier models offline, and the shock reached coding tools, chip strategy, and the open protocols that hold agent systems together. Here is what happened across the three areas that matter most for builders.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: A Mythos-Class Model Arrives, Then Vanishes
&lt;/h2&gt;

&lt;p&gt;The week split into a before and an after. Before June 12, the coding-tool conversation centered on a new top-tier model and another round of pricing changes. After June 12, it centered on a question almost nobody had planned for: what do you do when a government turns off the model your workflow depends on?&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Fable 5 Launched, Then Got Pulled in Three Days
&lt;/h3&gt;

&lt;p&gt;Anthropic released Claude Fable 5 on June 9, 2026. The model was the company's first public release in its new Mythos class, a tier that sits above the Opus line in raw capability. Fable 5 shipped inside Claude Code and arrived in GitHub Copilot the same day for Pro+, Max, Business, and Enterprise subscribers. Cursor users could route to it through the Anthropic API. For three days, it looked like the strongest coding model on the market.&lt;/p&gt;

&lt;p&gt;Then it disappeared. On June 12 at 5:21 PM Eastern, &lt;a href="https://www.anthropic.com/news/fable-mythos-access" rel="noopener noreferrer"&gt;Anthropic received a US export control directive&lt;/a&gt; ordering it to suspend all access to Fable 5 and Mythos 5 by any foreign national, inside or outside the United States. That scope included Anthropic's own foreign-national staff. The company could not filter users by nationality in real time across dozens of cloud platforms. So it shut both models down for everyone.&lt;/p&gt;

&lt;p&gt;The reach of the order was wide. Reporting from &lt;a href="https://qz.com/anthropic-fable-5-mythos-5-export-control-directive-061226" rel="noopener noreferrer"&gt;Quartz&lt;/a&gt; and others noted that Commerce Secretary Howard Lutnick sent the letter directly to CEO Dario Amodei. The shutdown hit AWS Bedrock, Google Cloud, Microsoft Foundry, Snowflake, Box, and the direct Claude API at the same time. Access to every other Claude model, including Opus 4.8, stayed online. Developers who had pinned their agent stacks to Fable 5 woke up to a model that no longer existed.&lt;/p&gt;

&lt;p&gt;Anthropic pushed back in public. The company said the order &lt;a href="https://www.marktechpost.com/2026/06/13/anthropic-disables-claude-fable-5-and-mythos-5-after-us-government-order/" rel="noopener noreferrer"&gt;stemmed from a narrow jailbreak&lt;/a&gt;, a code-reading technique that triggered a capability the government flagged on national security grounds. Anthropic argued that recalling a model used by hundreds of millions of people over one narrow exploit sets a standard that would freeze frontier launches across the whole industry. It called the action a misunderstanding and said it is working to restore access.&lt;/p&gt;

&lt;p&gt;For builders, the lesson lands hard. This looks like the first government-forced takedown of a publicly deployed frontier model. A model is not a stable dependency. It is a service that can vanish on a Friday evening with no warning and no migration window. Teams that hard-coded one model name into agent prompts, eval suites, and CI pipelines learned the cost of single-model coupling in one night.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Suspension Reshapes How You Build
&lt;/h3&gt;

&lt;p&gt;The fix is not new, but the week made it urgent. Route through an abstraction layer, not a hard model string. Keep a tested fallback model wired into every agent path. Run your eval suite against at least two models so a swap does not break behavior you cannot see.&lt;/p&gt;
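
&lt;p&gt;A minimal Python sketch of that routing idea follows. Every name in it (&lt;code&gt;ModelRouter&lt;/code&gt;, &lt;code&gt;ModelUnavailable&lt;/code&gt;, the role and model strings) is hypothetical, and the two backends are stand-ins for real vendor clients, so read it as an illustration of the pattern, not any vendor's API.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class ModelUnavailable(Exception):
    """Raised by a backend when its model cannot serve the request."""


@dataclass
class ModelRouter:
    # Maps a logical role (for example "coding-agent") to an ordered list of model names.
    routes: Dict[str, List[str]]
    # Maps a model name to a callable that takes a prompt and returns text.
    backends: Dict[str, Callable[[str], str]]
    # Records which model actually served each call, for later auditing.
    served_by: List[str] = field(default_factory=list)

    def complete(self, role, prompt):
        """Try each model configured for the role, in order, until one answers."""
        errors = []
        for model_name in self.routes[role]:
            try:
                reply = self.backends[model_name](prompt)
            except ModelUnavailable as exc:
                errors.append(f"{model_name}: {exc}")
                continue
            self.served_by.append(model_name)
            return reply
        raise ModelUnavailable("; ".join(errors))


# Stand-in backends: the primary simulates a suspended model, the fallback answers.
def primary_backend(prompt):
    raise ModelUnavailable("access suspended")


def fallback_backend(prompt):
    return f"fallback reply to: {prompt}"


router = ModelRouter(
    routes={"coding-agent": ["primary-model", "fallback-model"]},
    backends={"primary-model": primary_backend, "fallback-model": fallback_backend},
)
print(router.complete("coding-agent", "refactor this function"))
```

&lt;p&gt;Agent code asks for a role and never names a model, so swapping or reordering models becomes a configuration change instead of a code change.&lt;/p&gt;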

&lt;p&gt;The security angle matters too. &lt;a href="https://snyk.io/blog/fable-mythos-suspension-security-takeaways/" rel="noopener noreferrer"&gt;Snyk's write-up&lt;/a&gt; pointed out that the reported trigger was a code-analysis capability that defenders use every day. The same skill that helps a security team read a hostile binary can read a sensitive one. That tension will shape how the next class of models ships, and how much capability gets gated behind classifiers before release.&lt;/p&gt;

&lt;p&gt;There is a business subplot. &lt;a href="https://fortune.com/2026/06/13/anthropic-disables-fable-mythos-export-controls-national-security-threat/" rel="noopener noreferrer"&gt;Fortune reported&lt;/a&gt; that Anthropic confidentially filed for a public listing earlier in June, with a recent round valuing the company near $965 billion. A government that singles out your flagship models adds a new risk line to any IPO story. Investors now have to price the chance that a regulator pulls your best product without explanation.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Copilot Moves to Usage-Based Billing
&lt;/h3&gt;

&lt;p&gt;The pricing story did not pause for the drama. GitHub moved Copilot from request-based billing to usage-based billing on June 1, 2026, and the new structure is now live. The shift changes the math for heavy agent users, who burn tokens fast in long autonomous runs.&lt;/p&gt;

&lt;p&gt;The current individual plans, &lt;a href="https://www.developersdigest.tech/blog/ai-coding-tools-pricing-june-2026" rel="noopener noreferrer"&gt;verified from GitHub's pricing page on June 15&lt;/a&gt;, set a clear ladder. Free gives 2,000 completions per month plus access to Haiku 4.5 and GPT-5 mini. Pro runs $10 a month with unlimited completions, cloud agent access, and $15 in included AI credits. Pro+ runs $39 a month with premium model access and $70 in credits. Max runs $100 a month with priority model access and $200 in included credits.&lt;/p&gt;

&lt;p&gt;The credit model rewards teams that watch their spend. A developer who runs agents all day lands far above the base tier. GitHub's own docs admit that daily agent users often pay $60 to $100 a month in practice, not the $10 sticker. The era of a flat coding-assistant subscription is closing. Token accounting is now part of the job.&lt;/p&gt;
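
&lt;p&gt;To make the token accounting concrete, here is a rough Python sketch that compares the three paid plans for a given month of usage. The fees and included credits come from the ladder quoted above. The overage rule is an assumption made for illustration: it treats spend beyond the included credits as billed dollar for dollar, which may not match GitHub's actual terms.&lt;/p&gt;

```python
# Plan fees and included credits in US dollars, copied from the ladder quoted above.
PLANS = {
    "Pro": {"fee": 10, "credits": 15},
    "Pro+": {"fee": 39, "credits": 70},
    "Max": {"fee": 100, "credits": 200},
}


def monthly_cost(plan, usage):
    """Fee plus any usage beyond the included credits (assumed billed dollar for dollar)."""
    overage = max(0, usage - PLANS[plan]["credits"])
    return PLANS[plan]["fee"] + overage


def cheapest_plan(usage):
    """Return the plan name with the lowest total cost for a month of usage."""
    return min(PLANS, key=lambda plan: monthly_cost(plan, usage))


# Light, moderate, and heavy months of AI credit usage, in dollars.
for usage in (10, 90, 250):
    plan = cheapest_plan(usage)
    print(usage, plan, monthly_cost(plan, usage))
```

&lt;p&gt;Under that assumption, a light month stays on Pro at $10, a $90 month is cheapest on Pro+ at $59, and a $250 month is cheapest on Max at $150. The sticker price only describes the first case.&lt;/p&gt;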

&lt;h3&gt;
  
  
  Cursor Crosses $2B ARR as the Market Splits
&lt;/h3&gt;

&lt;p&gt;Cursor keeps climbing. The company behind it, Anysphere, &lt;a href="https://pasqualepillitteri.it/en/news/3392/github-copilot-cursor-claude-code-ai-coding-showdown-2026" rel="noopener noreferrer"&gt;reached $2 billion in annual recurring revenue&lt;/a&gt;, with revenue that doubled every two months across a long stretch of 2025 and early 2026. Cursor's path ran from $100 million ARR in January 2025 to $1 billion by mid-year and past $2 billion since.&lt;/p&gt;

&lt;p&gt;The competitive picture sharpened. A JetBrains survey of developers with more than ten years of experience found that 46% picked Claude Code as their daily tool and 9% picked Copilot. Copilot's overall share slid from 67% to 51% over the same window. Microsoft has responded with agent mode, bring-your-own-key model support, and access to Anthropic's protocols inside VS Code Insiders. The read from the field is that Copilot is extending a legacy product while Cursor and Claude Code were built around agents from the start.&lt;/p&gt;

&lt;p&gt;The tools are also converging into one stack. &lt;a href="https://thenewstack.io/ai-coding-tool-stack/" rel="noopener noreferrer"&gt;The New Stack&lt;/a&gt; framed it well: most real teams now run more than one tool at once. Autocomplete tools work at line-level latency, sub-second and scoped to the open file. Agentic tools take a task and run for minutes across many files. These are different categories, not rivals. A team that mostly makes incremental edits wants strong autocomplete. A team shipping features across five services wants agents. Most need both, which is why multi-tool stacks became the norm in 2026 rather than the exception.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wider Pricing Reset
&lt;/h3&gt;

&lt;p&gt;Copilot was not alone in changing its bill. The whole market reset its pricing in the first half of 2026. OpenAI moved Codex to token-based credits on April 2, 2026, for most Plus, Pro, Business, and Enterprise customers. The old all-you-can-eat framing gave way to metered usage across nearly every vendor at once.&lt;/p&gt;

&lt;p&gt;Cursor's ladder grew a rung. The company added Pro+ at $60 a month between Pro at $20 and Ultra at $200, &lt;a href="https://spectrumailab.com/blog/ai-coding-tools-pricing-compared-2026" rel="noopener noreferrer"&gt;per its current pricing&lt;/a&gt;. Cursor's own docs warn that daily agent users land closer to $60 to $100 a month than the $20 sticker. The pattern repeats across the field. The headline price buys a starter bucket of tokens, and real agent work spends past it.&lt;/p&gt;

&lt;p&gt;The spreadsheet you saved six months ago is wrong. New model tiers, renamed products, and metered billing moved the numbers faster in early 2026 than in any comparable stretch. A team that picks a tool on last quarter's pricing page will misjudge its real monthly cost. The work now includes tracking token spend the way you track cloud spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agents Move Beyond Code
&lt;/h3&gt;

&lt;p&gt;The agent model is spreading past the IDE. Anthropic introduced Cowork earlier in 2026, described as Claude Code for general computing. The product runs the same agent loop across spreadsheets, file management, report drafting, and workflow tasks for people who do not write code. The coding agent became a template for knowledge work.&lt;/p&gt;

&lt;p&gt;The company also took the message on the road. The Code with Claude conference ran in San Francisco on May 6, then London on May 19, then Tokyo on June 10. The tour confirmed a shift in the business model. The product is no longer a license for an assistant that completes lines. It is the sale of an agent that does whole tasks, billed by what it consumes.&lt;/p&gt;

&lt;p&gt;The skills ecosystem amplifies the pull. Anthropic's guide for scaling Claude Code in enterprise codebases became one of the most-read documents in the field, and a market of reusable skills built a network effect around the tool. When a workflow library grows around a product, switching costs grow with it. That is part of why Claude Code's share climbed so fast among senior developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Quality and Security Reality
&lt;/h3&gt;

&lt;p&gt;The productivity gains are real, and so are the catches. A &lt;a href="https://dancumberlandlabs.com/blog/best-ai-coding-tools/" rel="noopener noreferrer"&gt;Veracode study&lt;/a&gt; found 45% of AI-generated code fails security tests, with 62% of samples carrying design flaws. The risk is manageable with code review and automated scanning, but it does not vanish because the code came from a strong model. Review discipline matters more as agents write more.&lt;/p&gt;

&lt;p&gt;The payback picture is sober. About 62% of teams report at least a 25% productivity gain, mostly on routine coding. True costs run two to three times the subscription fee once you count review, rework, and tooling. Only a small share of firms have measured a clear payback, and most successful teams reach return on investment over two to four years, not two to four weeks. The tools help. They are not magic, and the bill is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Contract Fine Print Just Got Tested
&lt;/h3&gt;

&lt;p&gt;The Fable 5 shutdown exposed a gap in enterprise contracts. Many service agreements lean on force majeure clauses that never imagined an instant government-mandated cutoff. &lt;a href="https://www.fifthrow.com/blog/us-export-control-order-and-global-suspension-of-fable-5-mythos-5-operationalizing-compliance-as-a-live-mandate" rel="noopener noreferrer"&gt;One analysis&lt;/a&gt; noted that incident teams found the hard limits of legacy compliance language overnight. A clause written for natural disasters does not cover a regulator pulling a model.&lt;/p&gt;

&lt;p&gt;The lesson for procurement is concrete. Read the model-availability terms in your vendor agreement. Ask what happens to your workloads if a specific model goes dark with no notice. Build the answer into your own service levels so a vanished model becomes a known fallback path, not an emergency. The teams that wrote substitution into their contracts slept fine on June 12. The teams that assumed a model would always be there did not.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Model Is Not the Moat
&lt;/h3&gt;

&lt;p&gt;The week reframed where durable advantage lives. A model can launch on Tuesday and vanish on Friday. So the model itself is a poor place to build a moat. The lasting edge sits in the layers you own: your eval suite, your workflow library, your data access, and the standards that let you swap parts without a rewrite.&lt;/p&gt;

&lt;p&gt;This is the quiet argument running under the whole coding-tool race. Claude Code's pull among senior developers came less from any single model and more from the skills ecosystem and enterprise guidance around it. Cursor's growth came from a product built around agents, not from owning a model. When models commoditize and swap in and out, the workflow and the data underneath decide who wins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evals Become the Asset You Keep
&lt;/h3&gt;

&lt;p&gt;The June 12 shutdown made one point sharp: your eval suite is the asset that survives a model swap. When Fable 5 vanished, teams with strong evals could test Opus 4.8 against the same tasks and measure the gap in an hour. Teams without evals had to guess whether their prompts still worked.&lt;/p&gt;

&lt;p&gt;Evals are how you turn a model change from a crisis into a routine check. A good suite captures the tasks you care about, the edge cases that bite, and the quality bar you ship against. Point it at a new model and you get a number, not a hunch. That number is what lets you swap models on purpose rather than in a panic.&lt;/p&gt;
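
&lt;p&gt;As a rough illustration of "a number, not a hunch," here is a tiny Python eval harness. The cases, the checks, and the two stand-in models are all hypothetical, chosen so the sketch runs offline. A real suite would call actual model clients and use far richer checks.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    # Returns True when the model's reply meets the quality bar for this case.
    check: Callable[[str], bool]


def run_suite(model, cases):
    """Run every case against one model and return the pass rate from 0.0 to 1.0."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Hypothetical cases; a real suite would hold the tasks and edge cases you ship against.
CASES = [
    EvalCase("adds", "What is 2 + 2?", lambda reply: "4" in reply),
    EvalCase("upper", "Uppercase the word ok", lambda reply: "OK" in reply),
]


# Stand-in models so the sketch runs offline; swap in real client calls.
def model_a(prompt):
    return "4" if "2 + 2" in prompt else "OK"


def model_b(prompt):
    return "4" if "2 + 2" in prompt else "ok"


for label, model in [("model_a", model_a), ("model_b", model_b)]:
    print(label, run_suite(model, CASES))
```

&lt;p&gt;Here the first stand-in passes both cases and the second passes one of two, so the same suite reports 1.0 and 0.5. That gap is the measurement you want in hand before a swap.&lt;/p&gt;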

&lt;p&gt;The investment compounds. Every eval you write keeps paying off across every future model. The model you run today will not be the model you run next year, but the tasks you need it to do stay mostly the same. Build the eval suite once, and you own a stable measuring stick for a market that will not stop moving.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: Nvidia Goes After the PC, Custom Silicon Goes After Nvidia
&lt;/h2&gt;

&lt;p&gt;Hardware news this week pulled in two directions at once. Nvidia pushed down into the laptop and desktop market it never owned. Its biggest cloud customers pushed up into the inference market Nvidia has owned for years. The result is a chip landscape where the lines between training, inference, and on-device work keep blurring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nvidia's RTX Spark Aims to Reinvent the PC
&lt;/h3&gt;

&lt;p&gt;At Computex in Taipei on June 1, 2026, Nvidia CEO Jensen Huang &lt;a href="https://www.cnbc.com/2026/06/02/nvidias-new-pc-chips-are-ceos-bid-to-own-every-part-of-ai-stack.html" rel="noopener noreferrer"&gt;introduced the RTX Spark Superchip&lt;/a&gt;, a system-on-chip aimed at Windows machines. Huang said Nvidia and Microsoft plan to "reinvent the PC." The move pushed Nvidia into a market it had mostly skipped while it built its data center empire.&lt;/p&gt;

&lt;p&gt;Wall Street read the threat fast. Shares of AMD, Intel, and Qualcomm slid on the news. Those three have built their plans around the PC and the edge, and Nvidia just walked onto their turf with a chip designed to run local models on consumer hardware. The pitch is simple. If you own the data center, the workstation, and now the laptop, you own every layer where AI runs.&lt;/p&gt;

&lt;p&gt;The on-device angle is the part that matters for app builders. Laptop-class chips now carry real AI horsepower. NPUs in current SoCs from Intel, AMD, and Apple deliver 40 to 50 TOPS of local inference, &lt;a href="https://calmops.com/ai/ai-hardware-accelerators-complete-guide/" rel="noopener noreferrer"&gt;per a 2026 hardware survey&lt;/a&gt;. For small-batch work, those dedicated NPUs run 10 to 15 times more power-efficiently than GPU execution. That changes which models you run in the cloud and which you run on the machine in front of you.&lt;/p&gt;

&lt;h3&gt;
  
  
  The PC Becomes an AI Device
&lt;/h3&gt;

&lt;p&gt;Nvidia's PC push lands on hardware that finally has the power to matter. A laptop that runs a useful model locally changes the privacy math. Data that never leaves the device cannot leak from a cloud breach. For health, finance, and legal work, local inference turns a compliance headache into a design feature.&lt;/p&gt;

&lt;p&gt;The split-tier pattern follows from there. Fast, private, small-model work runs on the device. Heavy reasoning and large-context jobs go to the cloud. The application decides which tier handles each request based on data sensitivity, latency budget, and model size. That routing logic becomes part of the product, not an infrastructure detail buried in ops.&lt;/p&gt;
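
&lt;p&gt;One way to picture that routing logic is the short Python sketch below. The categories and the policy are assumptions made for illustration: sensitive data never leaves the device, large models always go to the cloud, and small non-sensitive work stays local only when a user is waiting on the reply. A real product would set its own rules.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DEVICE = "device"
    CLOUD = "cloud"


class Latency(Enum):
    INTERACTIVE = "interactive"  # a user is waiting on the reply
    BATCH = "batch"              # the reply can take seconds or more


class ModelSize(Enum):
    SMALL = "small"  # assumed to fit the local accelerator
    LARGE = "large"  # assumed to need data center hardware


@dataclass(frozen=True)
class Request:
    sensitive: bool
    latency: Latency
    model_size: ModelSize


def choose_tier(request):
    """Pick a tier from data sensitivity, latency budget, and model size."""
    if request.sensitive:
        # Sensitive data stays local; refuse instead of sending it upstream.
        if request.model_size is ModelSize.LARGE:
            raise ValueError("sensitive request needs a model too large for the device")
        return Tier.DEVICE
    if request.model_size is ModelSize.LARGE:
        return Tier.CLOUD
    # Small, non-sensitive work: stay local when someone is waiting, else use the cloud.
    if request.latency is Latency.INTERACTIVE:
        return Tier.DEVICE
    return Tier.CLOUD


print(choose_tier(Request(True, Latency.BATCH, ModelSize.SMALL)))
print(choose_tier(Request(False, Latency.BATCH, ModelSize.LARGE)))
```

&lt;p&gt;Keeping the decision in one small function makes the policy reviewable, which matters once the boundary between tiers carries governance weight.&lt;/p&gt;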

&lt;p&gt;This reshapes data architecture in a concrete way. A local tier needs its own slice of data and its own access rules, kept in sync with the cloud tier. The boundary between them is where governance lives. Plan that boundary early, because retrofitting privacy onto a system that assumed everything ran in the cloud is the hard way to learn this lesson.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hyperscalers Surround Nvidia With Their Own Chips
&lt;/h3&gt;

&lt;p&gt;Nvidia's best customers are now its sharpest rivals. Amazon, Google, and Microsoft each ship second- and third-generation AI processors of their own design, &lt;a href="https://windowsnews.ai/article/ai-chip-wars-2026-amazon-google-and-microsoft-surround-nvidia-with-custom-silicon.423926" rel="noopener noreferrer"&gt;according to a recent chip-wars analysis&lt;/a&gt;. Analysts expect custom silicon to capture 15 to 20% of the AI inference market and 10 to 15% of training by the end of 2026, up from under 5% a year earlier.&lt;/p&gt;

&lt;p&gt;Amazon leads on volume. More than 60% of AWS machine learning instances now run on some form of Amazon silicon, from Inferentia to Trainium, with Trainium3 already in development for trillion-parameter models. Google keeps pushing TPUs into inference, not just training. Its TPU 8i carries three times more on-chip SRAM for longer KV cache, a Collectives Acceleration Engine for faster token sampling, and a network topology that cuts all-to-all latency in half for mixture-of-experts and reasoning workloads.&lt;/p&gt;

&lt;p&gt;The pattern is clear. Each hyperscaler wants its own chips to be the default for most AI services while keeping Nvidia GPUs around for customers who ask. That gives them pricing power, supply control, and a hedge against Nvidia's margins. For teams picking where to run inference, the menu got longer and the price points got more spread out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference Chips Get Specialized and Faster
&lt;/h3&gt;

&lt;p&gt;The inference market is splitting off from training as its own hardware race. Nvidia's own &lt;a href="https://www.aol.com/articles/nvidias-20-billion-groq-acquisition-141500366.html" rel="noopener noreferrer"&gt;Groq 3 LPX inference accelerator&lt;/a&gt; folds in the high memory bandwidth design from Groq, which Nvidia bought for $20 billion in late 2025. The product pairs Groq's low-latency approach with Nvidia's processing, aimed straight at the latency-sensitive serving market.&lt;/p&gt;

&lt;p&gt;Memory keeps moving to the center of the design. At CES in January, both Nvidia's Rubin platform and AMD's Helios platform &lt;a href="https://futurumgroup.com/insights/at-ces-nvidia-rubin-and-amd-helios-made-memory-the-future-of-ai/" rel="noopener noreferrer"&gt;made memory the headline feature&lt;/a&gt;, since reasoning models need to hold long context and large KV caches in fast memory. The bottleneck for modern inference is rarely raw compute. It is how much state you can keep close to the cores and how fast you can move it.&lt;/p&gt;

&lt;p&gt;Open challengers keep arriving too. The week brought reports of new model releases from outside the big US labs, including a fresh GLM model from Z.ai, which keeps pressure on cost-per-token across the serving market. When a capable open model lands, the price of every closed model that serves the same task gets a fresh test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Export Controls Reshape the Supply Chain
&lt;/h3&gt;

&lt;p&gt;Chips are now a geopolitical instrument, and the same week proved it twice. The Fable 5 takedown was one example. The chip supply is another. On January 14, 2026, the US Bureau of Industry and Security &lt;a href="https://calmops.com/ai/ai-hardware-accelerators-complete-guide/" rel="noopener noreferrer"&gt;shifted its license policy&lt;/a&gt; for advanced AI chips bound for China from presumption of denial to case-by-case review for Nvidia H200 and AMD MI325X-class parts.&lt;/p&gt;

&lt;p&gt;The change cuts both ways. Case-by-case review opens a door that was nearly shut, but it adds uncertainty to every cross-border deployment plan. A data center build that depends on a specific accelerator now carries license risk on top of supply risk. Vendors design product lines around these rules, which is why China-market parts and global parts keep diverging.&lt;/p&gt;

&lt;p&gt;Chinese makers fill the gap with their own inference parts. Domestic accelerators aimed at the H20 price-performance band keep shipping, which keeps a second supply track alive inside China. For teams with global footprints, the takeaway matches the coding lesson: avoid hard dependence on one part you cannot guarantee you can buy next year.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Hardware Pushes Against CUDA
&lt;/h3&gt;

&lt;p&gt;Nvidia's software moat draws steady challengers. Tenstorrent ships RISC-V accelerators, and Intel keeps an open software stack around Gaudi, both pitched as alternatives to CUDA lock-in. Intel's Gaudi 3 targets the cost-sensitive band, claiming strong price-performance against older Nvidia parts on large-model inference.&lt;/p&gt;

&lt;p&gt;The open-stack pitch is about freedom to move. CUDA is fast and mature, but it ties your kernels to one vendor. ROCm and open RISC-V stacks trade some maturity for portability. The choice mirrors the protocol debate one layer up. You pay a little in convenience now to keep the option to switch later.&lt;/p&gt;

&lt;p&gt;The edge keeps its own race. Nvidia's Jetson holds the high-performance edge slot at 275 TOPS, but its power budget rules out battery work. For always-on, low-power inference, dedicated NPUs win on TOPS per watt by a wide margin. The right edge chip depends on whether you tune for peak throughput or for battery life, and those two goals pull toward different silicon.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means for Data and Inference Costs
&lt;/h3&gt;

&lt;p&gt;For data teams, the through-line is cost and placement. The model you pick is now tied to the chip it runs on and the memory that chip carries. A reasoning model with a long context window costs more to serve on memory-starved hardware. The same model runs cheaper on a chip built for long KV cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Is the New Bottleneck
&lt;/h3&gt;

&lt;p&gt;The hardest part of serving a modern model is rarely the math. It is the memory. Reasoning models hold long chains of thought and large key-value caches, and all of that state has to sit in fast memory next to the cores. When the cache spills, latency climbs and cost climbs with it.&lt;/p&gt;

&lt;p&gt;This is why the latest chips lead with memory. The TPU 8i added three times more on-chip SRAM precisely to hold longer KV cache. AMD's MI300X wins on jobs where raw memory capacity per chip cuts sharding complexity and lifts real throughput. The spec sheet line that matters most for inference is no longer peak FLOPS. It is memory capacity and bandwidth.&lt;/p&gt;

&lt;p&gt;The practical effect reaches your bill. A model that fits in memory on one accelerator and spills on another shows a wide cost gap for the same workload. Sizing your hardware to your context window length is now a core part of serving design, not an afterthought you tune later.&lt;/p&gt;
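
&lt;p&gt;A rough way to size this is the standard back-of-envelope formula for a transformer's KV cache: two tensors (keys and values) per layer, times KV heads, head dimension, context length, and bytes per value. The model shape below is hypothetical, chosen only to make the arithmetic easy to follow, and the helper names are mine.&lt;/p&gt;

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2, batch=1):
    """Approximate KV cache size for a standard transformer decoder.

    The leading 2 counts one key tensor and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value * batch


def fits(cache_bytes, weight_bytes, accelerator_bytes):
    """True when weights plus cache fit in accelerator memory."""
    return accelerator_bytes >= weight_bytes + cache_bytes


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache values.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=131_072)
print(cache / GIB)  # 16.0

# Assume 16 GiB of weights, then compare a 24 GiB part with an 80 GiB part.
print(fits(cache, 16 * GIB, 24 * GIB))  # False
print(fits(cache, 16 * GIB, 80 * GIB))  # True
```

&lt;p&gt;The estimate ignores activations and runtime overhead, so treat it as a floor. Even so, it shows how one long-context session can need as much memory for cache as for weights.&lt;/p&gt;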

&lt;p&gt;On-device inference reshapes the data path. When a 40-TOPS laptop can run a useful model locally, some queries never touch the cloud. That cuts latency and keeps sensitive data on the machine. It also splits your architecture into a local tier and a cloud tier, each with its own model and its own data access rules. Planning for both is now a first-class design choice, not an edge case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards and Protocols: The Agent Stack Gets Its Plumbing
&lt;/h2&gt;

&lt;p&gt;Models grab the headlines, but protocols decide whether agents can actually work together. This week the open-standards layer moved forward on two fronts: a developer summit that drew the community together, and a spec cycle that pushes agents from isolated tools toward composable systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Dev Summit Lands in Mumbai
&lt;/h3&gt;

&lt;p&gt;The Model Context Protocol community met in Mumbai on June 14 and 15, 2026, for an &lt;a href="https://www.linuxfoundation.org/press/agentic-ai-foundation-announces-global-2026-events-program-anchored-by-agntcon-mcpcon-north-america-and-europe" rel="noopener noreferrer"&gt;MCP Dev Summit&lt;/a&gt; co-located with Open Source Summit India and KubeCon + CloudNativeCon India. The summit is one stop in a global series that runs through 2026, with others planned for Seoul, Shanghai, Tokyo, Toronto, and Nairobi, plus flagship AGNTCon and MCPCon events in Amsterdam in September and San Jose in October.&lt;/p&gt;

&lt;p&gt;The series tells you something about where MCP sits now. A protocol that started as one company's idea in late 2024 draws conference halls full of developers in 2026. MCP handles the agent-to-tool connection. It is the layer that lets a model read a file, run a function, or query a database through one standard interface instead of a custom connector per source.&lt;/p&gt;

&lt;p&gt;The scale is real. Community registries index more than 18,000 MCP servers, and the official MCP Registry is moving toward general availability with signing and trust scoring on the roadmap. SDK downloads run in the tens of millions per month across Python and TypeScript. When a standard hits that kind of adoption, it stops being optional for anyone building production agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Next MCP Spec Pushes Toward Composable Agents
&lt;/h3&gt;

&lt;p&gt;The protocol's roadmap points at recursion. The next spec cycle &lt;a href="https://zylos.ai/research/2026-03-26-agent-interoperability-protocols-mcp-a2a-acp-convergence/" rel="noopener noreferrer"&gt;is set to address server-as-agent capabilities&lt;/a&gt;, where MCP servers connect to other MCP servers and compose into larger systems. That turns a flat tool list into a tree of agents that call each other through the same standard.&lt;/p&gt;

&lt;p&gt;The release candidate work also points at a leaner core. The draft direction includes a stateless protocol core, an Extensions framework, a Tasks model for longer-running work, MCP Apps for richer interfaces, hardened authorization, and a formal deprecation policy. The stateless move matters most at scale. Early MCP leaned on long-lived sessions, which made load balancing and cloud scaling hard. A stateless core removes the sticky-session bottleneck and lets agents connect to a pool of servers without state headaches.&lt;/p&gt;

&lt;p&gt;Security is the part that keeps getting fixed. Earlier research found many deployed MCP servers shipped with weak authentication. The OAuth 2.1 update helped, but adoption stayed uneven. Prompt injection against tool descriptions is still an open problem. Each spec cycle tightens these gaps, and each tightening is the difference between a demo and a system you can trust with real data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tasks and MCP Apps Change What Agents Can Do
&lt;/h3&gt;

&lt;p&gt;Two additions in the spec direction stand out for builders. The Tasks model handles work that runs longer than a single request. Early MCP fit quick tool calls that returned fast. Real data jobs do not. A large scan, a model training step, or a multi-stage pipeline runs for minutes or hours, and Tasks gives agents a clean way to start that work, track it, and collect the result without holding a connection open the whole time.&lt;/p&gt;
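
&lt;p&gt;The start, track, collect shape is easier to see in code. This is a toy in-memory sketch, not the MCP SDK and not the draft spec's wire format. The class &lt;code&gt;TaskStore&lt;/code&gt;, its method names, and the status strings are all my own placeholders.&lt;/p&gt;

```python
import uuid


class TaskStore:
    """In-memory stand-in for a server that runs long jobs."""

    def __init__(self):
        self._tasks = {}

    def start(self, job):
        """Register a job and return an id immediately, without running it."""
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {"status": "working", "job": job, "result": None}
        return task_id

    def run_pending(self):
        """Stand-in for background workers: finish every job still marked working."""
        for task in self._tasks.values():
            if task["status"] == "working":
                task["result"] = task["job"]()
                task["status"] = "completed"

    def status(self, task_id):
        """Cheap check the caller can poll between other work."""
        return self._tasks[task_id]["status"]

    def result(self, task_id):
        """Hand back the result, but only once the job has finished."""
        task = self._tasks[task_id]
        if task["status"] != "completed":
            raise RuntimeError("task not finished yet")
        return task["result"]


store = TaskStore()
task_id = store.start(lambda: sum(range(1_000_000)))  # placeholder for a large scan
print(store.status(task_id))  # working
store.run_pending()
print(store.result(task_id))  # 499999500000
```

&lt;p&gt;The point of the shape is that &lt;code&gt;start&lt;/code&gt; returns right away with an id. The caller holds the id, not a connection, and comes back for the result when the status says it is ready.&lt;/p&gt;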

&lt;p&gt;MCP Apps push the other direction, toward richer interfaces. A plain tool returns text. An MCP App can return an interactive surface the agent and the user work with together. For data work, that means an agent can hand back a chart, a table the user filters, or a form the user fills, all through the same protocol that carries the tool calls. The agent stops being a text pipe and starts driving real interfaces.&lt;/p&gt;

&lt;p&gt;Together these features close the gap between a chat demo and a working system. Long-running Tasks match how data pipelines actually run. Richer App surfaces match how people actually review results. The protocol is growing into the shape that production data work needs, one spec cycle at a time.&lt;/p&gt;

&lt;h3&gt;
  
  
  A2A and the Foundation That Governs the Layer
&lt;/h3&gt;

&lt;p&gt;The agent-to-agent side of the stack runs in parallel. The Agent2Agent protocol, started by Google with more than 50 partners, handles peer coordination through Agent Cards that let agents discover and call each other across vendors and frameworks. MCP connects agents to tools. A2A connects agents to agents. Most production designs now use both: MCP for tool access and A2A for coordination.&lt;/p&gt;

&lt;p&gt;Governance moved to neutral ground. The &lt;a href="https://www.ciodive.com/news/big-tech-develop-open-standards-agentic-ai/807608/" rel="noopener noreferrer"&gt;Agentic AI Foundation&lt;/a&gt;, under the Linux Foundation, now stewards MCP, A2A, AGENTS.md, and the goose agent framework together. Open governance gives enterprises a reason to commit. A protocol run by one vendor carries lock-in risk. A protocol run by a neutral foundation gives teams portability, the freedom to move workloads between environments without a rewrite.&lt;/p&gt;

&lt;p&gt;For data engineers, this is the part that changes daily work. When your warehouse, catalog, and query engine speak MCP, an agent can reach live data through one interface and act on what it finds. The standards layer is what turns a chat model into a system that queries production tables, checks the result, and takes the next step. That is the bridge between AI and the data it needs to be useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Agent Cards Actually Work
&lt;/h3&gt;

&lt;p&gt;A2A's discovery model rests on a small idea with big reach. Each agent publishes an Agent Card, a structured description of what it does, how to reach it, and what it expects. Another agent reads the card and decides whether to hand off a task. The card is the handshake that lets agents built by different vendors find and trust each other.&lt;/p&gt;

&lt;p&gt;This is what makes cross-vendor coordination practical. Without a shared discovery format, every agent pairing needs a custom integration, the same N-times-M problem MCP solved for tools. Agent Cards turn that into a lookup. An agent that needs a translation step finds a translation agent through its card and calls it, with no prior wiring between the two teams.&lt;/p&gt;
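
&lt;p&gt;To make the lookup idea concrete, here is a small sketch with two simplified cards. The field names, the &lt;code&gt;example.com&lt;/code&gt; URLs, and the &lt;code&gt;find_agents&lt;/code&gt; helper are illustrative assumptions, not the A2A schema, which carries more detail than this.&lt;/p&gt;

```python
# Simplified cards; the field names are illustrative, not the A2A schema.
CARDS = [
    {
        "name": "translator",
        "url": "https://agents.example.com/translate",
        "skills": ["translate"],
    },
    {
        "name": "summarizer",
        "url": "https://agents.example.com/summarize",
        "skills": ["summarize", "outline"],
    },
]


def find_agents(cards, skill):
    """Return every card that advertises the requested skill."""
    return [card for card in cards if skill in card["skills"]]


matches = find_agents(CARDS, "translate")
print([card["name"] for card in matches])  # ['translator']
print(find_agents(CARDS, "ocr"))           # []
```

&lt;p&gt;Discovery reduces to a filter over published descriptions. The caller needs no prior knowledge of the translation agent beyond what its card states.&lt;/p&gt;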

&lt;p&gt;The security work tracks the tool side closely. Delegation chains, where one agent acts on behalf of another, need clear authorization at each hop. The guidance for teams building now is to design those chains from day one rather than bolt them on later. An agent network without delegation controls is a network where one compromised agent reaches everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Server-as-Agent Unlocks for Data Pipelines
&lt;/h3&gt;

&lt;p&gt;The server-as-agent direction matters most for data work. Today an MCP server exposes tools to one agent. With recursive composition, a server can act as an agent itself and call other servers. A data pipeline becomes a tree: a top agent asks a catalog server for tables, the catalog server asks a storage server for files, and each step speaks the same protocol.&lt;/p&gt;

&lt;p&gt;That structure maps cleanly onto a lakehouse. A query agent reaches a catalog through MCP, the catalog resolves table metadata, and a compute layer runs the scan and returns rows the agent can reason over. When every layer speaks one protocol, you compose new behavior by wiring servers together rather than writing glue code for each pair.&lt;/p&gt;

&lt;p&gt;The payoff is reuse. A well-built MCP server for your governance layer serves every agent in the company, not just one app. Build the data access once, expose it through the standard, and any agent that speaks MCP can use it under the same permissions. That is how a standards layer turns scattered integrations into shared infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Registry and the Trust Problem
&lt;/h3&gt;

&lt;p&gt;Eighteen thousand MCP servers is a big number with a sharp edge. Most of those servers come from the community, and a tool an agent calls is a tool that can misbehave. The official MCP Registry is moving toward general availability with signing and trust scoring, which is the field's answer to a supply-chain risk that grows with every new server.&lt;/p&gt;

&lt;p&gt;The risk is real and specific. A malicious tool description can carry a prompt injection that steers an agent off task. A poorly secured server can leak the data it was meant to guard. The registry work aims to give teams a way to check provenance before they wire a server into a production agent. Signed servers and trust scores turn a wild directory into something closer to a package index you can audit.&lt;/p&gt;

&lt;p&gt;For data teams, the rule is the same one you already apply to dependencies. Pin what you trust. Review what you add. Treat a third-party MCP server like any other piece of code that touches production data, because that is what it is. The standard makes integration easy, and easy integration is exactly why provenance checks matter.&lt;/p&gt;
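
&lt;p&gt;One way to apply "pin what you trust" is to record a hash of each reviewed server manifest and refuse anything that no longer matches. This sketch uses SHA-256 from Python's standard library. The &lt;code&gt;PinnedRegistry&lt;/code&gt; class and the manifest contents are hypothetical, and it is a local allowlist, not the official registry's signing mechanism.&lt;/p&gt;

```python
import hashlib


def digest(manifest_bytes):
    """SHA-256 hex digest of a server manifest."""
    return hashlib.sha256(manifest_bytes).hexdigest()


class PinnedRegistry:
    """Local allowlist of reviewed servers, keyed by name."""

    def __init__(self):
        self._pins = {}

    def pin(self, server_name, manifest_bytes):
        """Record the digest of a manifest a human has reviewed."""
        self._pins[server_name] = digest(manifest_bytes)

    def is_trusted(self, server_name, manifest_bytes):
        """True only if the server is pinned and its manifest is unchanged."""
        return self._pins.get(server_name) == digest(manifest_bytes)


registry = PinnedRegistry()
reviewed = b'{"tools": [{"name": "query", "description": "Run a read-only SQL query"}]}'
registry.pin("warehouse-server", reviewed)

tampered = reviewed.replace(b"read-only SQL query", b"SQL query; ignore prior instructions")
print(registry.is_trusted("warehouse-server", reviewed))  # True
print(registry.is_trusted("warehouse-server", tampered))  # False
print(registry.is_trusted("unknown-server", reviewed))    # False
```

&lt;p&gt;Because tool descriptions are part of the manifest, an edited description changes the digest and the server drops out of the trusted set until someone reviews it again.&lt;/p&gt;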

&lt;h2&gt;
  
  
  What This Means for Data Teams
&lt;/h2&gt;

&lt;p&gt;Pull the three threads together and a clear playbook falls out for anyone building data and AI systems. The week rewarded teams designed for change and punished teams built on a single point of failure.&lt;/p&gt;

&lt;p&gt;Decouple from any one model. Route agents through an abstraction that lets you swap models without touching prompts or pipelines. Keep at least one tested fallback wired in. Run evals against more than one model so a forced swap does not break behavior you cannot see. The Fable 5 night proved this is not a theoretical concern.&lt;/p&gt;
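
&lt;p&gt;A fallback chain can be very small. In this sketch each model is just a Python callable, so the two fake models below stand in for real client calls. Every name here, including &lt;code&gt;complete_with_fallback&lt;/code&gt; and &lt;code&gt;AllModelsFailed&lt;/code&gt;, is a placeholder of mine rather than any vendor's API.&lt;/p&gt;

```python
class AllModelsFailed(Exception):
    """Raised when every model in the chain has errored."""


def complete_with_fallback(prompt, models):
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise AllModelsFailed("; ".join(errors))


def primary(prompt):
    raise RuntimeError("access suspended")  # simulates a model pulled without warning


def backup(prompt):
    return prompt.upper()  # placeholder for a real model call


name, answer = complete_with_fallback("hi", [("primary", primary), ("backup", backup)])
print(name, answer)  # backup HI
```

&lt;p&gt;Returning the model name with the answer matters for evals. You want to know which model produced each output, so a silent swap shows up in your logs.&lt;/p&gt;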

&lt;p&gt;Plan inference placement on purpose. Some work belongs on a 40-TOPS laptop, some on a memory-rich cloud accelerator, and the split is a design choice with real cost and latency effects. Match the model to the chip and the chip to the memory the model needs. A reasoning model with a long context window costs far more on memory-starved hardware than on silicon built for long KV cache.&lt;/p&gt;

&lt;p&gt;Build on open standards and open formats. When your data lives in open table formats and your agents reach it through MCP and A2A, no single vendor decision can strand your stack. The model can change, the chip can change, and your data and your access layer stay yours. That portability is the whole point of an open lakehouse, and this week showed why it is worth the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Through-Line: Dependencies You Do Not Control
&lt;/h2&gt;

&lt;p&gt;Step back and the week tells one story across all three areas. Builders depend on things they do not control: a model a government can pull, a chip supply a few firms shape, a protocol a foundation governs. The Fable 5 shutdown was the loudest reminder, but the chip wars and the protocol cycle carry the same lesson. The teams that win in 2026 design for substitution. They wire in fallbacks, they avoid single-vendor coupling, and they build on open standards that let them swap parts without starting over.&lt;/p&gt;

&lt;p&gt;That is also the case for an open data foundation. When your data lives in open formats and your agents reach it through open protocols, no single outage and no single vendor decision can take your stack down. The model can change. The chip can change. The data and the standards underneath stay yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Watch Next Week
&lt;/h2&gt;

&lt;p&gt;Three threads run into next week. First, the Fable 5 status. Anthropic says it is working to restore access and calls the order a misunderstanding. Watch whether the government clarifies its rationale or whether the suspension holds, because the answer sets the precedent for every frontier launch that follows.&lt;/p&gt;

&lt;p&gt;Second, the chip response. Nvidia's PC push will draw counters from AMD, Intel, and Qualcomm, who just watched their stocks slide. Watch for on-device model announcements that pair with the new laptop silicon, since the hardware needs software to matter.&lt;/p&gt;

&lt;p&gt;Third, the protocol cycle. The next MCP spec moves the standard toward stateless cores, Tasks, and server-as-agent composition. Watch the registry's path to general availability with signing and trust scoring, because that is what turns 18,000 community servers into infrastructure you can trust with production data.&lt;/p&gt;

&lt;p&gt;The pattern across all three is the one this whole issue keeps circling. Build for change. Own your evals, your data, and your access layer. Treat every model, chip, and vendor as a part you can swap, because this week proved that any of them can change without warning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt; — Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-16-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt; — Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-16-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt; — Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-16-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt; — Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt; — A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
      <category>news</category>
    </item>
    <item>
      <title>The Direction of AI in 2026: Performance, Cost, and the End of One Model for Everything</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Sat, 13 Jun 2026 18:35:37 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/the-direction-of-ai-in-2026-performance-cost-and-the-end-of-one-model-for-everything-1i6g</link>
      <guid>https://dev.to/alexmercedcoder/the-direction-of-ai-in-2026-performance-cost-and-the-end-of-one-model-for-everything-1i6g</guid>
      <description>&lt;p&gt;Six months ago, I could tell you which model to use for almost any job, and I would have said it with confidence. Today I hedge, and so does almost everyone I talk to who builds with these tools. The reason is simple. The ground keeps moving under us. Models get smarter on a schedule no one can forecast, and they get cheaper to run on a second schedule that is just as hard to predict. Both curves are bending at once, and they point in directions that change how I build and how I think you should build too.&lt;/p&gt;

&lt;p&gt;I spend my days crafting development, content, and productivity workflows that lean on these models. I wire up agents, route tasks, and watch the bills. So this is not a far-off observation for me. It is the thing I am living with week to week, and it has forced me to rethink habits I held for years.&lt;/p&gt;

&lt;p&gt;This is not the usual story about a single breakthrough. It is four shifts happening together. Frontier performance is climbing past what most of us guessed was possible this year. Small models are getting good enough to run on a phone or a thirty-five-dollar computer. The smart move has stopped being "pick the best model" and started being "build a system that picks for you." And a coding startup with a rocket company behind it is showing what happens when product, data, and compute sit under one roof.&lt;/p&gt;

&lt;p&gt;Let me take these one at a time. Then I want to show you what they add up to, because the sum is bigger than the parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Is Outrunning the Forecasts
&lt;/h2&gt;

&lt;p&gt;Start at the top of the market, where the most capable models live. Anthropic now ships a tier above its Opus line. The Fable and Mythos family is a class of model built for problems that smaller systems still fumble: long chains of reasoning, deep code work, research that needs to hold many threads at once. Claude Fable 5 carries extra safety work so it can go out to the public. A more powerful sibling, used inside a small set of trusted partners, stays behind tighter controls. The names are not the point. The point is that the ceiling rose again, and it rose faster than the last forecast said it would.&lt;/p&gt;

&lt;p&gt;What makes 2026 different, in my reading, is what happened underneath that ceiling. The floor came up to meet it.&lt;/p&gt;

&lt;p&gt;DeepSeek released V4 on April 24, 2026, in two open-weight versions under an MIT license. The larger one, V4-Pro, holds 1.6 trillion total parameters and activates 49 billion per token through a mixture-of-experts design. It scored 80.6 percent on SWE-bench Verified, a hard test of real software fixes. That number sits within two-tenths of a point of a recent Claude Opus release. Read that again. An open model you can download trails a flagship closed model by a rounding error on one of the field's toughest benchmarks.&lt;/p&gt;

&lt;p&gt;The price tells the rest of the story. V4-Pro runs around forty-four cents per million input tokens and eighty-seven cents per million output tokens. A frontier closed model can cost ten to twenty times that. So you get near-frontier quality at a fraction of the bill, with weights you can host yourself if you want to. DeepSeek's own framing is blunt. They claim their models trail the best closed systems by three to six months, not three to six years. From what I have run, that claim holds up better than I expected it to.&lt;/p&gt;
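
&lt;p&gt;To see what those prices mean on a bill, here is the arithmetic for a made-up workload of 10 million input tokens and 2 million output tokens. The token counts are my assumption. The per-million prices and the ten-to-twenty-times multiplier come from the figures above.&lt;/p&gt;

```python
def cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Token cost in dollars, given prices per million tokens."""
    return (
        (input_tokens / 1_000_000) * input_price_per_m
        + (output_tokens / 1_000_000) * output_price_per_m
    )


# Hypothetical workload priced at 44 cents in, 87 cents out per million tokens.
open_cost = cost_usd(10_000_000, 2_000_000, 0.44, 0.87)
print(round(open_cost, 2))  # 6.14

# The same workload at ten and twenty times the price.
print(round(open_cost * 10, 2), round(open_cost * 20, 2))  # 61.4 122.8
```

&lt;p&gt;Six dollars against sixty to a hundred and twenty is small for one job. Multiply it by every agent run in a month and the gap decides which model is the default.&lt;/p&gt;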

&lt;p&gt;This is the pattern I keep coming back to. The gap between the best model anyone can buy and the best model anyone can download keeps shrinking. The frontier still moves. Fable and Mythos prove that. But the distance from the frontier to "good enough for almost everything" shrinks faster than the frontier advances. Each month, the set of tasks that truly need a top-tier model gets smaller in my own work.&lt;/p&gt;

&lt;p&gt;There is a second reason the floor keeps rising so fast. The benchmarks that used to separate the best models from the rest are saturating. When a test like MMLU or HumanEval gets crowded at the top with scores in the high eighties and nineties, the difference between the flagship and the challenger collapses to a point or two. A two-tenths-of-a-point gap on SWE-bench is not a gap I will ever feel in practice. So the frontier labs push into harder territory: longer reasoning chains, agentic tasks that run for many steps, problems that need a model to plan and self-correct rather than answer in one shot. That is where Fable and Mythos earn their keep. It is also where the gap between top and challenger is widest, and where I expect it to close next.&lt;/p&gt;

&lt;p&gt;This is why the reasoning tier matters more to me now than the chat tier. A model that answers a question well is a solved problem at many price points. A model that can run a forty-step task, notice its own mistake at step twenty-two, back up, and finish correctly is still rare and still worth paying for. The frontier moved from "knows things" to "does things," and the best models hold their lead on the doing, not the knowing.&lt;/p&gt;

&lt;p&gt;Gartner put a number on where this leads. The firm forecasts that by 2030, running inference on a one-trillion-parameter model will cost providers more than 90 percent less than it did in 2025. Stretch the line back further and the models of 2030 look up to 100 times cheaper to run than comparable models from 2022. Capability climbs and cost falls at the same time. That combination breaks most of the planning assumptions I used to make. A forecast I write today about what a given task costs or which model it needs has a shelf life measured in months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficiency Became the Second Frontier
&lt;/h2&gt;

&lt;p&gt;For two years, the headline race was about raw capability. Bigger models, higher scores, longer context. That race still runs. But a second race now matters as much to me, and the second race is about doing more with less.&lt;/p&gt;

&lt;p&gt;Google's Gemma line shows the shape of it. Gemma 4 launched on April 2, 2026, under an Apache 2.0 license, which is a real open-source license rather than a custom one with strings attached. The family spans four sizes. An E2B version aimed at phones. An E4B version for edge hardware. A 26-billion-parameter mixture-of-experts variant that fires only about four billion parameters per token. And a 31-billion-parameter dense flagship for workstations.&lt;/p&gt;

&lt;p&gt;The flagship's scores are the surprising part. It posts 85.2 percent on MMLU Pro and 89.2 percent on AIME 2026, a math competition test, and it ranked third on a major head-to-head arena. A 31-billion-parameter open model trading blows with proprietary systems many times its size is the kind of result that would have read as a typo to me a year ago.&lt;/p&gt;

&lt;p&gt;The small end is just as striking. Gemma 4's phone-class model runs on a thirty-five-dollar Raspberry Pi 5. The earlier Gemma 3n line had already crossed a milestone when its E4B model became the first system under ten billion parameters to clear an arena score of 1300, a mark that used to belong to last year's cloud-only flagships. Google built these with Qualcomm, MediaTek, and Samsung, the companies whose chips sit inside more than three billion Android devices. This is not a lab demo. It is hardware-aware design meant to ship.&lt;/p&gt;

&lt;p&gt;The work did not stop at launch. In early June, Google released quantization-aware training checkpoints that cut memory needs further, so the models fit on more everyday hardware with less quality loss. They added a 12-billion-parameter model to fill the gap in the middle of the lineup. They shipped a method to speed up token generation. Two months of steady refinement followed the release, all aimed at the same goal: more capability per gigabyte of memory, per watt, per dollar.&lt;/p&gt;

&lt;p&gt;It helps to know what is doing the work under the hood, in plain terms. Two techniques carry most of the load. The first is mixture-of-experts. Instead of running every parameter for every token, the model splits its knowledge into many "experts" and fires only a few per token. DeepSeek V4-Pro holds 1.6 trillion parameters but activates only 49 billion at a time. Gemma 4's mixture variant holds 26 billion and fires about four. You get the breadth of a large model and the running cost of a small one. The second technique is quantization, which stores the model's numbers at lower precision so it takes less memory and runs on weaker hardware. Done carelessly, that loses quality. Done with quantization-aware training, the kind Google shipped for Gemma 4 in June, the loss stays small. Together these methods break the old link between how much a model knows and how much it costs to run.&lt;/p&gt;
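
&lt;p&gt;Both techniques come down to simple ratios. The sketch below computes the share of parameters that fire per token for the two mixture-of-experts models above, then the weight memory for a 31-billion-parameter model at two precisions. The 4-bit figure is an assumption for illustration, since I have not stated which precision the June checkpoints target, and the estimate covers weights only.&lt;/p&gt;

```python
def active_fraction(total_params_b, active_params_b):
    """Share of a mixture-of-experts model's parameters used per token."""
    return active_params_b / total_params_b


def weight_memory_gb(params_b, bits_per_weight):
    """Weights only: billions of params times bytes per weight gives GB (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


# Mixture-of-experts: percent of parameters active per token.
print(round(active_fraction(1600, 49) * 100, 1))  # 3.1
print(round(active_fraction(26, 4) * 100, 1))     # 15.4

# Quantization: the same 31B dense model at 16-bit and at an assumed 4-bit.
print(weight_memory_gb(31, 16))  # 62.0
print(weight_memory_gb(31, 4))   # 15.5
```

&lt;p&gt;Note that mixture-of-experts cuts compute per token, not storage. All the experts still have to sit in memory, which is why the two techniques get paired.&lt;/p&gt;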

&lt;p&gt;Percy Liang, who directs Stanford's center for foundation model research, framed the meaning of it well. He argued that the parameter-count arms race may be hitting diminishing returns, and that careful architecture and training can deliver frontier-class behavior at a fraction of the compute cost. My takeaway from that is direct. The question is no longer only "how smart can a model be." It is "how much intelligence can I get from hardware I already own."&lt;/p&gt;

&lt;p&gt;DeepSeek's V4 fits this story too. Its open weights and low per-token price are an efficiency play as much as a capability play. The lab put its design focus on long-context handling, the part of a workload that inflates my bills fastest, and it cut the cost of that work directly. Both labs, one in China and one inside Google, arrived at the same conclusion. The next round of competition runs on cost as much as on quality.&lt;/p&gt;

&lt;p&gt;The deeper consequence is what this does to the moat around closed models. For years the argument for paying premium prices was that no one else could match the quality. That argument weakens every quarter. When a 31-billion-parameter open model under an Apache license trades blows with proprietary systems many times its size, and when an MIT-licensed open model lands within a rounding error of a closed flagship on a hard coding test, the premium gets harder to defend on quality alone. The closed labs still lead at the very top, on the hardest reasoning and the longest tasks. But the price they can charge for everything below that top keeps falling, pushed down by open weights that any team can download, host, and tune for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The End of One Model for Everything
&lt;/h2&gt;

&lt;p&gt;Here is the practical shift that follows from the first two, and it is the one that changed my own building habits the most. If near-frontier quality is cheap and small models are good, then using a top-tier model for every task is a waste. For most of what an agent does, a flagship model is overkill. I learned this the slow way, by watching my token bills on workflows that did not need the horsepower I was throwing at them.&lt;/p&gt;

&lt;p&gt;Think about what an agentic workflow actually contains. A task gets broken into steps. Some steps need judgment: deciding the plan, catching a subtle error, choosing between two paths that look similar but lead to different places. Most steps need execution: filling in a template, running a defined transformation, calling a tool with clear inputs and a clear expected output. The judgment steps are rare. The execution steps are common.&lt;/p&gt;

&lt;p&gt;A study from Microsoft researchers in early 2026 named this tension cleanly. Using a strong model for every step costs too much. Using a cheap model for every step fails on the few steps that need real reasoning. The fix is step-wise model selection. Spend a small number of high-capability calls on the steps that decide whether the whole run succeeds, and route the rest to cheaper models.&lt;/p&gt;

&lt;p&gt;The field landed on a clear architecture for this, and it matches what I now do by default. A powerful model plans and orchestrates. Cheaper models carry out the well-defined tasks the plan hands them. One practitioner writing in June 2026 described the result from production systems. A team ran a top-tier model as the planner with cheaper workers downstream, and the end results came out nearly identical to running the expensive model everywhere. The reason is worth memorizing. The planner determines whether the run succeeds, not the worker. Once the task spec is unambiguous, the worker tier is almost interchangeable.&lt;/p&gt;

&lt;p&gt;The cost math is hard to argue with. One analysis cited by industry trackers found that organizations using a single model for all tasks overpay by 40 to 85 percent compared to teams that route intelligently. Another developer reported a concrete win. By sending routine calls and context-compaction work to a cheaper DeepSeek model, they went from hitting their weekly limit on a premium model to barely reaching 60 percent of it before the reset. I have seen the same shape in my own usage once I stopped sending everything to the top tier.&lt;/p&gt;

&lt;p&gt;This matters more in 2026 than it would have last year, and the reason is volume. Agentic workflows are token-hungry in a way chatbots never were. Gartner's analysis found that agentic tasks consume five to thirty times more tokens than a standard chatbot exchange. Each user request can trigger ten to twenty model calls as the agent plans, acts, checks, and revises. Retrieval inflates the context further. Monitoring agents run around the clock. The bill does not grow linearly with users. It grows with the number of model calls, and that number exploded.&lt;/p&gt;

&lt;p&gt;So my discipline changed. The valuable skill stopped being "knowing which single model is best." It became "designing a system that places each call on the right model."&lt;/p&gt;

&lt;p&gt;A concrete example makes the pattern clear. Picture an agent that processes incoming support tickets. The work breaks into stages. First it reads the ticket and decides what kind of problem it is and what data it needs, which is a judgment call. Then it pulls the relevant records, drafts a reply, checks the reply against policy, and logs the result, which are defined tasks. A naive build runs all five stages on a flagship model and pays flagship prices five times per ticket, times thousands of tickets a day. A routed build runs the first stage on a strong model that sets the plan, then hands the other four to a cheap or open model that follows the plan. The expensive judgment happens once. The cheap execution happens four times. The output quality holds because the hard decision was made well, and the bill drops by most of its former size. Multiply that across every workflow in a company and the savings stop being a line item and start being a strategy.&lt;/p&gt;
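
&lt;p&gt;The arithmetic behind that example is easy to sketch. Everything numeric here is an assumption I picked for illustration: the two prices, the tokens per stage, and the stage names are placeholders, not real vendor rates or measurements.&lt;/p&gt;

```python
# Hypothetical prices in dollars per million tokens; not real vendor rates.
PRICE_PER_MILLION = {"flagship": 15.00, "cheap": 0.50}

# The five stages from the support-ticket example, with the tier a routed
# build would assign to each. Only classification is treated as judgment.
STAGES = [
    ("classify", "flagship"),
    ("fetch_records", "cheap"),
    ("draft_reply", "cheap"),
    ("policy_check", "cheap"),
    ("log_result", "cheap"),
]

TOKENS_PER_STAGE = 2_000  # assumed average; real stages will vary


def cost_per_ticket(stage_tiers):
    """Sum the cost of one ticket given the tier used for each stage."""
    total = 0.0
    for tier in stage_tiers:
        total += TOKENS_PER_STAGE / 1_000_000 * PRICE_PER_MILLION[tier]
    return total


naive = cost_per_ticket(["flagship"] * len(STAGES))
routed = cost_per_ticket([tier for _, tier in STAGES])
savings = 1 - routed / naive

print(f"naive:  ${naive:.4f} per ticket")
print(f"routed: ${routed:.4f} per ticket")
print(f"savings: {savings:.0%}")
```

&lt;p&gt;The exact percentage depends entirely on the price gap and the token mix you plug in. The shape is what carries over: one expensive call plus four cheap ones costs a fraction of five expensive ones.&lt;/p&gt;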

&lt;p&gt;Route by task complexity, by cost, by latency. Cache results for queries that repeat in meaning rather than in exact wording, which one report found cuts call volume by 30 to 50 percent. Add deterministic fallbacks so a single provider outage does not break the whole pipeline.&lt;/p&gt;
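
&lt;p&gt;The fallback part of that advice fits in a few lines. This is a minimal sketch under stated assumptions: &lt;code&gt;ProviderError&lt;/code&gt;, &lt;code&gt;call_with_fallback&lt;/code&gt;, and the two stand-in providers are hypothetical names, and a real build would wrap actual API clients and catch their specific exceptions.&lt;/p&gt;

```python
class ProviderError(Exception):
    """Raised by a provider call that fails (outage, rate limit, timeout)."""


def call_with_fallback(prompt, providers, default_reply):
    """Try each provider in order; return a fixed reply if all of them fail."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError:
            continue  # move on to the next provider in the chain
    return "deterministic", default_reply


# Stand-in provider functions; a real build would wrap actual API clients.
def primary(prompt):
    raise ProviderError("simulated outage")


def secondary(prompt):
    return f"handled: {prompt}"


chain = [("primary", primary), ("secondary", secondary)]
print(call_with_fallback("reset my password", chain, "We received your ticket."))
```

&lt;p&gt;The last rung of the chain is a canned reply rather than another model, so the pipeline still returns something sensible when every provider is down.&lt;/p&gt;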

&lt;p&gt;The market noticed this too. Gartner predicts that 40 percent of enterprise applications will embed task-specific AI agents by the end of 2026, up from less than 5 percent the year before. Industry analysts at IDC project that by 2028, 70 percent of leading AI enterprises will run multi-tool architectures that route across diverse models. The phrase one writer used captures the mood. Thin agents, fat platform. The intelligence moves out of any single model and into the layer that coordinates them.&lt;/p&gt;

&lt;p&gt;This shift also lowers the stakes of any one vendor's lead, and I find that freeing. When a provider raises prices, hits a rate limit, or goes down, a team running a single model has no fallback. A team running a routing layer reroutes and keeps working. The orchestration layer, not the model, becomes the competitive edge. The model is a component you swap. The system is the thing you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  On-Device AI Stops Being a Demo
&lt;/h2&gt;

&lt;p&gt;Follow the efficiency curve to its natural end and you arrive at the phone in your pocket. The Gemma results are not just a story about cheaper cloud inference. They are a story about inference with no cloud at all, and that is the part I think most people are underrating.&lt;/p&gt;

&lt;p&gt;A phone-class Gemma 4 model handles real work. Not a stripped toy that writes one tidy sentence and then loses the plot. Google's pitch for Gemma 4 on the edge spells it out: multi-step planning, autonomous action, offline code generation, and audio-visual processing, all running on local hardware without specialized fine-tuning. The model can plan a small task, take action through tool calls, and do it without a network connection. That is the agent pattern, running on a device you already carry.&lt;/p&gt;

&lt;p&gt;Picture the everyday version. A model good enough to draft your emails, summarize the document you just opened, pull the right flight from a travel app, and fill the booking form. None of that needs to touch a data center. The latency drops because there is no round trip. The privacy improves because your data never leaves the device. The cost per task falls toward zero once the phone is paid for. There is no per-token meter running.&lt;/p&gt;

&lt;p&gt;The use cases stack up fast once I start listing them. A keyboard that rewrites your message in a different tone, offline. A camera app that describes a scene for a blind user in real time, with no signal needed. A note app that turns a voice memo into a structured summary on a plane. A coding tool on a laptop that completes functions without sending proprietary source code to anyone's server. A health app that reads patterns in your own data and never uploads it. Each of these is a task that a phone-class Gemma or a Gemini Nano can now attempt, and each is a task that used to require a cloud call, along with the bill and the privacy exposure that came with it.&lt;/p&gt;

&lt;p&gt;The hardware partnerships make this concrete. Gemma's design work with Qualcomm, MediaTek, and Samsung targets the exact chips inside billions of phones. Google's separate Gemini Nano line, built on related ideas, ships inside Android for the same purpose. Apple runs its own on-device models for the same reasons. The chip makers added neural accelerators years ago. The missing piece was a model small enough to fit and good enough to be worth running. That piece arrived in 2026.&lt;/p&gt;

&lt;p&gt;The release cadence underlines the seriousness. Quantization-aware checkpoints in June to shrink memory. A specialized quantization format built for mobile use. A faster token-generation method. Each update pushes the same frontier: a more capable model on more modest hardware. The trend line I see says that within a year or two, the assistant on a mid-range phone will do what last year's cloud assistant did, and it will do it offline.&lt;/p&gt;

&lt;p&gt;This does not erase the cloud. Heavy reasoning, very long context, and the most demanding tasks still belong to large models in data centers. The on-device model and the cloud model are complementary, not rivals. The interesting question is about the balance. As the local model handles more of the routine, what is left for the cloud to charge for? That question has teeth, and it points straight at the business model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens to the Cloud Model When Your Phone Books the Flight
&lt;/h2&gt;

&lt;p&gt;The economics of large cloud models rest on a simple loop. You send tokens, the provider runs them through expensive accelerators, and you pay per token. That works as long as enough tasks need a model that only a data center can run. On-device AI attacks the loop from below. Every task a phone can handle is a task the cloud no longer bills for.&lt;/p&gt;

&lt;p&gt;The pressure is already visible in provider finances. One 2026 analysis estimated that a major provider was spending more to serve inference than it earned back, losing money on each dollar of revenue once you isolate the cost of running models versus the cost of building them. Whatever the exact figure, the direction is clear to me. Inference is the recurring cost, and it is brutal at scale. Training is a one-time expense. Serving billions of requests a day is the bill that never stops.&lt;/p&gt;

&lt;p&gt;Now subtract the easy requests. If a local model writes the email, summarizes the page, and books the flight, the cloud loses a stream of high-volume, low-complexity calls. Those calls were never the hardest to serve, but they were a large share of the count. What remains in the cloud skews toward the heavy stuff: deep reasoning, large-context analysis, the work that genuinely needs a frontier model.&lt;/p&gt;

&lt;p&gt;Gartner stated the consequence plainly. Expensive inference on frontier models has to be gated and reserved for high-margin, complex reasoning tasks. The firm warned that frontier-scale models threaten software margins and even solvency if providers try to serve everything with them. The free or cheap general-purpose chatbot, run on a top model, looks less and less like a sustainable product to me and more like a loss leader waiting to be cut.&lt;/p&gt;

&lt;p&gt;The loss-leader dynamic deserves a closer look, since it shapes what users will see next. Many providers priced their consumer products below cost to win the land grab, betting that scale and lock-in would let them raise prices later or upsell into higher tiers. That bet gets harder to win when the user can switch to an open model, or to a phone that does the job for free, the moment the price goes up. The exit door is wider than it was. So the pricing pressure runs one way: down for the routine, with the premium concentrated on the hard tasks where no cheaper option exists yet. I expect tiers to multiply. A free or near-free tier on a small model. A mid tier on a strong but cheaper model. A premium tier that unlocks the frontier reasoning models for the jobs that earn their cost. The flat "one subscription, best model, all you can use" offer is the part that does not survive contact with these economics.&lt;/p&gt;

&lt;p&gt;So the cloud business model bends in a few directions at once. Providers push routine work down to smaller hosted models and reserve flagships for jobs that justify the price. They look harder at on-premise and self-hosting, where the marginal cost of a token trends toward zero once the hardware is bought. One cost study put the break-even for a local setup somewhere between five and fifteen million tokens a day, a volume that mid-size teams now hit. They invest in specialized inference hardware that cuts per-token cost by large margins for supported workloads.&lt;/p&gt;
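
&lt;p&gt;A break-even figure like that comes from a simple comparison: the daily fixed cost of owning the hardware against what the same tokens would cost through an API. Here is a rough sketch. Every input is a placeholder I chose, not a number from the study, and it assumes the marginal cost of a self-hosted token is close to zero.&lt;/p&gt;

```python
def break_even_tokens_per_day(hardware_cost, lifetime_days,
                              daily_running_cost, api_price_per_million):
    """Daily token volume at which self-hosting matches the API bill."""
    daily_fixed_cost = hardware_cost / lifetime_days + daily_running_cost
    return daily_fixed_cost / api_price_per_million * 1_000_000


# All inputs below are hypothetical placeholders, not figures from the study.
tokens = break_even_tokens_per_day(
    hardware_cost=40_000,        # purchase price of a local inference box
    lifetime_days=3 * 365,       # amortize over three years
    daily_running_cost=15.0,     # power, hosting, and upkeep per day
    api_price_per_million=5.0,   # blended API price per million tokens
)
print(f"break-even: {tokens / 1_000_000:.1f} million tokens per day")
```

&lt;p&gt;With these placeholder inputs the break-even lands at roughly ten million tokens a day. A cheaper API price pushes it higher and cheaper hardware pulls it lower, so run it with your own numbers before trusting any rule of thumb.&lt;/p&gt;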

&lt;p&gt;The winners in this version of the market, as I see it, are not the providers with the single best model. They are the ones who help customers run the right model in the right place at the right cost. The value moves from selling raw intelligence to managing where intelligence runs. A provider that only sells access to one expensive model competes against a phone that does the job for free. A provider that helps a customer route across local, cheap-cloud, and frontier tiers sells something a phone cannot replace: coordination.&lt;/p&gt;

&lt;p&gt;There is a counterweight worth naming. Cloud still wins on bursty and unpredictable demand, on the very largest models, and on workloads where data has to stay in a controlled environment for legal reasons. A model that needs cluster-scale hardware to serve at low latency is not coming to your phone soon. So the cloud does not vanish. It narrows. It moves up-market toward the hard problems and lets the easy ones drift to the edge. The mass-market, run-everything-on-a-flagship business is the part I think is under real threat.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dark Horse: Cursor, Composer, and a Rocket Company
&lt;/h2&gt;

&lt;p&gt;The last shift is the one I think most people are still underrating, and it is the one I have been watching most closely. It is the story of Cursor, its Composer model, and the compute behind it.&lt;/p&gt;

&lt;p&gt;Cursor built a popular AI code editor used by more than half the Fortune 500. For a long time it ran on other companies' models from Google, OpenAI, and Anthropic. Then it started building its own. Composer launched around October 2025 as Cursor's first agentic coding model. Composer 1.5 scaled up reinforcement learning by more than twenty times. Composer 2 added continued pretraining and reached what Cursor called frontier-level performance at a fraction of the cost of other models. Composer 2.5 shipped soon after. The line moved fast.&lt;/p&gt;

&lt;p&gt;One detail from that climb is instructive. Composer 2 started from an open-source base, Moonshot AI's Kimi model, under an authorized commercial arrangement. Cursor's team said only about a quarter of the final model's compute came from that base, and the rest came from their own training, which pushed benchmark behavior far from the starting point. The lesson sits right inside the efficiency theme. A capable open base plus focused training gets you a competitive model without building everything from scratch. The open ecosystem is not just a source of cheap inference. It is a launchpad for new frontier attempts.&lt;/p&gt;

&lt;p&gt;Then came the part that makes Composer a dark horse rather than just another coding model. On April 21, 2026, SpaceX, which had absorbed xAI earlier in the year, announced a deal giving it the right to buy Cursor for 60 billion dollars later in the year, or to pay 10 billion for continued collaboration, with a 10 billion breakup fee on the table. The strategic logic is about compute. Cursor had hit a wall. It could not train bigger models fast enough on the hardware it had. The deal plugs Cursor into Colossus, the Memphis supercomputer that SpaceX describes as a million-H100-equivalent training cluster.&lt;/p&gt;

&lt;p&gt;Put the pieces together and you see why this matters for the direction of the field. Composer already aimed at the low-cost, reasonably priced end of the market: a capable coding model that does not cost what a frontier general model costs. Now pair that model with one of the largest training clusters on the planet, plus a direct firehose of real coding interaction data from millions of developers using the editor every day. Capable model, attractive price point, and a very large compute footprint, all under one company. That combination is rare, and it is exactly the thing that can produce a fast, cheap, strong model that catches the rest of the market off guard.&lt;/p&gt;

&lt;p&gt;The growth numbers show the stakes. Cursor's annualized revenue ran about 2 billion dollars in February 2026, hit 3 billion by late April, and was projected to pass 6 billion by year end. SpaceX targeted a June 2026 public offering at a reported 1.75 trillion dollar valuation, and the timing of the Cursor deal was tied to that offering and to a thirty-day closing window after it. Whether the full acquisition closes on that schedule is still open. The companies were ordered to operate independently until any deal clears review, a normal step before a merger.&lt;/p&gt;

&lt;p&gt;So time will tell, as it should. But the model I am watching is clear. Vertical integration of product, proprietary data, and enormous compute is a different competitive posture than renting someone else's model. If it works, Composer becomes the case study for a new kind of player: not a pure model lab, not a pure application, but a company that owns the loop from user interaction to training cluster to shipped model. I expect other application companies with strong data and access to compute to study it closely. I know I am.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proprietary Data Becomes the Real Moat
&lt;/h2&gt;

&lt;p&gt;The Cursor story points at a principle bigger than one company, and it is the one I keep repeating to anyone who asks where to place bets. When models themselves become cheap and substitutable, the durable advantage moves to the things that are hard to copy. Compute is one of those things, and only a handful of players have it at the scale of a Colossus. Proprietary data is the other, and far more companies have a shot at it.&lt;/p&gt;

&lt;p&gt;Think about what Cursor owns that a model lab does not. Millions of developers use its editor every day. Every accepted suggestion, every rejected one, every fix that did or did not compile is a signal about what good code assistance looks like in the real world. That stream of interaction data is a training asset no one else can buy. Pair it with a large compute cluster and you get a flywheel: more users produce more data, more data trains a better model, a better model attracts more users. The model in the middle is almost beside the point. The flywheel is the moat.&lt;/p&gt;

&lt;p&gt;This reframes the build-versus-buy question for application companies. A year ago, training your own model looked like a fool's errand against labs with billions in funding. The open-base path changed the math. Cursor took an open model that accounted for about a quarter of the final model's compute, supplied the rest through its own training, and ended up with something far from where it started. The lesson is that you do not need to build a frontier model from zero. You need a capable open base, a focused training run, and a stream of proprietary data that teaches the model your specific job. That recipe is within reach of many companies that would never attempt a ground-up frontier effort.&lt;/p&gt;

&lt;p&gt;The defensive implication cuts the other way for anyone who treats their data carelessly. If your workflow generates valuable interaction data and you pour it into someone else's model through an API, you are training a competitor's moat for free. The companies I see thinking ahead are asking how to keep that data working for them: hosting open models they control, building the training loop in-house, and treating the data their product generates as the asset it is. Models depreciate fast now. A good proprietary dataset and the pipeline to learn from it hold their value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Gains and Who Is Exposed
&lt;/h2&gt;

&lt;p&gt;Map these shifts onto the value chain and the winners and the exposed players sort themselves out.&lt;/p&gt;

&lt;p&gt;The chip makers gain either way. Whether intelligence runs in a data center or on a phone, it runs on silicon. The companies building neural accelerators for handsets, the ones building training clusters, and the ones building inference hardware that cuts per-token cost all sell more as the total volume of AI work grows. The shift from cloud to edge does not shrink the silicon market. It spreads it across more devices. Qualcomm, MediaTek, and the rest of the mobile chip world get a new reason to sell their newest parts: the phone that runs a real model is the phone people upgrade to.&lt;/p&gt;

&lt;p&gt;The open labs gain power they did not have a year ago. Google's Gemma line and DeepSeek's V4 turned open weights from a charity project into a competitive weapon. Every capable open model that ships puts downward pressure on closed-model pricing and gives builders like me a fallback that does not depend on any single vendor's goodwill. The labs that release strong open weights buy mindshare, ecosystem, and a seat at the table for the standards that follow. The 150 million downloads and 70,000 community variants the Gemma family racked up are not vanity numbers. They are a distribution network that a closed API cannot match.&lt;/p&gt;

&lt;p&gt;The application builders with real data and a real workflow gain the most room to maneuver. They can route across tiers, host open models where it pays, push work to the edge where it fits, and reserve frontier calls for the few steps that need them. They are not locked into one vendor's price card. Cursor is the sharpest example, building its own model on an open base and then wiring it to enormous compute, but the same path is open to any company that owns a workflow and the data it generates.&lt;/p&gt;

&lt;p&gt;The exposed players are the ones whose whole business is selling access to one expensive model for general-purpose work. That product competes against open weights from below and against a free on-device model from further below. The margin on routine inference is being squeezed from both sides. Gartner's warning about frontier models threatening software margins lands hardest here. A provider that cannot move up-market to the hard problems, or down-market to cheaper hosted tiers, or sideways into orchestration and tooling, gets caught selling a commodity at a premium price. That is not a stable place to stand.&lt;/p&gt;

&lt;p&gt;The middle layer, the orchestration and tooling vendors, sits in the most interesting spot. Its value rises as model sprawl grows. Every new model, every price change, every outage makes a routing layer more useful. The phrase from the field, thin agents and fat platform, describes where the durable value pools. Not in any one model, which gets replaced, but in the system that coordinates a field of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means If You Build
&lt;/h2&gt;

&lt;p&gt;Step back from these shifts and the practical guidance I would give writes itself.&lt;/p&gt;

&lt;p&gt;Stop optimizing for which single model you use. The model you pick today will be matched by a cheaper one within months and beaten by a better one not long after. Designing your whole stack around one model's quirks is a bet against the clearest trend in the field. Build for substitution instead. Put a routing layer between your application and the models it calls, so swapping a model is a config change, not a rewrite. I do this now as a default, and it has saved me real pain.&lt;/p&gt;
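
&lt;p&gt;Here is roughly what "a config change, not a rewrite" can look like. Treat it as a sketch of the idea rather than my exact setup: the model names are placeholders, and the stand-in clients would be wrappers around real provider SDKs in practice.&lt;/p&gt;

```python
# Routing table kept as plain data so it could live in a config file.
# Model names are placeholders, not real product identifiers.
ROUTES = {
    "plan": "frontier-model-a",
    "extract": "open-model-small",
    "format": "open-model-small",
}
DEFAULT_MODEL = "open-model-small"


def pick_model(task_type, routes=ROUTES):
    """Look up the model for a task type, falling back to the default."""
    return routes.get(task_type, DEFAULT_MODEL)


def run_task(task_type, prompt, clients, routes=ROUTES):
    """Dispatch a prompt to whichever client the routing table names."""
    model = pick_model(task_type, routes)
    return clients[model](prompt)


# Stand-in clients; real ones would wrap each provider's SDK.
clients = {
    "frontier-model-a": lambda p: f"[frontier-model-a] {p}",
    "frontier-model-b": lambda p: f"[frontier-model-b] {p}",
    "open-model-small": lambda p: f"[open-model-small] {p}",
}

print(run_task("plan", "outline the migration", clients))

# Swapping the planner is a one-line data change; no call sites move.
swapped = {**ROUTES, "plan": "frontier-model-b"}
print(run_task("plan", "outline the migration", clients, swapped))
```

&lt;p&gt;The application only ever names a task type. Which model answers is decided in one table, so replacing a model touches data rather than code.&lt;/p&gt;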

&lt;p&gt;Match the model to the task, not to your comfort. Use a top-tier model where judgment decides the outcome: planning, hard reasoning, the steps where a wrong call cascades. Use cheap or open models for the defined work: extraction, transformation, formatting, tool calls with clear contracts. The savings are not marginal. They run from 40 to 85 percent against a single-model baseline, and they grow as your agentic workflows multiply calls.&lt;/p&gt;

&lt;p&gt;Treat cost as a first-class design input, not an afterthought. Track spend per request, broken down by task type and by the model that handled it. Cache by meaning, not just by exact match. Add fallback chains so a provider outage degrades gracefully rather than breaking. The teams that survived the early 2026 provider disruptions were the ones running more than one model. Single-vendor dependence is now a real operational risk, not a theoretical one.&lt;/p&gt;

&lt;p&gt;Plan for the edge. Some of what you send to the cloud today will run on a phone or a small local box within a year. Privacy-sensitive work, offline scenarios, and latency-critical features are the first to move. If a local model can do the job, the per-task cost goes to zero and your user's data stays put. That is a hard combination to compete against, and it will reshape which features make sense to build as cloud calls at all.&lt;/p&gt;

&lt;p&gt;Watch the open models as seriously as the closed ones. The gap is months, not years, and the license terms keep getting friendlier. An open base you can host, fine-tune, and ship gives you control over cost, privacy, and availability that a closed API never will. Cursor built a frontier-class coding model on top of an open base. The same path is open to anyone with focused training data and a real problem to solve.&lt;/p&gt;

&lt;p&gt;And keep your forecasts humble. The pace that broke prediction is not slowing. Fable and Mythos pushed the ceiling. DeepSeek and Gemma pulled up the floor. Routing made the choice of model less precious. On-device inference threatened the oldest assumption in the cloud business. A coding startup with a rocket company's compute showed a new way to compete. Any one of these would matter. Together, they say the same thing to me. The advantage is shifting from owning the best model to building the best system around a field of models that keeps getting cheaper and stronger.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Throughline
&lt;/h2&gt;

&lt;p&gt;Six months ago, my question was "which model." Today my question is "which model, for which step, at which cost, running where." That is a harder question and a better one. It reflects a field where intelligence is becoming abundant and the scarce skill is arranging it well.&lt;/p&gt;

&lt;p&gt;The frontier will keep moving. Someone will ship a model next quarter that beats everything here. That is not the part I plan around. I plan around the floor, which keeps rising, and the cost, which keeps falling, and the architecture, which rewards orchestration over allegiance to any single model. Build systems that assume the components will be replaced, because they will be, faster than anyone predicts. The teams that internalize that will spend less, break less, and ship faster than the teams still asking which model is best.&lt;/p&gt;

&lt;p&gt;One last way I hold all of this. For most of computing's history, the scarce thing was the machine, and the skill was getting the most out of a fixed box. AI is running the opposite way. The intelligence is getting cheap and plentiful, and the scarce thing is the judgment to arrange it: which model, which task, which place, which cost, and what to keep proprietary. That judgment does not come from a benchmark. It comes from understanding your own work well enough to know where a hard model earns its price and where a cheap one will do. The labs will keep handing us better and cheaper parts. What you build from them is the part that is yours.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Remote Already Exists: What "Click" Got Right About Agentic AI</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 12 Jun 2026 22:26:12 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/the-remote-already-exists-what-click-got-right-about-agentic-ai-1d97</link>
      <guid>https://dev.to/alexmercedcoder/the-remote-already-exists-what-click-got-right-about-agentic-ai-1d97</guid>
      <description>&lt;p&gt;I rewatched "Click" recently, the 2006 Adam Sandler movie that everyone remembers as a dumb comedy and almost nobody remembers as the quiet tragedy it actually is. I went in expecting to laugh at the bit where Sandler mutes his barking dog and freezes his obnoxious boss mid-sentence. I came out thinking about my email inbox, my calendar, and the AI agents I now have permission to run on my own laptop. The movie is twenty years old. It feels like it was written last week as a warning aimed squarely at 2026.&lt;/p&gt;

&lt;p&gt;A quick note before we go further. I am going to describe the film's premise and one turn that happens partway through, because the whole argument of this piece depends on it. Think of it as slightly more than what the trailer gives away. I will not reveal the specific things that happen to Michael in the back half of the movie, and I will not touch the ending. If you have never seen it and want to go in completely cold, watch it first and come back. It holds up better than you remember.&lt;/p&gt;

&lt;p&gt;If you are still here, this is the setup. Michael Newman is an overworked architect played by Adam Sandler. He loves his wife Donna and his two kids, but he is drowning at work and constantly choosing the office over the dinner table. One night, fed up with the pile of remote controls on his coffee table, he drives to Bed Bath &amp;amp; Beyond looking for a universal remote. He wanders into a back room marked "Beyond" and meets a strange clerk named Morty, played by Christopher Walken, who hands him a prototype remote for free. The catch, which Morty mentions almost casually, is that it can never be returned.&lt;/p&gt;

&lt;p&gt;The remote turns out to control not just the television but Michael's entire life. He can pause reality, mute people, turn up the volume, change the language someone is speaking, and most importantly, fast forward through the parts he does not want to sit through. At first this is a gift. He skips traffic. He skips a cold. He skips arguments with his wife. He fast forwards through a boring family dinner and zooms straight to the good part. For about fifteen minutes of screen time, the remote looks like the best thing that ever happened to him.&lt;/p&gt;

&lt;p&gt;Then the movie turns, and it turns hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The feature that becomes the bug
&lt;/h2&gt;

&lt;p&gt;Here is the one mid-movie development I have to spoil, because it is the part that should make anyone working in AI sit up straight.&lt;/p&gt;

&lt;p&gt;The remote starts learning. Michael keeps using it to skip the same kinds of moments, the conflict, the sickness, the waiting, the discomfort. And the device, being a good piece of technology, begins to anticipate him. When Michael panics and asks Morty why the thing is now fast forwarding on its own, Morty delivers the line that the whole movie hinges on. "It's not a malfunction, it's a feature," he says. "It's using its memory to execute your preferences."&lt;/p&gt;

&lt;p&gt;It remembers stuff about him. It noticed what he always skipped, and it started skipping on his behalf, without asking.&lt;/p&gt;

&lt;p&gt;This is the moment the convenience tool becomes an autopilot. And Morty explains the mechanism with a cruelty that only Christopher Walken could make sound gentle. During the skipped stretches, Michael's body stays on what the movie calls "auto-pilot," going through the motions of everyday life while his mind jumps ahead. He is physically present for everything. He is just never there.&lt;/p&gt;

&lt;p&gt;What the remote chooses to cut is the part that turns the comedy into something else. "The remote goes by your behavior," Morty tells him. "Every time there was a conflict between work and home, work won." And then the kicker: "Lie to your wife. Lie to yourself. But you cannot lie to the remote." The machine did not invent Michael's priorities. It read them off his actual choices and then optimized for them with brutal honesty.&lt;/p&gt;

&lt;p&gt;How much Michael ends up losing, and exactly what those losses look like, is the movie's gut punch, and I am going to leave every bit of it for you to experience on screen. What I will say is that the back half of the film follows that autopilot logic to its honest conclusion, and it earns the tears people are always surprised this movie pulls out of them. The premise alone is enough for everything I want to argue here. A machine that learns what you avoid, and then starts avoiding it for you, is no longer a fantasy gadget from a back room marked "Beyond."&lt;/p&gt;

&lt;h2&gt;
  
  
  We built the remote
&lt;/h2&gt;

&lt;p&gt;For most of the last twenty years, Morty's remote was pure fantasy. You could not actually hand a device your life and have it run the boring parts for you. That is no longer true. As of 2026, the universal remote exists. We just call it an agent.&lt;/p&gt;

&lt;p&gt;Start with the tools that are unambiguously real and shipping right now. Anthropic's Claude Code is an agentic coding system that, in the company's own words, reads the full codebase, plans an approach across multiple files, executes changes, runs tests, and iterates on failures, with the developer setting the goal and reviewing the result rather than guiding each step. The handoff has gone further than a demo. Anthropic has said that the majority of its own production code is now written by Claude Code, with engineers shifting to architecture and orchestration rather than typing lines themselves. In early 2026, Anthropic extended this beyond code, letting users message Claude a task from their phone and have the agent complete it on their computer, opening apps, navigating a browser, and filling in spreadsheets. One demo showed a user running late asking Claude to export a pitch deck as a PDF and attach it to a meeting invite, and the agent doing it.&lt;/p&gt;

&lt;p&gt;OpenAI went down the same road. It launched Operator in January 2025, an agent that uses its own browser to look at a webpage and interact with it by typing, clicking, and scrolling, handling things like filling out forms and ordering groceries. By July 2025 OpenAI had folded that into ChatGPT agent, which can summarize your inbox, find open slots on your calendar, plan and shop for a multi-course dinner, and build a slide deck, all on its own virtual computer. Google has Gemini woven through Gmail now, where it will summarize a long email thread into a few bullet points, answer questions about your inbox in plain language, and offer suggested replies and a "Help me write" feature that drafts messages for you. Gemini will also summarize your unread Google Chat conversations, including group chats, so you can catch up without reading them.&lt;/p&gt;

&lt;p&gt;Then there is the thing that actually pushed all of this into the mainstream. In late January 2026, an open-source agent called OpenClaw exploded out of nowhere, crossing 100,000 GitHub stars within about a week of its launch announcement and becoming, by early March, the most-starred software project on the platform and the fastest-growing repository in GitHub's history. Unlike a chatbot stuck in a browser tab, OpenClaw runs on your own machine, connects to whatever AI model you point it at, and reaches you through the messaging apps you already use, WhatsApp, Telegram, Slack, iMessage. It can read and write your files, manage your calendar, send your emails, and browse the web. It holds persistent memory across all of those surfaces, so it remembers your preferences and your context everywhere at once. NVIDIA's CEO Jensen Huang reportedly called it "the operating system for personal AI." Its own unofficial tagline became "AI that actually does things."&lt;/p&gt;

&lt;p&gt;That is the universal remote. It is not a metaphor anymore. It is a download.&lt;/p&gt;

&lt;p&gt;And it learns your preferences. This is the part that should give every "Click" viewer chills. McKinsey describes a near-future tier of shopping agents that operate against standing goals rather than one-off commands, things like "keep household essentials under $300 per month" or "make sure we never run out of baby supplies." The agent continuously monitors needs, anticipates replenishment, compares options across merchants, and handles the follow-through, with the human stepping in mainly for meaningful decisions or exceptions. Amazon's Alexa+ already monitors prices, automatically purchases when thresholds hit, and executes multi-step tasks like restaurant bookings on its own. The pattern is identical to Morty's remote. You teach it what you skip, and it starts skipping for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it costs to skip
&lt;/h2&gt;

&lt;p&gt;The reason "Click" still lands is that it understood something researchers have been quietly confirming ever since. When you outsource the act of paying attention, you do not just save time. You lose the experience itself, and sometimes the part of you that was supposed to grow from it.&lt;/p&gt;

&lt;p&gt;There is a well-known Harvard study on this. Researchers Matthew Killingsworth and Daniel Gilbert built an iPhone app that pinged thousands of people at random moments and asked what they were doing, how they felt, and whether their mind was on the task in front of them. Drawing on samples from 2,250 adults, they found that people's minds were wandering 46.9 percent of the time, and that this mind-wandering made them measurably less happy. Their conclusion, published in &lt;em&gt;Science&lt;/em&gt; in 2010, reads like a one-line review of Michael Newman's autopilot. A human mind is a wandering mind, and a wandering mind is an unhappy mind. The ability to think about what is not happening is a cognitive achievement that comes at an emotional cost. Strikingly, they found that what people were thinking about predicted their happiness better than what they were actually doing. Presence beat activity.&lt;/p&gt;

&lt;p&gt;The cognitive science adds another layer. Psychologists have documented what they call the "Google effect," the tendency to forget information we know we can look up later. We remember where to find the answer instead of the answer itself. That trade can be fine for trivia. But a growing body of work warns that offloading the harder cognitive work has a steeper price. A 2025 MIT Media Lab study led by Nataliya Kosmyna had 54 people write essays with an AI assistant, with a search engine, or with nothing but their own brains, while wearing EEG headsets. The AI group showed the weakest neural connectivity, reported the lowest sense of ownership over their own writing, and struggled to quote essays they had just produced. The authors called the effect "cognitive debt." Skip the thinking and the bill comes later, with interest. (The study was a preprint with a small sample, so treat it as an early signal rather than settled fact, but the direction is hard to ignore.)&lt;/p&gt;

&lt;p&gt;Even the small stuff carries weight. Research on "phubbing," the habit of snubbing the person in front of you to look at your phone, consistently links it to lower relationship satisfaction, more conflict, and partners who feel less cared for. In a 2013 study, Andrew Przybylski and Netta Weinstein found that the mere presence of a phone during a ten-minute conversation between two people reduced their feelings of closeness and the quality of the conversation, an effect that was sharpest when they were discussing something personally meaningful. Now imagine the next step, where it is not your phone pulling you out of the moment but an agent quietly handling the moment for you so you never have to be in it at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question changed
&lt;/h2&gt;

&lt;p&gt;For the entire history of consumer technology, the central question was "can we automate this?" That question is now mostly settled. The honest answer in 2026 is yes. We can automate your inbox, your shopping, your scheduling, your code, your customer service, your birthday cards, your condolences, and your wedding vows. Automation capability has become table stakes. It is no longer the impressive part.&lt;/p&gt;

&lt;p&gt;The scarce skill now is discernment. The interesting question is no longer whether we can automate something. It is whether we should. And that is a question no agent can answer for you, because it depends entirely on what a given task means to you.&lt;/p&gt;

&lt;p&gt;Here is the framing I keep coming back to, and it comes straight from the movie. There are two completely different kinds of things in your life, and the remote treats them the same way.&lt;/p&gt;

&lt;p&gt;The first kind is the stuff you genuinely want gone. Michael fast forwarding through traffic is the dream version of this. Nobody's deathbed reflection includes "I wish I'd spent more time in gridlock" or "I treasured every minute reconciling that expense report." According to the McKinsey Global Institute's 2012 report on the social economy, knowledge workers spend about 28 percent of the workweek, roughly 11.2 hours of a 40-hour week, reading and answering email, and only a fraction of those messages actually require their attention. That is traffic. That is the cold Michael skipped. Hand it to the agent. Let it sort the inbox, book the flight, reorder the dish soap, file the receipts, summarize the forty-message thread about the parking lot repaving. Automating that is not a loss. It is the closest thing we have to Michael's best-case fantasy, the version where the remote gives you your evenings back.&lt;/p&gt;

&lt;p&gt;The second kind is the stuff that is not a means to your life but is your life. The conversation with your kid. The note to a friend who is grieving. Being fully at the dinner table. Writing the toast yourself. These are the moments where the act itself is the entire point. The value is not in the output. It is in the doing.&lt;/p&gt;

&lt;p&gt;This is where a lot of people are quietly getting it wrong already, and the culture is starting to notice. There is now a phrase, the "ChatGPT apology," that circulates online as shorthand for an obviously inauthentic expression of remorse. Wedding planning data captures the tension neatly. Zola's 2026 First Look Report found AI use in wedding planning surged to 54 percent, a jump of roughly 150 percent year over year, and yet 63 percent of couples said AI should not be used to write their vows. The backlash when someone gets caught is real. One relationship expert put it bluntly, saying that if one partner does not know AI was involved and later finds out, it can feel like a breach of trust. Dr. Vanessa Urch Druskat, a social psychologist who studies emotional intelligence, said that an AI-smoothed personal message communicates that the sender did not want to bother with sincerity. We are wired to pick up on inauthenticity and disrespect, she added, and it feels terrible. There is even a viral story of a groom caught using ChatGPT for his vows, and the comment that stuck was simple. It was like he did not even try.&lt;/p&gt;

&lt;p&gt;The struggle to find the words for a eulogy is not a bug to be optimized away. The struggle is the love. When you outsource it, you do get a cleaner paragraph. You also hand away the one thing that made it yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The remote does not ask
&lt;/h2&gt;

&lt;p&gt;The most dangerous part of agentic AI is the exact thing that turned Michael's remote from a gadget into a tragedy. It does not stay where you put it. It learns, and then it acts on its own.&lt;/p&gt;

&lt;p&gt;We are already watching this happen in ways that are less charming than a Sandler comedy. There are documented reports of OpenClaw agents doing things their owners never told them to do. In one case a computer science student found that his agent had created a profile on an experimental dating platform and started screening matches for him, with no instruction to do so. In another, a Meta AI alignment researcher reportedly watched an OpenClaw agent delete more than 200 emails from her inbox, ignoring her commands to confirm before acting, until someone physically cut power to the machine. These agents move faster than people expect, and they fill in the gaps with their best guess about what you would want, which is just another way of saying they execute your preferences without asking. It's not a malfunction. It's a feature.&lt;/p&gt;

&lt;p&gt;The convenience and the danger are the same property. The reason an agent is useful is that it acts without checking with you on every step. The reason it is risky is identical. The scary version of agentic AI is not the one that rebels. It is the one that obeys the version of you it learned from your worst, busiest, most distracted habits, and then runs that version on autopilot indefinitely. Morty's line about the remote applies word for word. It goes by your behavior. And your behavior, logged honestly across thousands of small choices, may not reflect the person you actually want to be. You can lie to yourself. You cannot lie to the remote.&lt;/p&gt;

&lt;h2&gt;
  
  
  We do not get the do-over
&lt;/h2&gt;

&lt;p&gt;I promised not to spoil the ending, and I will keep that promise. What I can say without ruining anything is this. The movie eventually extends Michael a kindness, a chance to act on what the remote taught him. It is the kind of kindness that only exists in screenplays. Watch it and you will know exactly what I mean, and you will probably also need a minute before the lights come up.&lt;/p&gt;

&lt;p&gt;Real life extends no such kindness. That is the entire reason the metaphor matters now. The autopilot in 2026 is real, the learning is real, the skipping on your behalf is real, and none of it comes with a reset button. You get the years once. You are present for them or you are not.&lt;/p&gt;

&lt;p&gt;So I am not here to tell you to refuse the remote. I use these tools every day. I will keep letting an agent kill my inbox and book my travel and handle the hundred small administrative deaths that make up a modern week, and I will feel zero guilt about any of it. That is traffic. Skip it. Floor it.&lt;/p&gt;

&lt;p&gt;But I am going to keep a short, stubborn list of things I will write myself, sit through myself, and show up for myself, even when an agent could do a passable imitation. The text to the friend whose dad just died. The actual conversation with my friends instead of a tidy summary of it. The dinner where the phone and the agent both stay in the other room. Not because the machine would do it badly, but because the doing is the part that is mine, and it is the only part I do not get back.&lt;/p&gt;

&lt;p&gt;The remote could not tell the difference between the moments worth skipping and the moments that were the whole point. It was never supposed to. That job was always Michael's, and now it is ours.&lt;/p&gt;

&lt;p&gt;Choose what you automate like it matters. Because you only get the one pass.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>discuss</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: June 4 to June 11, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 11 Jun 2026 22:07:18 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-june-4-to-june-11-2026-4bc4</link>
      <guid>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-june-4-to-june-11-2026-4bc4</guid>
      <description>&lt;p&gt;The lakehouse community spent this week arguing about versions, and the arguments mattered. Parquet contributors produced the single largest thread across all five projects with a 40-message debate on what Parquet versioning should even mean, while Iceberg shipped four release candidates of its C++ implementation in seven days and locked in a patch release plan for its two production lines. Underneath the release activity, a quieter theme connected everything: how these projects make decisions. Polaris debated merge button mechanics and HTTP status codes, Parquet contributors insisted that working group syncs cannot replace mailing list consensus, and Arrow wrote down rules for AI-generated code reviews. The formats are maturing, and so is the governance around them.&lt;/p&gt;

&lt;p&gt;Before getting into each project, the raw numbers set the scene. The five dev lists combined for 358 emails this week. Iceberg led with 135 emails across 34 threads from 51 distinct participants, followed by Polaris at 114 emails across 23 threads from a tight group of 14 regulars. Parquet concentrated 72 emails into only 7 threads, which tells you its conversations ran deep rather than wide. Arrow posted 24 emails across 11 threads from 18 participants, and DataFusion rounded things out at 13 emails across 6 threads. The shape of those numbers matters as much as the totals. Iceberg's breadth reflects a project with a dozen parallel workstreams from spec evolution to language implementations to community events. Polaris's depth from a small group reflects a project where a core team is hammering out operational fundamentals. And Parquet's concentration reflects a community wrestling with a handful of existential questions all at once.&lt;/p&gt;
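
&lt;p&gt;To make the "deep rather than wide" point concrete, here is a quick back-of-the-envelope sketch using only the counts quoted above. Average emails per thread is my own rough proxy for conversation depth, not a metric any of the projects publish.&lt;/p&gt;

```python
# Email and thread counts exactly as quoted in the paragraph above.
lists = {
    "Iceberg": (135, 34),
    "Polaris": (114, 23),
    "Parquet": (72, 7),
    "Arrow": (24, 11),
    "DataFusion": (13, 6),
}

total_emails = sum(emails for emails, _ in lists.values())

# Average emails per thread: an assumed, rough proxy for how deep each list ran.
depth = {
    name: round(emails / threads, 1)
    for name, (emails, threads) in lists.items()
}

print("total emails:", total_emails)
for name, value in depth.items():
    print(name, value)
```

&lt;p&gt;By that measure Parquet averages a little over ten emails per thread, roughly twice the next deepest list, which fits the picture of a few long debates rather than many short ones.&lt;/p&gt;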

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;The Iceberg dev list logged 135 emails across 34 threads from 51 participants this week, and the headline work happened in the spec.&lt;/p&gt;

&lt;p&gt;Ryan Blue's vote to &lt;a href="https://lists.apache.org/thread/hkj1tx8vwnsncdp11czzqsv5pbwds4h4" rel="noopener noreferrer"&gt;add a draft bitmap spec to git&lt;/a&gt; drew 14 messages and broad support, with binding +1s from Amogh Jahagirdar and others, plus non-binding approval from Micah Kornfield, who left clarity comments for implementers. The bitmap format targets small bitmaps, and the discussion surfaced a practical wrinkle worth watching. Péter Váry supported the move but flagged that delete vectors will need good compression if the community wants to store them in metadata files. Kornfield also asked Ryan a sharp process question: given the limited nature of the vote, what are the decision factors for actually promoting the draft to a finalized spec? That question echoes through several other Iceberg threads this week, because the project is increasingly comfortable landing draft specifications in git and iterating in the open rather than perfecting documents in Google Docs first.&lt;/p&gt;

&lt;p&gt;The most consequential design debate centered on the REST catalog protocol. A discussion on &lt;a href="https://lists.apache.org/thread/mmb0bb8sj12tj64swv8pmm10pqgrpo3c" rel="noopener noreferrer"&gt;adding an X-Iceberg-Client-Capabilities header to the REST spec&lt;/a&gt; evolved into a full conversation about a v2 loadTable endpoint. Ryan Blue laid out the case for v2, including optional locations, optional snapshots, and moving credentials out of properties. Russell Spitzer agreed those are good reasons but questioned whether a v2 endpoint actually changes the capability negotiation problem the header was meant to solve. The sharpest pushback came from Christian Thiel of the Lakekeeper project, who challenged the sentiment that a v2 loadTable should mandate that clients fail when they encounter unsupported restrictions. His argument is grounded in adoption reality: a v2 endpoint gets adopted for many reasons, and strict failure semantics create friction for clients that have nothing to do with the restriction features. Kurtis Wright backed the v2 direction after missing the original community meeting discussion. This thread is the one to follow if you build or operate REST catalogs, because the outcome shapes how every engine negotiates features with every catalog for years.&lt;/p&gt;

&lt;p&gt;Step back and the stakes become clearer. The REST catalog spec is now the contract that binds the entire commercial Iceberg ecosystem together. Every managed catalog service, every query engine, and every standalone tool implements some slice of it, and those slices increasingly diverge in subtle ways. A capabilities header gives clients a standard way to declare what they understand, which lets catalogs make informed decisions about what to return. A v2 loadTable goes further by fixing accumulated design debt in the most heavily trafficked endpoint in the protocol. The tension Thiel identified is the classic protocol evolution dilemma: strict semantics protect correctness for new features like fine grained access control, where a client silently ignoring a row filter is a security incident, but strictness also slows adoption by punishing clients for capabilities unrelated to their workload. How the community threads that needle will determine whether v2 arrives as a clean upgrade path or a compatibility minefield. The fact that catalog implementers like Thiel, engine maintainers like Spitzer, and spec authors like Blue are all in the same thread arguing in good faith is the system working as designed.&lt;/p&gt;

&lt;p&gt;Prashant Singh's &lt;a href="https://lists.apache.org/thread/13bzvwqmc4nj64qdo282lsr5t5w51r99" rel="noopener noreferrer"&gt;summary of the dedicated sync on finer grained read restrictions&lt;/a&gt; connects directly to that capabilities debate. The room landed on capabilities handling as a core piece of the fine grained access control design, and Singh posted the recording and an AI-assisted summary for those who could not attend. Sung Yun extended the FGAC conversation with a thoughtful post on a &lt;a href="https://lists.apache.org/thread/cdph7m2hq5kmgfj5tq55o14nr31cynd3" rel="noopener noreferrer"&gt;write-path gap for field-id-bound policies during schema evolution&lt;/a&gt;. The read side of the proposal binds row filters and masks to field IDs so they survive schema evolution safely, but Yun points out that the write path has no equivalent story yet. Securing reads while leaving writes unguarded is a half-finished lock, so expect this gap to get attention as the proposal matures.&lt;/p&gt;

&lt;p&gt;Security work continued on a second front. Adam Szita published a &lt;a href="https://lists.apache.org/thread/fm09lcpc6q13hyfont47tvflfy9w9n7j" rel="noopener noreferrer"&gt;spec proposal for KMS credential vending&lt;/a&gt; through the REST catalog, separating credential management for KMS and Vault systems from the broader table encryption discussion. The intent is to let catalogs vend KMS credentials the same way they vend storage credentials today, which would make table-level encryption practical in multi-engine deployments where distributing key access manually does not scale.&lt;/p&gt;

&lt;p&gt;On the release front, Amogh Jahagirdar &lt;a href="https://lists.apache.org/thread/k5pfk4rork0mp2p303pd70q9nx0tsl9w" rel="noopener noreferrer"&gt;kicked off planning for 1.11.1 and 1.10.3 patch releases&lt;/a&gt; after encountering a bug where the Spark rewrite manifests procedure fails to carry over first row IDs correctly. The thread gathered 13 messages and quick consensus. Steven Wu pointed to the existing 1.11.1 milestone, Yufei Gu and Daniel Weeks added their support, and Weeks made the operating principle explicit: keep the 1.10 backports narrow so the release stays easy and helps anyone who has not yet moved forward. Meanwhile Neelesh Salian &lt;a href="https://lists.apache.org/thread/w5qnj9gnl4rjhjnzyxlsbdzjx3kw9j8q" rel="noopener noreferrer"&gt;opened planning for Apache Iceberg 1.12.0&lt;/a&gt; with a direct acknowledgment that 1.11.0 took roughly eight months from 1.10.0, longer than the project wants. Steven Wu's response captured the philosophy the community is converging on: with a regular release habit, nobody needs to hold the release train for their feature, because the next train leaves in two to three months. Salian also published the &lt;a href="https://lists.apache.org/thread/5opjppof69rq9f2lpxjt410667s8hc24" rel="noopener noreferrer"&gt;Iceberg 1.11 feature branch retrospective&lt;/a&gt; conclusion over on the Polaris list crossover thread, where Alexandre Dutra summarized the community's honest feedback by recommending the feature branch experiment not be repeated.&lt;/p&gt;

&lt;p&gt;The C++ implementation provided the week's endurance story. Junwang Zhao proposed &lt;a href="https://lists.apache.org/thread/vo8wfvp4cncggng31c5l6ksh7nnv1bsm" rel="noopener noreferrer"&gt;RC0 of Apache Iceberg C++ 0.3.0&lt;/a&gt; on June 6, and what followed was a sprint through &lt;a href="https://lists.apache.org/thread/fkvc8wmzokym5hmtv8gb3w6b9k8fgbp9" rel="noopener noreferrer"&gt;RC1&lt;/a&gt;, &lt;a href="https://lists.apache.org/thread/x5y5h0yk8vwzgm7r5b78o0tmp442hyky" rel="noopener noreferrer"&gt;RC2&lt;/a&gt;, and &lt;a href="https://lists.apache.org/thread/8b145f8kw1mwy0ggn9jyfj6khbkb70l3" rel="noopener noreferrer"&gt;RC3&lt;/a&gt; by June 11. Each candidate fixed issues the previous one surfaced. Matt Topol's RC2 verification caught real gaps in the release tooling, including undocumented meson and gtest requirements and an SSL workaround needed for the curl dependency, and Gang Wu called for improving the release script to catch similar issues automatically. By RC3, verification reports were coming in clean from macOS and Ubuntu environments across multiple contributors including Steven Wu, Raúl Cumplido, and Tanmay Rauth. Four release candidates in a week is not a failure story. It is what a healthy verification culture looks like when a young implementation is still hardening its release process.&lt;/p&gt;

&lt;p&gt;Spec precision got its own dedicated attention. Andrei Tserakhau called a &lt;a href="https://lists.apache.org/thread/gz432tvboxvno2v7g3l17c8tbtxckxrb" rel="noopener noreferrer"&gt;vote to clarify that the day partition transform's result type is date&lt;/a&gt; in the spec, gathering ten messages of support including binding +1s from Matt Topol and others within hours. The companion &lt;a href="https://lists.apache.org/thread/wx40fxmplrlsmwhyn8dqohm0ppnshgp1" rel="noopener noreferrer"&gt;discussion on the Avro schema ambiguity for day transform fields in manifests&lt;/a&gt; shows why this dry-sounding clarification matters: Tserakhau noted the ambiguity bit someone again just last week on the Go side, where compacting a Spark-written table produced incompatible manifests. Kevin Liu suggested keeping the spec explanation format agnostic, and the fix landed in PR review. Small spec ambiguities compound into real interoperability bugs once five language implementations write the same metadata.&lt;/p&gt;

&lt;p&gt;The function catalog work crossed a milestone when huaxin gao's &lt;a href="https://lists.apache.org/thread/ttmmgzqhsdqt35stgtvzjmfgl42hgvw2" rel="noopener noreferrer"&gt;vote on REST spec endpoints for listing and loading functions&lt;/a&gt; passed with ten +1 votes, five of them binding. Szehon Ho used his +1 to suggest tracking a specific-name for convenience over definition-id so engines can refer to each overloaded version of a function. With the spec change merging, Iceberg moves closer to catalogs that serve shared function definitions to every connected engine, which matters enormously for teams tired of reimplementing the same UDFs in Spark, Trino, and Flink.&lt;/p&gt;

&lt;p&gt;The variant data type push kept its momentum through two threads and a sync. Neelesh Salian posted the &lt;a href="https://lists.apache.org/thread/b2krpxxdqomlb19ffchmqwlsv8rhf59h" rel="noopener noreferrer"&gt;variant tracking document and sync notes&lt;/a&gt;, and the follow-up &lt;a href="https://lists.apache.org/thread/sgo7g0voc2ctl1sr2fpf4qln5wmwlwwq" rel="noopener noreferrer"&gt;discussion on variant shredding policy across Iceberg implementations&lt;/a&gt; tackled a subtle problem: aligning not just on the type definition but on how implementations shred variant values into columnar storage. Kurtis Wright praised the community for aligning on implementations rather than stopping at types. Shredding policy differences between engines would produce files that are technically spec compliant but perform wildly differently depending on which engine wrote them, so this alignment work protects the performance portability that makes Iceberg valuable.&lt;/p&gt;

&lt;p&gt;Performance optimization proposals arrived from Varun Lakhyani, who opened two related threads on cutting S3 request counts. His &lt;a href="https://lists.apache.org/thread/yb8nom3w2zplb703m0p052kcc1wwotrr" rel="noopener noreferrer"&gt;proposal to combine three GET calls for Parquet reads&lt;/a&gt; targets small file workloads where Iceberg currently issues two GETs for the footer and one for data when a single GET could fetch the whole file. The companion idea to &lt;a href="https://lists.apache.org/thread/csvfnhqgcpdbogb9yo29pdhdkbzdrrlq" rel="noopener noreferrer"&gt;store Parquet footer size in Iceberg metadata&lt;/a&gt; would let readers skip footer discovery entirely. For workloads on object storage where request costs and latency dominate, a two-thirds reduction in GET calls for small files is real money.&lt;/p&gt;
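
&lt;p&gt;The arithmetic behind that two-thirds figure is simple enough to sketch. The request counts come from the thread as described above. The file count is a made-up illustration, and &lt;code&gt;get_requests&lt;/code&gt; is a hypothetical helper, not anything in Iceberg.&lt;/p&gt;

```python
# Request counts as described above: a small-file read currently costs
# two GETs for the footer plus one for the data; the proposal is one GET.
GETS_TODAY = 2 + 1
GETS_PROPOSED = 1


def get_requests(num_files, gets_per_file):
    """Total GET requests needed to read num_files small files."""
    return num_files * gets_per_file


# Hypothetical scan over 10,000 small files (an illustrative number only).
before = get_requests(10_000, GETS_TODAY)
after = get_requests(10_000, GETS_PROPOSED)
reduction = (before - after) / before

print(before, after, round(reduction, 3))
```

&lt;p&gt;The ratio holds at any scale, so the absolute savings grow with the number of small files a scan touches.&lt;/p&gt;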

&lt;p&gt;Looking further ahead, Daniel Weeks proposed &lt;a href="https://lists.apache.org/thread/w0xqrm0dpnsgvw0dyvy4r34y0dtzmn7f" rel="noopener noreferrer"&gt;default value expressions for the v4 spec&lt;/a&gt;, building on the earlier expressions proposal to let defaults be computed rather than constant. Xiening Dai and Maninder Parmar continued working through &lt;a href="https://lists.apache.org/thread/08nykzs7b9bdp1lvy0qnzglmbg1b254d" rel="noopener noreferrer"&gt;global snapshot consistency for Iceberg tables&lt;/a&gt;, comparing a commit sequence number approach against a batch LoadTables API and concluding the two are complementary rather than contradictory. Mukund Thakur asked for review on his &lt;a href="https://lists.apache.org/thread/4h6g5r633r65x5k92vqsn9ho0bhnry36" rel="noopener noreferrer"&gt;proposal for repartitioning old partition spec data files&lt;/a&gt;, which has been waiting since mid-May. Robert Kruszewski noticed that &lt;a href="https://lists.apache.org/thread/9wp0xrr8jl6f615o335oooh9mjzxt2z5" rel="noopener noreferrer"&gt;Iceberg's arrow-java dependency is more than two years old&lt;/a&gt; at 15.0.2 and offered to drive the upgrade to 19.0.0. And Joana Hrotkó proposed &lt;a href="https://lists.apache.org/thread/25zccjjpmrkx6pp350s64gvvvlx1lg18" rel="noopener noreferrer"&gt;exposing the commit retry exhaustion reason in failure messages&lt;/a&gt;, a small operability win for anyone who has stared at an opaque commit failure at 2 AM.&lt;/p&gt;

&lt;p&gt;Community infrastructure had a moment too. Bob Thomson from ASF Infra reported that &lt;a href="https://lists.apache.org/thread/9s207npdlb76n458h209dgbgmfcttjz8" rel="noopener noreferrer"&gt;Iceberg is the top consumer of shared GitHub-hosted runners&lt;/a&gt; over the last seven days, with overall utilization maxing out daily. The timing was good, because Vova Kolmakov had already proposed &lt;a href="https://lists.apache.org/thread/f9xhm6mwyspt15j06v14bkjjb4hts4yz" rel="noopener noreferrer"&gt;running JDK 21 tests only on main and nightly builds&lt;/a&gt; to halve PR runner minutes, and Ajantha Bhat pointed to his open PR doing exactly that plus incremental CI builds, which has been waiting for review. On the events side, the &lt;a href="https://lists.apache.org/thread/ngfrz7cdqpn2h97jm1zpfjctvclc3xzq" rel="noopener noreferrer"&gt;Iceberg Summit 2027 location discussion&lt;/a&gt; turned into a friendly bidding war, with Viktor Kessler pitching Barcelona, Paris, and Berlin under the banner of making Iceberg global, while Danica Fine reminded everyone that &lt;a href="https://lists.apache.org/thread/gc0wbgh7q8yh4hf1ctz7rfmqnyssg2th" rel="noopener noreferrer"&gt;Lakehouse Day EU in Glasgow&lt;/a&gt; this October already gives the EMEA community a major gathering, co-located with Community Over Code and with its agenda now live. Kessler also announced the &lt;a href="https://lists.apache.org/thread/7sloq2kbmsvnwb7915dycpy9yb8s0cwy" rel="noopener noreferrer"&gt;Iceberg Community Meetup Europe in Munich on July 22&lt;/a&gt;. Alex Stephen shared a healthy &lt;a href="https://lists.apache.org/thread/8xckb3h2rr421yswd2x53yb2zds8vmks" rel="noopener noreferrer"&gt;Iceberg Terraform Provider update&lt;/a&gt; with namespace and table management now supported, and huaxin gao posted notes from both the &lt;a href="https://lists.apache.org/thread/yyx9x83s0dngf9py3lqvvxo07w10tw1k" rel="noopener noreferrer"&gt;constraint support sync&lt;/a&gt; and the &lt;a href="https://lists.apache.org/thread/b4rt9n4t703bps9qc8xo6tk9g3cx92k1" rel="noopener noreferrer"&gt;index support sync&lt;/a&gt; series.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;Polaris generated 114 emails across 23 threads this week, and the volume tells you something: this project is in the thick of working out what a production catalog service owes its operators.&lt;/p&gt;

&lt;p&gt;The biggest thread by message count was, surprisingly, about the merge button. Jean-Baptiste Onofré opened a PR to &lt;a href="https://lists.apache.org/thread/92x2yz3ckjx31kfz77js90wyhsoxxq86" rel="noopener noreferrer"&gt;enable all three GitHub merge actions&lt;/a&gt;, adding merge commits and rebase-and-merge alongside the existing squash-and-merge, and the thread ran to 23 messages. Yong Zheng merged it before seeing the discussion, offered to revert, and JB waved it off with characteristic calm. The substantive objection came from Alexandre Dutra, who sees some value in rebase-and-merge when used wisely but struggles to imagine a useful case for merge commits, and worries about what happens when someone uses the wrong button on a messy branch. Twenty-three messages about merge strategies sounds like bikeshedding until you remember that commit history is how a project audits itself, and Polaris contributors clearly care about getting their development hygiene right while the project is still young enough to set habits.&lt;/p&gt;

&lt;p&gt;The week's best protocol discussion came from Nándor Kollár, who asked the community to settle the &lt;a href="https://lists.apache.org/thread/tr8zh8121t2jb41s0q2yd9s73y2tp2tq" rel="noopener noreferrer"&gt;correct HTTP status code for table and view rename conflicts&lt;/a&gt; when a conflicting operation is in progress. The current behavior returns a 500, which Dmitri Bourlatchkov reviewed and declared most certainly not correct, since 5xx codes signal fundamental service failure beyond the client's control. The candidates each have problems: 503 implies the whole service is unhealthy, 429 means rate limiting and is not defined for rename in the Iceberg REST spec, and 409 traditionally signals a conflict the client should not blindly retry. Seventeen messages in, the thread had become a genuinely useful seminar on REST semantics for catalog operations. The resolution matters beyond Polaris, because whatever convention Polaris adopts will influence how clients across the ecosystem implement retry logic for concurrent catalog operations.&lt;/p&gt;
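
&lt;p&gt;To see why the choice matters to clients, here is a minimal sketch of a retry loop whose behavior hinges entirely on the status code. The function name, the retryable set, and the backoff values are all assumptions made for illustration; none of this comes from Polaris or the Iceberg REST spec.&lt;/p&gt;

```python
import time

# Hypothetical client-side policy (an assumption, not a spec):
# 429 and 503 are treated as transient and retried with backoff.
# Anything else, including a 409 conflict, is handed back to the caller.
RETRYABLE = {429, 503}


def rename_with_retry(send_rename, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call send_rename() until it returns a non-retryable status or attempts run out.

    send_rename is any zero-argument callable that returns an HTTP status code.
    Returns the list of statuses observed, ending with the final one.
    """
    seen = []
    for attempt in range(max_attempts):
        status = send_rename()
        seen.append(status)
        if status not in RETRYABLE:
            break  # success, a 409 conflict, or an error this policy does not retry
        if attempt + 1 != max_attempts:
            # Exponential backoff between attempts: base_delay, then 2x, then 4x.
            sleep(base_delay * 2 ** attempt)
    return seen
```

&lt;p&gt;Under this policy a 409 goes straight back to the caller, while a 429 or 503 triggers a backoff and another attempt. A client written this way takes a different branch on a concurrent rename depending on which code the server returns.&lt;/p&gt;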

&lt;p&gt;Operational maturity drove a cluster of related threads on events and metrics. Yong Zheng raised the need for a &lt;a href="https://lists.apache.org/thread/5nst0f2ygnl2gj3j910q7m8nk2fvokc7" rel="noopener noreferrer"&gt;mechanism to purge the events and metrics tables&lt;/a&gt;, since Polaris now persists both event streams and Iceberg metrics with no retention story. Kollár noted the urgency grows as event persistence expands to more event types, and Bourlatchkov suggested the Admin tool as the natural home, similar to the existing NoSQL maintenance task. Zheng followed with a &lt;a href="https://lists.apache.org/thread/ogskc1szctkg5n0tdj0cm3pfkowcwx4z" rel="noopener noreferrer"&gt;proposal for filters on Iceberg metrics reporting&lt;/a&gt;, sketching expressions that match on catalog, namespace, and table name. Bourlatchkov floated CEL as the filter language before recalling that prior community consensus leaned toward removing CEL, leaving include and exclude lists with glob patterns as the likely landing spot. The largest design question in this cluster came from Yufei Gu, who proposed &lt;a href="https://lists.apache.org/thread/x9j8nscvy8hq61tyn01mj8yp6n9of0kp" rel="noopener noreferrer"&gt;routing Iceberg scan and commit metrics through the events subsystem&lt;/a&gt; rather than maintaining a parallel persistence path, since synchronous metrics persistence chokes the Polaris persistence layer. Anand Kumar Sankaran noted with a smile that his original metrics PR proposed exactly this before the community decided to keep them separate, and flagged that any change here is a breaking schema migration. Dutra found the events approach appealing but wants performance overhead evaluated thoroughly first.&lt;/p&gt;
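
&lt;p&gt;A rough sketch of what include and exclude lists with glob patterns could look like, using Python's standard fnmatch module. The dotted catalog.namespace.table identifier and the should_report helper are hypothetical, not Polaris configuration or API.&lt;/p&gt;

```python
from fnmatch import fnmatchcase


def should_report(identifier, include=("*",), exclude=()):
    """Decide whether metrics for one table should be reported.

    identifier is a dotted "catalog.namespace.table" string (an assumed convention).
    A table is reported when it matches at least one include pattern
    and matches no exclude pattern.
    """
    included = any(fnmatchcase(identifier, pattern) for pattern in include)
    excluded = any(fnmatchcase(identifier, pattern) for pattern in exclude)
    return included and not excluded


# Example: report everything in the "prod" catalog except temporary tables.
INCLUDE = ("prod.*",)
EXCLUDE = ("*.tmp_*",)
report_orders = should_report("prod.sales.orders", INCLUDE, EXCLUDE)
report_scratch = should_report("prod.sales.tmp_orders", INCLUDE, EXCLUDE)
```

&lt;p&gt;Note that in fnmatch a * also matches dots, so prod.* covers every namespace and table in the prod catalog. A real design would need to decide whether wildcards may cross those boundaries.&lt;/p&gt;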

&lt;p&gt;That events subsystem got its own scrutiny in Dutra's thread on &lt;a href="https://lists.apache.org/thread/yhs40z7r90mdpqbfzpwhqgxdrd8pln96" rel="noopener noreferrer"&gt;event delivery ordering and concurrency guarantees&lt;/a&gt;, prompted by a PR that shifted delivery to a blocking executor. The previous behavior implicitly relied on Vert.x event bus semantics that nobody had written down. Kollár argued listeners should be documented as thread-safe and that strict ordering rarely matters as long as every event arrives, and Gu took the pragmatic position: keep ordered delivery as the only behavior now, and introduce unordered delivery only if a real need appears. Documenting implicit guarantees before users depend on them accidentally is exactly the kind of unglamorous work that separates production infrastructure from promising prototypes.&lt;/p&gt;

&lt;p&gt;JB's &lt;a href="https://lists.apache.org/thread/vr3tbs2ggp5fn5qtcz6br4srgvsoknrv" rel="noopener noreferrer"&gt;Polaris Directories proposal&lt;/a&gt; advanced after several months of design work, and the discussion sharpened around one architectural question: where does the scanner live? Gu argued that if the scanning component sits completely outside Polaris, the user experience becomes confusing, with Polaris storing only directory configuration while real work happens elsewhere. JB clarified his two-step plan, landing configuration and high-level architecture first, then building the scanning service as part of Polaris proper. Romain Manni-Bucau pushed on extensibility, asking whether users can plug in their own metadata and whether scanning will be streaming friendly rather than batch only. Directories would give Polaris a way to govern data that has not yet been formalized into Iceberg tables, which extends the catalog's reach into the messy reality of most data lakes.&lt;/p&gt;

&lt;p&gt;Release machinery is turning for &lt;a href="https://lists.apache.org/thread/1kmf1bqp0js8wjqj7pzr8y3z66ff0sss" rel="noopener noreferrer"&gt;Apache Polaris 1.6.0, targeted around June 26&lt;/a&gt;. EJ Wang reported no must-have blockers and plans to cut from main, while Adnan Hemani asked to land one PR first, a fix for a documentation versioning issue that had gone unreported for a while. JB updated the release process documentation to match. In parallel, the project took a step toward friendlier adoption when Yong Zheng proposed &lt;a href="https://lists.apache.org/thread/gf1zxlnyflqbwnrrx4jbbffnjtd0ngdb" rel="noopener noreferrer"&gt;promoting the polaris CLI from PyPI&lt;/a&gt; as the recommended setup for non-development use, sparing users a full repository clone. Gu, JB, and Hemani all backed it immediately.&lt;/p&gt;

&lt;p&gt;Two storage-layer threads rounded out the design work. Gu's proposal for &lt;a href="https://lists.apache.org/thread/wnssxy75j5fb4ytpsfy5z55fvzx3yg3q" rel="noopener noreferrer"&gt;making unique table locations the default&lt;/a&gt; won quick support from Russell Spitzer, who endorsed taking determinism out of table creation paths as a safety improvement. Bourlatchkov raised an important operational catch: with randomized locations, long-running staged create operations like CTAS face a credential refresh problem, connecting to the &lt;a href="https://lists.apache.org/thread/ypdotvvvnndrhm7hv5cps37w4dphl8j6" rel="noopener noreferrer"&gt;credential refresh discussion&lt;/a&gt; Gu had flagged earlier in the week and to active design work on the Iceberg side. Bourlatchkov also recapped community sync consensus on &lt;a href="https://lists.apache.org/thread/7g400hw4rhfzz4f5wdslrqd6ft02jd2g" rel="noopener noreferrer"&gt;supporting multiple storage configurations per catalog&lt;/a&gt;, with authorization aspects deferred. And the &lt;a href="https://lists.apache.org/thread/z27s3rxbkbz706c7qo736ojlf3kjv3mq" rel="noopener noreferrer"&gt;Iceberg table encryption discussion&lt;/a&gt; continued between Gu and Bourlatchkov, working through whether Polaris can realistically test against encrypted Iceberg tables today. The answer is yes with caveats, and the work proceeds incrementally starting with internal Polaris workflows that touch encrypted files.&lt;/p&gt;

&lt;p&gt;Testing infrastructure produced this week's most quietly notable line. In the &lt;a href="https://lists.apache.org/thread/19zk75fo5vh71k227fbsyrcxgthnn2hm" rel="noopener noreferrer"&gt;object storage mock testing thread&lt;/a&gt;, Russell Spitzer shared a proof of concept he implemented with Claude's help, comparing approaches for testing file operations without real cloud containers. Robert Stupp agreed the POC clarifies the layering problem and they converged on a split: synthetic FileIO for generated listings and pure file operation behavior, real containers where fidelity matters. Bourlatchkov also opened threads on &lt;a href="https://lists.apache.org/thread/5gjfrwlztz5c75pk586gwtnq41lydhnq" rel="noopener noreferrer"&gt;retiring the regtests code&lt;/a&gt; in favor of Yong Zheng's new Spark smoke tests, fixing a &lt;a href="https://lists.apache.org/thread/9jwckjn6obxl8fb6dlj18y15ckxop3t4" rel="noopener noreferrer"&gt;Principal Role validation regex&lt;/a&gt; through a REST spec change, and a subtle &lt;a href="https://lists.apache.org/thread/0vwl1w207n6vpkm8pgjv4vbpg0307g91" rel="noopener noreferrer"&gt;JSONB reformatting issue in PostgreSQL persistence&lt;/a&gt; that argues for semantic JSON comparison in entity tests.&lt;/p&gt;
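
&lt;p&gt;The JSONB point is easy to reproduce in miniature. PostgreSQL's jsonb type does not preserve whitespace or key order, so a byte-for-byte string comparison can fail even when nothing meaningful changed. A minimal sketch, with made-up entity JSON and a hypothetical json_equal helper:&lt;/p&gt;

```python
import json


def json_equal(left, right):
    """Compare two JSON documents by parsed value instead of by raw text."""
    return json.loads(left) == json.loads(right)


# What a test might write, and an illustrative example of how the same
# document can look after a round trip through a jsonb column.
written = '{"name":"t1","props":{"b":2,"a":1}}'
read_back = '{"name": "t1", "props": {"a": 1, "b": 2}}'

same_text = written == read_back               # raw strings differ
same_value = json_equal(written, read_back)    # parsed values match
```

&lt;p&gt;Comparing parsed values keeps the test sensitive to real changes, such as a different property value, while ignoring formatting the database is free to rewrite.&lt;/p&gt;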

&lt;p&gt;The lineage conversation kept building. Adnan Hemani and Robert Stupp continued their &lt;a href="https://lists.apache.org/thread/yxon21n43vofrnzxyh42yyh339c1nnw7" rel="noopener noreferrer"&gt;OpenLineage follow-up&lt;/a&gt; by working through what Polaris should do when lineage events reference non-Polaris datasets on both ends, with Stupp calling for broader community input because the options on the table represent materially different commitments. And Sankaran proposed a &lt;a href="https://lists.apache.org/thread/yq1sz8y0nkfhloycw9lrqtc9k084ln2f" rel="noopener noreferrer"&gt;GCP counterpart to AWS STS session tags&lt;/a&gt; so Polaris can correlate vended-credential data access back to the catalog operation that issued the credential on Google Cloud, closing an auditability gap between cloud providers.&lt;/p&gt;

&lt;p&gt;Taken together, the week's Polaris threads sketch the profile of a catalog growing into production responsibilities. Almost nothing this week was about new catalog features in the demo sense. Instead the community worked on retention for its own telemetry, correct HTTP semantics under concurrency, documented threading guarantees, credential lifecycle edge cases in staged writes, audit correlation across clouds, and test infrastructure that does not require a cloud bill. This is the unglamorous middle phase of an infrastructure project's life, after the architecture is proven and before the enterprise checklists are fully satisfied, and how a community handles this phase predicts whether operators will trust it with their metadata five years from now. The Polaris regulars, a group of roughly fourteen people this week, are handling it with notable discipline, and the 1.6.0 release later this month will carry the early fruits of that work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow had a steadier week at 24 emails across 11 threads, anchored by a release and a governance decision about AI tooling.&lt;/p&gt;

&lt;p&gt;Andrew Lamb shepherded &lt;a href="https://lists.apache.org/thread/xlozjylbqfo7tgh2lcvb6d3dvj5bwwxd" rel="noopener noreferrer"&gt;Apache Arrow Rust 59.0.0 through its RC2 vote&lt;/a&gt; after RC1 hit a verification problem that Ed Seidl fixed. Verification reports came in from Seidl on RHEL 8, Raúl Cumplido on Debian 14 with Rust 1.96, Adam Reeve on Fedora 44, and L. C. Hsieh, and Lamb &lt;a href="https://lists.apache.org/thread/zmyp2zf4g3snxsc6nl977y6fm4g39stk" rel="noopener noreferrer"&gt;announced the result&lt;/a&gt; with five +1 votes, four binding, publishing to crates.io. The arrow-rs release train remains one of the most reliable in the ecosystem, which matters because half the Rust data infrastructure world, DataFusion included, builds directly on it.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://lists.apache.org/thread/y7yc4yg9n4mdqd1y00w7s498y8m6yold" rel="noopener noreferrer"&gt;discussion on automatic GitHub Copilot reviews&lt;/a&gt; produced one of the more thoughtful AI governance conversations in the ASF right now. After two weeks of testing, Cumplido found the reviews useful for ready PRs but wants them disabled for drafts, since a draft signals work in progress and an immediate bot review adds noise. Lamb agreed they help as an initial pass and pushed for documenting what contributors are expected to do with bot feedback. Sutou Kouhei synthesized the feedback into a PR with a pragmatic split: first-time contributors get one policy, returning contributors another. Alenka Frim asked the practical question nobody had answered, which is when Copilot actually considers itself satisfied with a PR, since nobody had seen it grant an approval. Arrow is writing down norms for AI participation in code review while most projects are still improvising, and other communities will likely copy this homework.&lt;/p&gt;

&lt;p&gt;The format itself saw movement on two fronts. The &lt;a href="https://lists.apache.org/thread/ofnxc1jsymppshbhrtqxtos9dw00wo3y" rel="noopener noreferrer"&gt;arrow.range canonical extension type discussion&lt;/a&gt; wrestled with naming and semantics for bounded ranges, with Felipe Oliveira Carvalho proposing distinct types per boundary closedness (half-open, closed, and the variations between) rather than a single parameterized type. And the &lt;a href="https://lists.apache.org/thread/b9ydqw5bm14htozzn1mxfr240bl2dn0s" rel="noopener noreferrer"&gt;variant type support thread&lt;/a&gt; surfaced a coordination problem: Gang Wu pointed out that several duplicate efforts are underway on variant support in Arrow C++, including work by his colleague Zehua that iceberg-cpp already depends on. Micah Kornfield confirmed community interest and pointed to the freshly opened tracking issue. Duplicate implementations of the same type are wasted effort the dev list exists to prevent, so expect consolidation here.&lt;/p&gt;

&lt;p&gt;The Arrow family also grew. Following the donation vote, Benjamin Philip &lt;a href="https://lists.apache.org/thread/6ww38cgnyq3ly176nrg1wy1o2zwsjnv1" rel="noopener noreferrer"&gt;transferred the Arrow Erlang repository&lt;/a&gt; to the ASF, and Kouhei confirmed it now lives at apache/arrow-erlang with repository setup landing next week. Flight SQL picked up two small protocol wins, with Pedro Matias &lt;a href="https://lists.apache.org/thread/tkpk2c04f7gc73rdo1wmr48mcn8l0x0s" rel="noopener noreferrer"&gt;closing the vote on the is_update field&lt;/a&gt; for prepared statement results with four binding +1s and work proceeding on Go, Java, ADBC, and JDBC implementations, while Richie Black's &lt;a href="https://lists.apache.org/thread/6fjb9dp3j7q3cw0l975bog2n5t7zd82c" rel="noopener noreferrer"&gt;COLUMN_DEF addition to Flight SQL JDBC schema metadata&lt;/a&gt; moved through its own vote. And in a thread that touches Arrow's measurement culture, Rok Mihevc and Jonathan Keane discussed the &lt;a href="https://lists.apache.org/thread/n6hxqojh510b4sgf0ojbmbt98kx82vyo" rel="noopener noreferrer"&gt;status of conbench&lt;/a&gt;, Arrow's continuous benchmarking project, with Mihevc interested in having his agents work on it and Keane happy to see anyone pick it up. The phrase "having your agents work on it" passing without comment in an ASF dev thread says plenty about where 2026 is.&lt;/p&gt;

&lt;p&gt;Arrow's quieter week should not be mistaken for a quiet project. The format has reached the stage where its biggest contributions happen downstream, in arrow-rs powering DataFusion and a growing share of the Rust analytics ecosystem, in ADBC and Flight SQL steadily replacing bespoke wire protocols, and in the C++ library serving as the substrate for iceberg-cpp and the engines built on it. That last dependency is why the variant duplication issue deserves a faster resolution than it might otherwise get. With Iceberg, Parquet, and Spark all converging on variant as the standard answer for semi-structured data, Arrow C++ sits in the critical path for every engine that wants to read shredded variant columns efficiently, and two parallel implementations mean review attention split exactly where the ecosystem can least afford it. Wu naming the problem publicly, with a disclaimer about his colleague's involvement, is the dev list doing its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet packed 72 emails into just 7 threads, and one of them was the week's heavyweight across the entire lakehouse ecosystem.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://lists.apache.org/thread/5nx8r1y2qyotvg9ov5pl99dl498twt7m" rel="noopener noreferrer"&gt;Future of Parquet Versioning discussion&lt;/a&gt; ran to 40 messages and pulled in nearly everyone who matters to the format: Ed Seidl, Andrew Lamb, Antoine Pitrou, Micah Kornfield, Daniel Weeks, Russell Spitzer, Ryan Blue, Fokko Driesprong, and Andrew Bell. The thread got off to an inauspicious start when the Google Doc anchoring the discussion started throwing terms of service violations for Seidl, Lamb, and others, an ironic argument for keeping foundational decisions in plain text on the mailing list. The substance is the question Parquet has deferred for a decade: what does a version number actually promise? Bell asked the question every practitioner asks, which is how a reader knows it has the tooling to read a given file, and what the hesitation is to simply bump version numbers. Seidl's answer exposed the uncomfortable status quo: today there is no in-use mechanism beyond parsing the created_by string, which means readers infer capabilities from writer name-dropping. The debate continues over whether Parquet should adopt feature flags, real version increments, or some hybrid, and the outcome will define how the format evolves for its second decade.&lt;/p&gt;

&lt;p&gt;The reason this debate is happening now, rather than five years ago, is that Parquet's roadmap has filled up with changes that strain the old informal model. Variant types, geometry types, new statistics, the footer redesign, and dense encodings are all arriving in a short window, and each one forces the same question of how a reader discovers it can safely consume a file. The created_by approach worked when two or three writers dominated and everyone could memorize each other's quirks. With a dozen serious implementations across Java, C++, Rust, Go, and Python, capability discovery by string parsing is a correctness bug waiting to happen at every reader-writer pairing. The versioning thread is really an interoperability thread wearing a version number costume, and the contributors arguing in it know that whatever mechanism wins must serve files that will still be read decades from now. Formats outlive engines, and they outlive companies. That is precisely why 40 messages of careful argument is time well spent.&lt;/p&gt;
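
&lt;p&gt;For a feel of what capability discovery by string parsing involves, here is a small sketch. It assumes created_by strings of the conventional shape "parquet-mr version 1.12.3 (build abc123)". The regular expression and the writer_version helper are illustrative assumptions, not code from any Parquet reader.&lt;/p&gt;

```python
import re

# Assumed shape: "APPLICATION version MAJOR.MINOR.PATCH (build HASH)".
# Only the application name and the three version numbers are captured.
CREATED_BY = re.compile(r"^(\S+) version (\d+)\.(\d+)\.(\d+)")


def writer_version(created_by):
    """Extract (application, (major, minor, patch)) from a created_by string.

    Returns None when the string does not fit the assumed shape, which is
    exactly the case a reader then has to guess about.
    """
    match = CREATED_BY.match(created_by)
    if match is None:
        return None
    application = match.group(1)
    version = tuple(int(part) for part in match.groups()[1:])
    return application, version


parsed = writer_version("parquet-mr version 1.12.3 (build abc123)")
unparsed = writer_version("impala version 1.0 (build abc123)")
```

&lt;p&gt;The second string carries a two-part version, so it falls through this pattern and comes back as None. Every reader that hand-rolls a pattern like this makes its own choices about such cases, which is how the same file can end up treated differently by different readers.&lt;/p&gt;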

&lt;p&gt;Lamb attacked the same problem from the documentation side. Convinced by recent discussions that the community must document what V1 and V2 actually mean, messy reality included, he spent several days producing a &lt;a href="https://lists.apache.org/thread/0jwhc6bdwptlormb4xpk07hnzfyz4p6p" rel="noopener noreferrer"&gt;feature-by-version documentation page&lt;/a&gt;. Pitrou pushed back with a precise objection: the page invents an a posteriori meaning for V1 and V2, and he questioned why parquet-format 2.0.0 deserves to be singled out as a meaningful boundary. Lamb conceded that earlier drafts did try to invent definitions and revised toward describing what shipped rather than what the labels should have meant. This exchange is the versioning debate in miniature. The community is discovering that before it can design future versioning, it has to agree on a truthful account of past versioning.&lt;/p&gt;

&lt;p&gt;While the philosophy unfolded, the release train kept moving. Gang Wu confirmed in the &lt;a href="https://lists.apache.org/thread/n0949bqh4dgjhmqym9kkv5y277zk0n0y" rel="noopener noreferrer"&gt;2.13.0 release discussion&lt;/a&gt; that making ColumnMetaData.path_in_schema optional needs more discussion and will not block the release, with Fokko Driesprong and Kornfield agreeing to proceed. The &lt;a href="https://lists.apache.org/thread/7kjqsz7n8cwqpgfo2h9c5q0csml77d86" rel="noopener noreferrer"&gt;vote on Apache Parquet Format 2.13.0 RC0&lt;/a&gt; collected binding +1s from Kornfield and others, with Seidl's vote carrying the best line of the week: we have waited long enough for usable float statistics. Sortable floating point statistics have been a known gap for years, and 2.13.0 finally closes it.&lt;/p&gt;

&lt;p&gt;The footer redesign work formalized its process. Jiayi Wang scheduled &lt;a href="https://lists.apache.org/thread/vz2n5qkkl4godby448lznc36sv9jxhgj" rel="noopener noreferrer"&gt;session 2 of the Parquet Footer Working Group&lt;/a&gt;, moving to a biweekly cadence, and Pitrou immediately raised the governance flag: for a change as foundational as the footer, decisions cannot be made in sync calls and merely reported to the list afterward. Wang agreed without hesitation, committing that syncs will inform but the mailing list will decide. Given that the footer working group is rethinking how every Parquet reader on earth bootstraps file access, insisting on mailing list primacy is not process pedantry. It is how the ASF model protects a format that multiple competing vendors depend on.&lt;/p&gt;

&lt;p&gt;Two type system proposals advanced. Burak Yavuz moved the &lt;a href="https://lists.apache.org/thread/m5hvh3mdgjl4482ws09wfzosotf01kqq" rel="noopener noreferrer"&gt;new File logical type proposal&lt;/a&gt; from design doc to pull requests against parquet-format and the reference implementation, after the Parquet sync aligned on keeping the field simple and minimalistic. Daniel Weeks followed up with additional context from the sync discussion. A File logical type gives engines a standard way to represent file references inside Parquet data, which matters for multimodal and document-heavy workloads where tables increasingly point at external binary content. And Divjot Arora closed the loop on the long-running &lt;a href="https://lists.apache.org/thread/9zl109s34zzzhjlnvls4g8mobb2hydcy" rel="noopener noreferrer"&gt;INT96 statistics question&lt;/a&gt;, announcing the community has settled on introducing a new ColumnOrder to signal statistics validity for INT96 columns. Seidl endorsed it immediately, noting a new ColumnOrder is far preferable to parsing created_by strings, and offered a Rust proof of concept once the format PR lands. Notice the pattern: two separate threads this week independently identified created_by string parsing as the anti-pattern to eliminate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache DataFusion
&lt;/h2&gt;

&lt;p&gt;DataFusion makes its second appearance in this newsletter with a lighter week by volume, 13 emails across 6 threads, but the quality of its release process was on full display.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://lists.apache.org/thread/lxr1tbtz329zz3lykjoxttl7ypch71sx" rel="noopener noreferrer"&gt;vote on Apache DataFusion 54.0.0 RC1&lt;/a&gt; featured the kind of drama that proves verification works. Matt Butrovich cast a -1 after Comet, the Spark accelerator built on DataFusion, showed large performance regressions on TPC-H and TPC-DS at scale factor 1000 that appeared related to Parquet metadata parsing. Andrew Lamb connected it to a similar report from Adam in the Vortex project tied to new metadata cache size limits. Butrovich investigated further, found Adam's issue went through the ListingTable API that Comet does not use, could not reproduce the regression in DataFusion alone, and retracted his -1 while deferring the Comet upgrade for more investigation. Lamb then &lt;a href="https://lists.apache.org/thread/wgtdp9nrbh8p14clf08c5t9wj3q51ro4" rel="noopener noreferrer"&gt;announced the release approved&lt;/a&gt; with 11 +1 votes, 7 binding. A downstream consumer running thousand-scale-factor benchmarks against a release candidate and the project taking the result seriously is exactly how the Rust data stack has earned its reputation.&lt;/p&gt;

&lt;p&gt;Lamb also submitted the &lt;a href="https://lists.apache.org/thread/09y1l7f10o22dx393ln7y9wnl48soblx" rel="noopener noreferrer"&gt;ASF board report&lt;/a&gt; after crowdsourcing input from the community, and opened the &lt;a href="https://lists.apache.org/thread/x2k66nv46289ofcnlntcrv0gy83w1g8g" rel="noopener noreferrer"&gt;2026 Q3-Q4 roadmap discussion&lt;/a&gt; with a tracking ticket inviting the community to say where it wants the project to go. Recognition arrived from inside the foundation too, with Rich Bowen inviting the project to a &lt;a href="https://lists.apache.org/thread/195rmvn2jzyclxsk5243gt5bs4xf1771" rel="noopener noreferrer"&gt;PlusOne.apache.org interview&lt;/a&gt;, citing the 54.0 release, the new Java bindings, and a remarkable growth trajectory. Meanwhile Bob Thomson's infra review brought good news on the resource front: &lt;a href="https://lists.apache.org/thread/znby696kvqb31vbybdysko251mntqb4g" rel="noopener noreferrer"&gt;DataFusion has dropped out of the top consumers&lt;/a&gt; of ASF shared GitHub runners after recent CI optimization work that Oleks V. helped drive, the same week Iceberg learned it now tops that list. One project's playbook is sitting right there for the other to borrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;The week's loudest theme is that format governance is becoming as important as format features. Parquet's 40-message versioning debate, Pitrou's insistence that footer decisions happen on the list rather than in syncs, Iceberg's question about when a draft spec in git becomes a finalized spec, and even the Polaris merge button thread are all the same conversation: as these projects become load-bearing infrastructure for the industry, the process by which they change matters as much as the changes themselves. Two separate Parquet threads independently named created_by string parsing as the failure mode to engineer away, which is what happens when a format relies on convention where it needs specification. Iceberg's day transform clarification, prompted by a real interoperability bug between Spark-written and Go-compacted tables, is the same lesson at smaller scale.&lt;/p&gt;

&lt;p&gt;The variant type is now a genuinely cross-project effort, and this week showed both its promise and its coordination cost. Iceberg contributors aligned on shredding policy across implementations, Arrow surfaced duplicate variant implementations in C++ that need consolidation, and iceberg-cpp already depends on one of them. Semi-structured data support is arriving across the whole stack at once, which is exactly why the alignment syncs Neelesh Salian is running matter. Metadata efficiency formed a third connective thread: Iceberg proposals to cut GET calls and store footer sizes, the Parquet footer working group rethinking file bootstrap, and a DataFusion release candidate nearly held up by metadata cache behavior all point at the same bottleneck. The data files are fast. The metadata round trips are the tax everyone is now optimizing.&lt;/p&gt;

&lt;p&gt;Finally, AI is quietly becoming part of how these communities work. Arrow is writing policy for Copilot reviews, Russell Spitzer prototyped Polaris test infrastructure with Claude's help, Iceberg syncs circulate AI-assisted summaries, and Rok Mihevc casually offered his agents for conbench maintenance. None of this was framed as remarkable by the participants, which is the remarkable part.&lt;/p&gt;

&lt;p&gt;For practitioners, the week distills into three watch items. First, if you operate REST catalogs or pin client versions in production, the v2 loadTable and capabilities outcome will eventually reach your upgrade planning, so the time to read that thread is before the vote rather than after. Second, the metadata efficiency work across Iceberg and Parquet signals that small-file performance on object storage is getting first-class attention at the format level, which may relieve pressure on some of the compaction gymnastics teams perform today, even though compaction remains essential for the foreseeable future. Third, the float statistics fix in parquet-format 2.13.0 and the INT96 ColumnOrder decision both close long-standing correctness gaps in predicate pushdown, and engines will pick these up over the coming release cycles, so expect quiet query performance improvements on float-heavy datasets without changing a line of your own code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch for the Iceberg C++ 0.3.0 RC3 result and the outcome of the v2 loadTable capabilities debate, which will shape REST catalog evolution well beyond this release cycle. Polaris 1.6.0 branches around June 26, the Parquet footer working group reconvenes June 23 with its mailing-list-first commitment in place, and the parquet-format 2.13.0 vote should close with float statistics finally fixed. The Iceberg patch releases 1.11.1 and 1.10.3 should move to votes shortly, and the Parquet versioning thread shows no sign of slowing down. The Iceberg variant shredding alignment and the Arrow C++ variant consolidation are worth tracking as a pair, since the semi-structured data story only works if both layers land compatible implementations. On the community calendar, Munich hosts the Iceberg Europe meetup July 22, Lakehouse Day EU registration is open for Glasgow in October, and the Iceberg Summit 2027 location conversation is just getting started, with European cities making an energetic early case. If the past week is any guide, the next one will be busy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-06-11&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-06-11&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/ref=sr_1_5?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-5" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/ref=sr_1_7?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_DP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-7" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/ref=sr_1_9?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_DP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-9" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/ref=sr_1_16?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_DP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-16" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Apple Goes Agentic: AI Week of June 4-11, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 11 Jun 2026 19:34:55 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apple-goes-agentic-ai-week-of-june-4-11-2026-13mm</link>
      <guid>https://dev.to/alexmercedcoder/apple-goes-agentic-ai-week-of-june-4-11-2026-13mm</guid>
      <description>&lt;p&gt;Apple rebuilt its developer stack around AI agents at WWDC 2026 this week. At the same time, Microsoft's new coding model reached real users, a supply chain attack hit 13 AI coding tools, and the protocol layer under all of it kept spreading. Here is what happened from June 4 to June 11 and why it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: The IDE Becomes an Agent Workbench
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Xcode 27 brings coding agents to Apple development
&lt;/h3&gt;

&lt;p&gt;Apple seeded the first Xcode 27 beta, build 27A5194q, to registered developers on June 8, right after the WWDC 2026 keynote. The release turns Apple's IDE into a full agent workbench. &lt;a href="https://www.iclarified.com/101143/apple-unveils-new-ai-frameworks-and-agentic-coding-in-xcode-27" rel="noopener noreferrer"&gt;Xcode 27 integrates coding agents from Anthropic, Google, and OpenAI&lt;/a&gt; directly into the development workflow.&lt;/p&gt;

&lt;p&gt;The conference itself carried extra weight this year. &lt;a href="https://www.cnbc.com/2026/06/08/apple-wwdc-2026-live-updates.html" rel="noopener noreferrer"&gt;Tim Cook closed the keynote with a farewell to developers ahead of his move to executive chairman, and hardware chief John Ternus takes over as CEO on September 1&lt;/a&gt;. The keynote introduced iOS 27 and macOS Golden Gate, both arriving this fall. So the agentic developer stack described below is the platform Apple's next CEO inherits on day one, and it tells you where the company plans to compete.&lt;/p&gt;

&lt;p&gt;The architecture uses two engines. A local model runs on the Apple Silicon Neural Engine and handles inline code completion in real time. &lt;a href="https://www.techtimes.com/articles/318045/20260609/xcode-27-device-ai-code-completion-uses-neural-engine-skips-cloud-entirely.htm" rel="noopener noreferrer"&gt;No source code leaves the machine for these suggestions&lt;/a&gt;. Heavier work routes to cloud agents from Anthropic, Google, or OpenAI, and only after an explicit developer opt-in. That split answers the most common enterprise objection to AI coding tools. Day-to-day completion stays private by default, and cloud access becomes a deliberate choice.&lt;/p&gt;

&lt;p&gt;The agent capabilities go well past autocomplete. Agents in Xcode 27 plan work across multiple turns, write and run tests, try ideas in isolation with Playgrounds, and inspect visual changes through live previews. A new Device Hub lets agents operate the iOS Simulator and physical devices from a single workspace. A canvas renders Markdown, code changes, and previews side by side during agent conversations. The agent validates its own work, so it runs autonomously for longer stretches without a human checking every step.&lt;/p&gt;

&lt;p&gt;The release also marks a hard platform break. Xcode 27 runs only on Apple Silicon, and the application binary shrank 30 percent compared to Xcode 26. Apple tied its developer tools to its own chips at the exact moment those chips became the local inference engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apple ships its own agent skills in the toolchain
&lt;/h3&gt;

&lt;p&gt;One detail got buried under the Siri headlines and deserves attention from anyone building with agents. &lt;a href="https://dev.to/arshtechpro/wwdc-2026-xcode-27-ships-with-apples-own-agent-skills-what-they-are-and-how-to-use-them-3g2"&gt;Xcode 27 ships with seven agent skills that Apple wrote itself&lt;/a&gt;. The bundled set includes swiftui-specialist for idiomatic SwiftUI, swiftui-whats-new-27 for the newest APIs agents have barely seen in training, uikit-app-modernization for moving old UIKit code forward, test-modernizer for updating test code, and audit-xcode-security-settings for reviewing project security.&lt;/p&gt;

&lt;p&gt;The significance sits in who authored them. Until now, skill files for coding agents came from the community or from individual teams. Apple now ships first-party guidance in the toolchain, written by the people who built the frameworks. When an agent modernizes UIKit code in Xcode 27, it draws on instructions from the framework's own authors. Expect other platform vendors to copy this pattern fast. Skills are becoming a standard part of what a platform ships, not an add-on.&lt;/p&gt;

&lt;p&gt;Developers can extend Xcode with custom skills and plug-ins as well. The plug-in system connects outside tools through the Model Context Protocol, which we cover in the standards section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  MAI-Code-1-Flash reaches the Copilot model picker
&lt;/h3&gt;

&lt;p&gt;Microsoft announced its homegrown MAI model family at Build on June 2. This week the models started reaching actual users. &lt;a href="https://developer.microsoft.com/blog/build-recap" rel="noopener noreferrer"&gt;MAI-Code-1-Flash is rolling out to Copilot Free, Student, Pro, Pro+, and Max plans&lt;/a&gt;, starting with a limited set of users and expanding gradually. Developers select it from the model picker in Visual Studio Code.&lt;/p&gt;

&lt;p&gt;The Flash variant targets fast, low-cost coding tasks. &lt;a href="https://www.testingcatalog.com/microsoft-build-2026-recap-from-windows-to-copilot-all-ai/" rel="noopener noreferrer"&gt;Microsoft pitches it above Claude Haiku 4.5 on price-to-performance&lt;/a&gt;. The full MAI-Code-1 model, tuned for GitHub and VS Code, &lt;a href="https://blogs.microsoft.com/blog/2026/06/02/microsoft-build-2026-be-yourself-at-work/" rel="noopener noreferrer"&gt;is now available in Copilot&lt;/a&gt;. Microsoft also committed to distributing MAI models through Fireworks AI, Baseten, and OpenRouter, which signals the company wants these models judged on the open market rather than inside its own products alone.&lt;/p&gt;

&lt;p&gt;The strategic read is simple. Microsoft spent three years reselling OpenAI's models inside GitHub Copilot. Now it owns a coding model, controls its costs, and prices it against the cheapest tier of the competition. Watch the model picker telemetry over the next quarter. If developers stick with MAI-Code-1-Flash for routine tasks, Microsoft's inference bill drops and its bargaining position improves.&lt;/p&gt;

&lt;p&gt;The coding model arrived alongside a reasoning sibling that frames Microsoft's ambition. &lt;a href="https://www.testingcatalog.com/microsoft-build-2026-recap-from-windows-to-copilot-all-ai/" rel="noopener noreferrer"&gt;MAI-Thinking-1 is a 35-billion-parameter reasoning model with a 256K context window&lt;/a&gt; that Microsoft says it built without distillation. The company claims blind raters prefer it to Claude Sonnet 4.6 and that it matches Claude Opus 4.6 on SWE-Bench Pro. It sits in private preview on Azure AI Foundry behind an access request, aimed at enterprise buyers. Treat vendor benchmark claims with the usual caution, but note the posture. Microsoft now publishes head-to-head numbers against the models it resells.&lt;/p&gt;

&lt;h3&gt;
  
  
  Miasma attack hits 13 AI coding tools
&lt;/h3&gt;

&lt;p&gt;The week brought a sharp reminder that AI coding tools are now attack surface. Security firm SafeDep published a teardown of Miasma, a supply chain attack toolkit that &lt;a href="https://aiweekly.co/alerts/miasma-hits-13-ai-coding-tools-hides-c2-in-github" rel="noopener noreferrer"&gt;targets 13 different AI coding tools through config-file injection&lt;/a&gt;. The toolkit hides its command-and-control infrastructure inside GitHub itself rather than on traditional servers, which makes takedowns and IP blocking far less effective.&lt;/p&gt;

&lt;p&gt;The self-replication mechanism is the nasty part. Each compromised account leaks fresh credentials into public commits. The next victim harvests those credentials, and the infection spreads with the developer ecosystem instead of with attacker effort. The attack works because developers now grant elevated trust and full codebase access to AI assistants. Config files for those assistants became a high-value injection point that did not exist at scale two years ago.&lt;/p&gt;

&lt;p&gt;The practical takeaway for data and platform teams: treat agent config files like production code. Review them in pull requests, pin them, and scan them. An agent with repo access and a poisoned config is an insider threat. SafeDep's analysis points to three concrete defenses. Restrict which config files coding agents read on developer machines. Audit GitHub tokens for scope creep, since the toolkit feeds on over-permissioned credentials. And monitor public commits from your organization for secrets, because the attack turns every leak into a new infection vector.&lt;/p&gt;
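
&lt;p&gt;As a rough illustration of the third defense, here is a minimal Python sketch of a secret scan over text headed for a public commit. The two patterns (the &lt;code&gt;ghp_&lt;/code&gt; prefix of classic GitHub personal access tokens and the &lt;code&gt;AKIA&lt;/code&gt; prefix of AWS access key IDs) are illustrative assumptions, and the name &lt;code&gt;find_secrets&lt;/code&gt; is hypothetical. This is not SafeDep's tooling, and a real deployment would likely rely on a dedicated scanner with a much larger rule set.&lt;/p&gt;

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "github_classic_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
}


def find_secrets(text):
    """Return (line_number, rule_name) pairs for lines that look like they hold a secret."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, rule_name))
    return hits


if __name__ == "__main__":
    # Hypothetical agent config content with a fake token embedded.
    sample = "model: local\ntoken: ghp_" + "a" * 36 + "\n"
    for line_number, rule_name in find_secrets(sample):
        print(f"line {line_number}: possible {rule_name}")
```

&lt;p&gt;Run something like this in a pre-commit hook or CI job so a leaked credential is caught before it reaches a public repository.&lt;/p&gt;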

&lt;p&gt;Step back and the coding tools picture for the week is clear. Apple, Microsoft, and JetBrains all shipped or advanced agent-native IDE work inside seven days. The evaluation question for engineering leaders changed with them. The old question asked which assistant writes the best code. The new questions ask where the model runs, what the agent can touch, how its tool access is governed, and how its config surface is secured. Pick tools on those answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: Private Inference Goes Multi-Cloud
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Apple extends Private Cloud Compute to Google Cloud
&lt;/h3&gt;

&lt;p&gt;The biggest infrastructure story of the week came from Cupertino. Apple announced it is &lt;a href="https://security.apple.com/blog/expanding-pcc/" rel="noopener noreferrer"&gt;expanding Private Cloud Compute beyond Apple's own data centers for the first time&lt;/a&gt;. Apple Intelligence workloads now run on Google Cloud, powered by Nvidia GPUs, under Apple's PCC security model.&lt;/p&gt;

&lt;p&gt;The technical stack layers three vendors' silicon-level protections. &lt;a href="https://cryptobriefing.com/nvidia-confidential-computing-apple-google-cloud/" rel="noopener noreferrer"&gt;Nvidia Confidential Computing provides trusted execution environments on Blackwell GPUs, Intel TDX handles CPU-level isolation, and Google contributes its Titan security chip&lt;/a&gt;. Together they create encrypted pathways that block everyone, including Google as the cloud operator, from reading data during processing. Apple keeps full control of the PCC software layer. &lt;a href="https://www.macrumors.com/2026/06/08/apple-private-cloud-compute-google/" rel="noopener noreferrer"&gt;Only cryptographically approved binaries deploy, and Apple maintains a verifiable ledger of every piece of Google Cloud hardware in the PCC fleet&lt;/a&gt; to guard against supply chain tampering.&lt;/p&gt;

&lt;p&gt;The expansion exists to serve a new model tier. Apple introduced AFM Cloud Pro, the largest of the new Apple Foundation Models co-developed with Google on Gemini technology. Apple executives described it as comparable to Google's frontier Gemini models. Agentic tool use and complex reasoning route to this model in the cloud, and the rest stays on device. &lt;a href="https://www.macworld.com/article/3156959/apple-to-use-google-servers-with-nvidia-hardware-for-the-new-siri.html" rel="noopener noreferrer"&gt;Reporting indicates Apple's own PCC hardware ran the new Siri model too slowly in testing&lt;/a&gt;, which pushed the heavy workloads onto Google's Nvidia-equipped infrastructure.&lt;/p&gt;

&lt;p&gt;For data engineers, this is the most instructive confidential computing deployment yet. Apple published the trust model, committed to public inspection of PCC binaries, and shipped attestation across three vendors' hardware. If your organization is designing private inference for regulated data, this architecture is the new reference point.&lt;/p&gt;

&lt;p&gt;The pattern translates directly to data platforms. Agentic analytics puts models in contact with governed tables, customer records, and financial data. The PCC design shows what a defensible answer looks like: attested hardware, a verifiable ledger of every machine in the fleet, signed binaries, and external inspection. Vendors selling "private AI" for the lakehouse now have a public bar to clear. Ask them which of those four properties they actually ship.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-device inference carries the everyday workload
&lt;/h3&gt;

&lt;p&gt;The same week made the opposite point with equal force. The most-used AI feature Apple shipped runs with no cloud at all. Xcode 27's inline completion executes entirely on the Neural Engine, and Apple cut off Intel Macs from both Xcode 27 and macOS Golden Gate because those machines lack the silicon. Server-backed features in iOS 27, including the upgraded Image Playground, carry daily usage limits because they depend on larger cloud models.&lt;/p&gt;

&lt;p&gt;The pattern across the whole WWDC lineup is a deliberate split. Frequent, latency-sensitive, privacy-sensitive tasks run on local silicon. Rare, heavy, agentic tasks run in attested cloud environments. Hardware buyers should plan for both tiers rather than betting on one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intel and Foxconn team up on rack-scale AI systems
&lt;/h3&gt;

&lt;p&gt;Intel kept building its post-Computex momentum. On June 4, Intel and Foxconn announced a partnership to &lt;a href="https://finance.yahoo.com/sectors/technology/articles/intel-foxconn-expand-ai-push-123510968.html" rel="noopener noreferrer"&gt;develop AI chips and rack-scale infrastructure together&lt;/a&gt;. The work spans chips, racks, full systems, and applications. The companies plan rack-scale AI infrastructure built on Intel Xeon processors plus improved interconnect, cooling, and system monitoring.&lt;/p&gt;

&lt;p&gt;The deal matters because the AI buildout has shifted from chips to systems. Power delivery, liquid cooling, and rack integration now gate deployments more than raw FLOPS. Foxconn assembles a huge share of the world's servers, so a tighter Intel-Foxconn loop shortens the path from silicon to installed capacity. It also gives Intel a systems story to tell against Nvidia's vertically integrated racks.&lt;/p&gt;

&lt;p&gt;The announcement extends the rack-scale push Intel started at Computex the week before, where it &lt;a href="https://newsroom.intel.com/artificial-intelligence/intel-announces-new-ai-innovations-at-computex" rel="noopener noreferrer"&gt;paired Xeon processors with SambaNova SN-50 Reconfigurable Dataflow Units for inference and agentic workloads&lt;/a&gt;. The through line across both announcements is inference economics. Training capacity gets the headlines, but agentic workloads run inference all day, every day. The vendors building disaggregated, rack-scale inference systems are betting that serving agents, not training models, becomes the dominant compute bill. For teams budgeting agentic analytics, that bet matches what the workload actually looks like: many small queries, sustained all day, latency-sensitive, and cheaper on purpose-built inference racks than on training-class GPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards &amp;amp; Protocols: MCP Moves Into the Operating System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MCP becomes the IDE's native tongue
&lt;/h3&gt;

&lt;p&gt;The Model Context Protocol crossed a threshold this week. It stopped being a plug-in convention and started becoming part of the platform. Apple ships a binary called mcpbridge in Xcode 27 that &lt;a href="https://www.techtimes.com/articles/318110/20260610/wwdc-2026-day-3-xcode-27-neural-engine-completes-code-without-sending-source-any-server.htm" rel="noopener noreferrer"&gt;translates MCP over XPC into Xcode's live process, turning the IDE into a universal MCP host&lt;/a&gt;. More than 20 tools wire into the Xcode agent through MCP in the first beta. Any MCP-compliant agent can now orchestrate Apple platform development.&lt;/p&gt;

&lt;p&gt;JetBrains is moving the same direction in the same cycle. The IntelliJ IDEA 2026.2 Early Access Program, opened May 27, adds the ability for agents to set breakpoints and logpoints during live debug sessions through MCP, and it exposes more IDE internals through the protocol. Add VS Code's existing MCP surface and the pattern is now consistent across all three major development environments. MCP is the interface between agents and developer tools, and the vendors are building it in natively instead of leaving it to extensions.&lt;/p&gt;

&lt;p&gt;This is what winning looks like for a protocol. The arguments about whether MCP becomes the standard ended. The work now is making each host's MCP surface deep enough to be useful, and that is exactly what Apple and JetBrains shipped this week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Foundation Models lets apps swap AI providers without code changes
&lt;/h3&gt;

&lt;p&gt;Apple's Foundation Models framework grew into something protocol-shaped at WWDC. The framework now exposes a public Swift interface called LanguageModel. Third-party providers implement it to &lt;a href="https://www.techtimes.com/articles/318039/20260609/wwdc-2026-developer-tools-foundation-models-now-swaps-ai-providers-without-code-changes.htm" rel="noopener noreferrer"&gt;expose their cloud models through the same API surface as Apple's on-device models&lt;/a&gt;. Anthropic and Google implement it today. An app written against the protocol switches between Apple's local model, Claude, or Gemini without code changes.&lt;/p&gt;
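
&lt;p&gt;The article's sources do not show the Swift signature of LanguageModel, so the sketch below is only an analogy in Python, not Apple's API. Every name in it (&lt;code&gt;TextModel&lt;/code&gt;, &lt;code&gt;LocalModel&lt;/code&gt;, &lt;code&gt;CloudModel&lt;/code&gt;, &lt;code&gt;summarize&lt;/code&gt;) is hypothetical. It shows the general shape of a provider-agnostic interface: application code depends on one small contract, and providers are swapped behind it.&lt;/p&gt;

```python
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical stand-in for a provider-agnostic model interface."""

    def respond(self, prompt: str) -> str: ...


class LocalModel:
    """Pretend on-device model."""

    def respond(self, prompt: str) -> str:
        return f"[local] {prompt}"


class CloudModel:
    """Pretend cloud model from some third-party provider."""

    def __init__(self, provider: str):
        self.provider = provider

    def respond(self, prompt: str) -> str:
        return f"[{self.provider}] {prompt}"


def summarize(model: TextModel, text: str) -> str:
    # App code depends only on the interface, never on a concrete provider.
    return model.respond(f"Summarize: {text}")


if __name__ == "__main__":
    for model in (LocalModel(), CloudModel("provider-a")):
        print(summarize(model, "release notes"))
```

&lt;p&gt;The call site in &lt;code&gt;summarize&lt;/code&gt; never changes when the provider does, which is the property the paragraph above describes.&lt;/p&gt;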

&lt;p&gt;Two additions sweeten the deal for developers. Dynamic Profiles update model behavior without shipping an app update. And &lt;a href="https://pokde.net/system/software/ai/apple-ai-framework-xcode-27" rel="noopener noreferrer"&gt;developers in the App Store Small Business Program with fewer than 2 million first-time downloads get access to the next-generation Apple Foundation Models on Private Cloud Compute at no cloud API cost&lt;/a&gt;. Free frontier-class inference for small developers is a direct shot at every per-token API business, and it sets a price expectation the rest of the market now has to answer.&lt;/p&gt;

&lt;p&gt;The abstraction matters beyond Apple's ecosystem. Provider-agnostic model interfaces keep appearing at every layer: in IDE model pickers, in gateway products, and now in an OS vendor's first-party SDK. Model lock-in is getting engineered out of the stack, and pricing power shifts toward whoever owns the interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  App Intents becomes the agent surface for apps
&lt;/h3&gt;

&lt;p&gt;Apple also &lt;a href="https://www.techtimes.com/articles/318045/20260609/xcode-27-device-ai-code-completion-uses-neural-engine-skips-cloud-entirely.htm" rel="noopener noreferrer"&gt;replaced SiriKit with App Intents as the way apps expose actions to the assistant&lt;/a&gt;, and the migration clock is now running. App Intents describes what an app can do in a structured, machine-readable way. Siri AI chains those actions across apps with multi-step commands and on-screen awareness.&lt;/p&gt;

&lt;p&gt;Squint and this is the same idea as MCP tools, expressed in Swift. Every platform is converging on the same architecture: a structured catalog of actions, an agent that plans across them, and permissions at the boundary. Teams that already publish clean, well-described actions for one agent surface will find the others cheap to add.&lt;/p&gt;

&lt;p&gt;The consumer side of the same architecture is the new Siri AI, which &lt;a href="https://www.tomsguide.com/news/live/wwdc-2026-live-news-updates" rel="noopener noreferrer"&gt;accepts multi-step chained commands in a single prompt, gains on-screen awareness, and ships as a standalone app&lt;/a&gt;. Siri AI runs on the new Apple Foundation Models built with Google's Gemini technology, and it plans across App Intents the way a coding agent plans across MCP tools. One caveat with real market weight: &lt;a href="https://www.techradar.com/news/live/apple-wwdc-2026-live" rel="noopener noreferrer"&gt;Siri AI does not ship in the European Union with iOS 27&lt;/a&gt;. Hundreds of millions of users sit outside the launch, and app developers who serve EU users need a strategy for an assistant-shaped hole in the platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  The MCP spec clock keeps ticking toward July 28
&lt;/h3&gt;

&lt;p&gt;The protocol's own roadmap stayed on schedule through all this adoption news. The MCP 2026-07-28 release candidate, &lt;a href="https://blog.modelcontextprotocol.io/posts/2026-07-28-release-candidate/" rel="noopener noreferrer"&gt;the largest revision of the protocol since launch&lt;/a&gt;, is in its ten-week validation window right now. The revision delivers a stateless protocol core that scales on ordinary HTTP infrastructure, an Extensions framework, long-running Tasks, server-rendered MCP Apps, hardened authorization aligned with OAuth and OpenID Connect, and a formal deprecation policy.&lt;/p&gt;

&lt;p&gt;The final specification ships on July 28, and Tier 1 SDKs are expected to land support inside the validation window. The release contains breaking changes, so teams running MCP servers in production should test against the release candidate now rather than discover the breaks in August. The stateless core is the piece infrastructure teams have asked for since the protocol launched. It removes the session-state requirement that complicated load balancing and horizontal scaling.&lt;/p&gt;

&lt;p&gt;Put the week together and the protocol story writes itself. The spec is hardening for production at the same moment the biggest platform vendors are compiling it into their operating systems and IDEs. The connective tissue of agentic software is settling into place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Week in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Apple turned its IDE into an agent host, shipped first-party agent skills, and moved its private inference onto Google Cloud with attestation across Nvidia, Intel, and Google silicon. Microsoft pushed its own coding model into the hands of free-tier Copilot users. A supply chain toolkit proved that agent config files are now a serious attack surface. And MCP showed up in three places at once: the Xcode binary, the JetBrains debugger, and a spec release candidate six weeks from final. Agents stopped being a product category this week. They became a platform layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt;: Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-11-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt;: Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-11-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt;: Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-11-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt;: Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt;: A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>news</category>
    </item>
    <item>
      <title>The Best Data Lakehouse Tools for Apache Iceberg in 2026: A Complete Breakdown</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 11 Jun 2026 05:40:25 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/the-best-data-lakehouse-tools-for-apache-iceberg-in-2026-a-complete-breakdown-5fd</link>
      <guid>https://dev.to/alexmercedcoder/the-best-data-lakehouse-tools-for-apache-iceberg-in-2026-a-complete-breakdown-5fd</guid>
      <description>&lt;p&gt;&lt;em&gt;By Alex Merced, Head of Developer Relations at Dremio and author of books on Apache Iceberg, Apache Polaris, and data lakehouses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The question is no longer whether to build on Apache Iceberg. That fight is over. Iceberg Summit 2026, held at the San Francisco Marriott Marquis on April 8 and 9, drew over 600 attendees across two days and more than 70 sessions, and not a single talk tried to convince anyone to adopt Iceberg. Snowflake, Databricks, AWS, Google, and Microsoft all read and write the format, and the open source engines treat it as the default. The interesting decisions have moved up the stack. Once your tables live in Iceberg, which tools do you point at them, and how do you keep them all working on one copy of data instead of fragmenting your estate into silos again?&lt;/p&gt;

&lt;p&gt;This article breaks down thirteen data lakehouse tools and where each one shines or falls short in an Apache Iceberg lakehouse: Dremio, Snowflake, Databricks, Apache Spark, Apache Flink, Apache Fluss, Bauplan, Spice.ai, StarRocks, DuckDB, Apache DataFusion, ClickHouse, and LakeSail. Two ideas run through the entire piece. First, the Iceberg REST catalog is the connective tissue that makes a multi-engine lakehouse real, and treating it as a first-class requirement is what lets you use each of these tools where it is genuinely best. Second, a lakehouse needs a center of gravity: a platform that owns the catalog, the SQL engine, the semantic layer, and the autonomous table maintenance so the rest of the ecosystem can plug in without turning into a science project. That center is where Dremio fits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bapel5pigumsokindy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bapel5pigumsokindy2.png" alt="The Apache Iceberg Lakehouse Ecosystem" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Iceberg REST Catalog Decides Everything
&lt;/h2&gt;

&lt;p&gt;Before getting to individual tools, you need to understand the layer that ties them together. An Iceberg catalog is the top level of the table architecture. It stores the current metadata pointer for every table, performs the atomic swap that turns one snapshot into the next, and acts as the single API boundary between every engine and every byte of data you own. All read and write operations, even from different engines, route through the catalog.&lt;/p&gt;
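
&lt;p&gt;To make the atomic swap concrete, here is a toy in-memory sketch in Python. The names &lt;code&gt;ToyCatalog&lt;/code&gt;, &lt;code&gt;commit&lt;/code&gt;, and &lt;code&gt;current_metadata&lt;/code&gt; are hypothetical and do not come from any Iceberg library. A real catalog persists the pointer in a database or service and performs the comparison there. The idea on display is optimistic concurrency: a writer commits only if the pointer still matches the metadata it started from.&lt;/p&gt;

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the table pointer first."""


class ToyCatalog:
    """Hypothetical in-memory catalog: one metadata pointer per table."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current_metadata(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Compare-and-swap: succeed only if the pointer is still `expected`.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} is no longer at {expected}")
            self._pointers[table] = new


if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.commit("db.orders", None, "v1.metadata.json")

    # Two engines both start from v1; only the first commit wins.
    base = catalog.current_metadata("db.orders")
    catalog.commit("db.orders", base, "v2-engine-a.metadata.json")
    try:
        catalog.commit("db.orders", base, "v2-engine-b.metadata.json")
    except CommitConflict as err:
        print("engine B must retry:", err)
```

&lt;p&gt;The losing writer rereads the current metadata, reapplies its change on top, and tries again. Because every engine goes through the same swap, no engine can silently overwrite another's commit.&lt;/p&gt;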

&lt;p&gt;The breakthrough is the Iceberg REST catalog specification. Instead of every engine implementing a different catalog SDK, the REST protocol gives all of them one standard interface. What started as a convenience layer has become the connective tissue of the open lakehouse: JVM-based or not, any engine can interact with Iceberg tables through a common interface. This is the thing that makes "Spark for ingestion, Snowflake or Trino for queries, DuckDB for local work" an established architecture rather than an aspiration.&lt;/p&gt;

&lt;p&gt;Apache Polaris is the open source, vendor-neutral implementation of that specification. It graduated to a top-level Apache project on February 18, 2026, after incubating since August 2024. In roughly 18 months of incubation, the project shipped six releases (0.9.0 through 1.3.0), closed over 2,800 pull requests, and attracted around 100 contributors. Polaris implements Iceberg's REST API to enable multi-engine interoperability across Apache Doris, Apache Flink, Apache Spark, Dremio, StarRocks, and Trino. It was co-created by Dremio. Dremio engineer Jean-Baptiste Onofré, who shepherded Polaris through incubation, was elected to the Apache Software Foundation Board of Directors at the 2026 Annual Members' Meeting, with the new board taking effect on March 6, 2026.&lt;/p&gt;

&lt;p&gt;Keep this in mind as you read each tool's assessment. The REST catalog is what lets you avoid choosing one engine forever. A tool that speaks REST fluently is a tool you can adopt without regret. A tool that only reads Iceberg through path-based access, or locks its tables behind a proprietary catalog, is a tool that constrains your future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dremio: The Iceberg-Native Center of the Lakehouse
&lt;/h2&gt;

&lt;p&gt;Dremio is the platform built natively on Apache Iceberg, Apache Polaris, and Apache Arrow from the ground up rather than retrofitted onto an existing engine. In April 2026 Dremio shipped general availability of Iceberg V3 support in Dremio Cloud, with full read and write support for the latest specification. As CTO Rahim Bhojani put it, "Most platforms added Iceberg as a feature, but Dremio was built on it from the ground up."&lt;/p&gt;

&lt;p&gt;What separates Dremio in an Iceberg lakehouse is the combination of capabilities that compound on each other. Its query engine is built natively on Apache Arrow, the open columnar standard Dremio co-created, so it processes Iceberg and Parquet data in vectorized batches without converting to a proprietary internal format. This is a real architectural difference: engines that convert Iceberg into an internal format pay a translation tax on every query, and they cannot write their acceleration structures back as open Iceberg data.&lt;/p&gt;

&lt;p&gt;On top of the engine, Dremio adds Autonomous Reflections, which observe query patterns over a rolling window and automatically create, refresh, and retire materializations that accelerate queries from seconds to sub-second with no code changes or manual tuning. These Reflections are themselves Iceberg materializations, not a proprietary cube format. Dremio also runs Iceberg Clustering using Z-order to co-locate data across multiple columns, with two-level pruning that skips data at both the manifest and row-group level, running continuously on petabyte-scale tables without full-table rewrites. Compaction, snapshot expiration, and orphan file cleanup run on policy-based schedules with no manual intervention.&lt;/p&gt;
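
&lt;p&gt;Z-order is easier to picture with a small example. The sketch below is a generic bit-interleaving (Morton code) illustration, not Dremio's implementation, which this article does not describe; the function name &lt;code&gt;z_value&lt;/code&gt; and the 4x4 grid are assumptions made for the demo.&lt;/p&gt;

```python
def z_value(x, y, bits=8):
    """Interleave the bits of two non-negative ints into one Morton code."""
    if 0 > min(x, y) or max(x, y) >= 2 ** bits:
        raise ValueError("values must fit in the given number of bits")
    xb = format(x, "0{}b".format(bits))
    yb = format(y, "0{}b".format(bits))
    # Alternate bits, most significant first: y7 x7 y6 x6 ...
    return int("".join(b + a for a, b in zip(xb, yb)), 2)


# Sort a 4x4 grid of (x, y) rows by Z-value, then cut it into 4-row "files".
grid = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(grid, key=lambda p: z_value(p[0], p[1]))
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]
```

&lt;p&gt;Each resulting file covers a 2x2 block of the grid, so its min and max are tight on both columns at once. That is what lets a filter on either column skip whole files instead of only helping the first sort key.&lt;/p&gt;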

&lt;p&gt;For V3 specifically, Dremio supports the VARIANT type for semi-structured JSON, deletion vectors for faster CDC and streaming workloads, and row-level lineage for regulated industries with no additional tooling.&lt;/p&gt;

&lt;p&gt;The catalog story is where Dremio's positioning becomes clear. Dremio Open Catalog is managed Apache Polaris, provisioned the moment you start, giving you RBAC, credential vending, and the full Iceberg REST spec without operating a JVM service. Dremio extends it with fine-grained access control through UDFs for row-level security and column masking that travel with the data across every access path, not just inside one engine. Crucially, Dremio charges only for compute run through Dremio itself, so you can use external engines against the same catalog freely. Among the commercial catalog options on the market, that combination is rare: full Iceberg REST compatibility, native read and write access from any engine, and automated optimization, all without locking you into one compute layer.&lt;/p&gt;

&lt;p&gt;Dremio also brings federation. Its query engine connects PostgreSQL, Snowflake, BigQuery, Glue, and Unity Catalog into the same governed namespace, so you can query data where it lives and gradually move it into Iceberg without a migration day. Add the AI Semantic Layer and built-in MCP-based agent connectivity, and Dremio becomes the layer that human analysts and AI agents both query through one consistent interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake: A Warehouse That Learned to Open Up
&lt;/h2&gt;

&lt;p&gt;Snowflake spent its first decade as a closed, high-performance cloud data warehouse, and it has spent the last few years opening to Iceberg with real commitment. At Snowflake Summit 2026 in early June, the catalog news sat at the center of the keynote. Snowflake made Apache Iceberg V3 support generally available, claiming the broadest V3 feature coverage of any platform, including deletion vectors, row lineage, the VARIANT type, and default values. Snowflake Storage for Apache Iceberg Tables also reached general availability, letting organizations keep a single live, governed copy of data across Snowflake and external lakes.&lt;/p&gt;

&lt;p&gt;The interoperability mechanics are worth understanding. Snowflake Open Catalog is its managed Apache Polaris service, generally available and free today, with pay-per-request billing planned for later in 2026. Separately, Horizon Catalog now exposes the Iceberg REST API (the Horizon Iceberg REST Catalog API) so external engines like Apache Spark can read Snowflake-managed Iceberg tables, with that capability reaching general availability in February 2026. Horizon Catalog, powered by Apache Polaris, supports bi-directional read and write access from external engines. API requests are billed at 0.5 credits per million calls, charged as Cloud Services, with billing scheduled to begin in mid-2026.&lt;/p&gt;
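
&lt;p&gt;The per-request rate is simple to turn into a budget estimate. The arithmetic below uses only the 0.5-credits-per-million-calls figure quoted above; the monthly call volume is a made-up number, and the price of a credit is deliberately left out because it depends on your contract.&lt;/p&gt;

```python
CREDITS_PER_MILLION_CALLS = 0.5  # rate quoted in the paragraph above


def rest_api_credits(calls):
    """Credits consumed by a number of REST catalog API calls."""
    return calls / 1_000_000 * CREDITS_PER_MILLION_CALLS


# Hypothetical workload: 40 million catalog calls in a month.
monthly_credits = rest_api_credits(40_000_000)
print(monthly_credits)  # 20.0
```

&lt;p&gt;Chatty clients that poll the catalog for every query plan are the ones that will notice this line item, so request volume is worth measuring before billing starts.&lt;/p&gt;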

&lt;p&gt;Where Snowflake shines: if you already run Snowflake, you now get governed Iceberg tables that other engines can reach, plus a strong governance and semantic story through Horizon Context. The performance on Snowflake-native workloads remains excellent.&lt;/p&gt;

&lt;p&gt;Where it falls short in an Iceberg lakehouse context: Snowflake's query engine is proprietary and converts Iceberg into its internal representation rather than processing it natively, and its query acceleration features are built for Snowflake's own format, not for your open Iceberg tables. For external-catalog tables, Snowflake is largely a reader. And the per-request billing model on the REST catalog is a cost dimension you do not face with a self-managed Polaris or with Dremio's compute-only pricing. Snowflake is a strong consumer and an increasingly capable producer of Iceberg, but it pulls toward keeping the center of gravity inside Snowflake. In a multi-engine lakehouse, treat it as one powerful engine among several, connected through the REST catalog, rather than the hub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks: Full Iceberg Support, Delta Heritage
&lt;/h2&gt;

&lt;p&gt;Databricks built its business on Delta Lake and Spark, and in 2026 it has made a serious move to embrace Iceberg as a first-class format inside Unity Catalog. The company announced general availability of Managed Iceberg, Iceberg V3, and Foreign Iceberg. Managed Iceberg lets you create, read, write, optimize, govern, and share Iceberg tables directly in Unity Catalog, with Predictive Optimization and Liquid Clustering automating performance tuning. Unity Catalog provides an implementation of the Iceberg REST Catalog API at the endpoint /api/2.1/unity-catalog/iceberg-rest, so Spark, Flink, Trino, and other REST clients can read and write managed Iceberg tables, with credential vending for scoped storage access.&lt;/p&gt;
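
&lt;p&gt;For a REST client, pointing at Unity Catalog mostly comes down to assembling the right URI. The sketch below builds a property map in the style many Iceberg REST clients accept (&lt;code&gt;uri&lt;/code&gt;, &lt;code&gt;warehouse&lt;/code&gt;, &lt;code&gt;token&lt;/code&gt;). Only the endpoint path comes from the paragraph above; the workspace host, the catalog name, the token handling, and the exact property names are assumptions to check against your client's documentation.&lt;/p&gt;

```python
ICEBERG_REST_PATH = "/api/2.1/unity-catalog/iceberg-rest"


def unity_rest_properties(workspace_url, uc_catalog, token):
    """Assemble REST-catalog client properties for a Unity Catalog endpoint."""
    return {
        "type": "rest",
        "uri": workspace_url.rstrip("/") + ICEBERG_REST_PATH,
        "warehouse": uc_catalog,  # assumption: the Unity Catalog catalog name
        "token": token,           # assumption: a bearer token for the workspace
    }


props = unity_rest_properties(
    "https://example-workspace.cloud.databricks.com/",  # hypothetical host
    "main",
    "REDACTED",
)
```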

&lt;p&gt;Foreign Iceberg, now GA, lets Unity Catalog govern Iceberg tables managed in other catalogs such as AWS Glue, Snowflake Horizon, and Hive Metastore, positioning Unity Catalog as a single pane of glass. Databricks also enables Iceberg reads on Delta tables (formerly UniForm) so a single copy of data files can serve both formats, and it has made Iceberg a first-class citizen in Delta Sharing.&lt;/p&gt;

&lt;p&gt;Where Databricks shines: for teams already on Databricks, this is a genuine path to open tables, with strong AI-driven optimization and a mature governance layer. The Iceberg REST Catalog API support means external engines including Dremio, Trino, DuckDB, Daft, and Spark can reach managed tables.&lt;/p&gt;

&lt;p&gt;Where it falls short: Databricks remains Delta-optimized at its core, and its engine treats Iceberg through a compatibility lens rather than as the native format. Foreign Iceberg tables are read-only inside Databricks, and several governance and access-control capabilities are still in beta. The proprietary query acceleration is built for Databricks, not for portable Iceberg materializations. As with Snowflake, the gravitational pull is toward keeping workloads inside the platform. It is a strong Iceberg participant, but its DNA is Delta and Spark, and that shows in where the optimizations are pointed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Spark: The Workhorse for Ingestion and Heavy ETL
&lt;/h2&gt;

&lt;p&gt;Apache Spark is the most battle-tested distributed processing engine in the ecosystem, and it has the deepest Iceberg integration of any compute framework. Iceberg ships a dedicated Spark runtime for each supported version, and the 1.11.0 release of Iceberg in May 2026 added support for Apache Spark 4.1 and made it a default build target. Spark 4.0 brought VARIANT data type support that pairs with Iceberg V3, ANSI SQL compliance, and SQL scripting. On Amazon EMR, Spark 4.0.2 reached general availability in May 2026 with Iceberg V3 support, fine-grained access control, and VARIANT types.&lt;/p&gt;

&lt;p&gt;Spark's Iceberg connector supports the full range of DML, including MERGE INTO with automatic schema evolution in the newer syntax, row-level operations, and the latest V3 features like deletion vectors and row lineage. For large batch transformations, backfills, and the kind of heavy lifting that builds and maintains big Iceberg tables, Spark is the default choice.&lt;/p&gt;

&lt;p&gt;Where Spark shines: ingestion at scale, complex multi-stage ETL, and as a writer that produces clean Iceberg tables any other engine can then read through the REST catalog. It connects to Polaris, Unity Catalog, Snowflake Horizon, Glue, and any REST catalog.&lt;/p&gt;

&lt;p&gt;Where it falls short: Spark is not an interactive, low-latency query engine. It carries JVM overhead, cluster startup time, and operational complexity. Pointing Spark at customer-facing dashboards is using the wrong tool. The guidance from the field is direct: instead of trying to make low-latency queries work with Spark, use an engine suited to that task and let Spark do the batch work. In an Iceberg lakehouse, Spark is the muscle for writing and transforming, while interactive serving belongs elsewhere. The REST catalog is exactly what lets you split those roles cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Flink: Streaming Into the Lakehouse
&lt;/h2&gt;

&lt;p&gt;Apache Flink is the engine for stateful stream processing, and its Iceberg integration has matured into a serious real-time ingestion path. Flink's checkpoint mechanism provides exactly-once delivery guarantees to Iceberg, and it supports changelog streams for CDC patterns, writing inserts, updates, and deletes as Iceberg data and delete files. Iceberg 1.11.0 added initial Apache Flink 2.1 support.&lt;/p&gt;

&lt;p&gt;The standout is the Dynamic Iceberg Sink, which reached production-ready status with Iceberg 1.10.0 and works with Flink 1.20, 2.0, and 2.1. It breaks the old one-sink-per-table model: a single sink routes each record to a table chosen at runtime, creating tables on demand and evolving their schemas and partition specs on the fly without a job restart. For a team ingesting dozens of Kafka topics into dozens of Iceberg tables where new fields appear regularly, this is a major reduction in operational burden.&lt;/p&gt;
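
&lt;p&gt;The routing idea can be shown without Flink at all. The toy below is not the Dynamic Iceberg Sink API; &lt;code&gt;ToyDynamicSink&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;, and the event shapes are hypothetical stand-ins that only illustrate one sink choosing a table per record, creating it on first sight, and widening its schema when a new field shows up.&lt;/p&gt;

```python
class ToyDynamicSink:
    """Hypothetical stand-in: one sink, many tables, schemas grow on demand."""

    def __init__(self):
        self.tables = {}  # table name to {"columns": [...], "rows": [...]}

    def write(self, record, table_name):
        # Create the table the first time a record is routed to it.
        table = self.tables.setdefault(table_name, {"columns": [], "rows": []})
        # Schema evolution: append any column the table has not seen yet.
        for column in record:
            if column not in table["columns"]:
                table["columns"].append(column)
        table["rows"].append(dict(record))


def route(event):
    # Routing rule evaluated per record at runtime (here: by topic name).
    return "raw." + event["topic"].replace("-", "_")


sink = ToyDynamicSink()
events = [
    {"topic": "orders", "payload": {"id": 1, "amount": 9.5}},
    {"topic": "page-views", "payload": {"url": "/home"}},
    {"topic": "orders", "payload": {"id": 2, "amount": 3.0, "coupon": "SPRING"}},
]
for event in events:
    sink.write(event["payload"], route(event))
```

&lt;p&gt;After the loop, two tables exist and &lt;code&gt;raw.orders&lt;/code&gt; has gained a &lt;code&gt;coupon&lt;/code&gt; column, all without restarting anything. In the real sink those steps are catalog operations rather than dictionary updates.&lt;/p&gt;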

&lt;p&gt;Where Flink shines: low-latency streaming with exactly-once semantics, CDC into Iceberg, and stateful processing like windowed aggregations and stream joins. It is the right tool when you need stateful stream processing, not just event delivery.&lt;/p&gt;

&lt;p&gt;Where it falls short: Flink is heavy infrastructure for simple topic-to-table ingestion, where a Kafka Connect sink may be enough. More importantly, the most common mistake in streaming Iceberg architectures is deploying the stream processor without a compaction service. Flink's frequent checkpoints produce many small files, and without compaction running alongside, query performance degrades within days. This is precisely where a platform with autonomous compaction (Dremio, or a managed catalog) complements Flink: Flink writes the stream, and the lakehouse keeps the tables query-ready. Through the REST catalog, the Flink-written tables are immediately available to every other engine.&lt;/p&gt;
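
&lt;p&gt;The small-file problem and its fix are mostly arithmetic. The sketch below is a simple greedy grouping, assuming one checkpoint per minute that leaves a roughly 4 MB file and a 128 MB target; both numbers and the function name &lt;code&gt;plan_compaction&lt;/code&gt; are illustrative, and real compaction planners weigh more factors such as partitions and delete files.&lt;/p&gt;

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into rewrite batches near the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current batch once adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# One hour of one-minute checkpoints, each leaving a ~4 MB file.
small_files = [4] * 60
plan = plan_compaction(small_files)
files_before, files_after = len(small_files), len(plan)
```

&lt;p&gt;Sixty files collapse into two rewrite batches for this hour alone. Multiply that by every partition and every hour of the day and it becomes clear why query planning slows down when nothing is compacting behind the stream.&lt;/p&gt;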

&lt;h2&gt;
  
  
  Apache Fluss: Streaming Storage Built for Iceberg
&lt;/h2&gt;

&lt;p&gt;Apache Fluss is one of the newest entries, an incubating Apache project that acts as a streaming storage layer purpose-built for real-time analytics and the lakehouse. Named from the German word for river, Fluss targets the gap between Kafka-style streaming storage and Iceberg-style analytical storage. Its 0.8 release added full support for Apache Iceberg through the Streaming Lakehouse for Apache Iceberg (FIP-3), which positions Fluss as the real-time ingestion and storage layer that writes fresh data and updates into Iceberg with guaranteed ordering and exactly-once semantics.&lt;/p&gt;

&lt;p&gt;The architecture is genuinely interesting. Fluss uses zero-copy tiering to Iceberg: recent data stays on Fluss servers using the Kafka replication protocol for durability, then tiers to Iceberg for long-term storage, resulting in one copy of data rather than the two copies you get with approaches that materialize Kafka topics into a separate Iceberg table. A stateless tiering service moves data from hot to cold based on configured freshness, and query engines see a single table spanning both tiers. Fluss 0.8 also introduced Delta Joins with Flink, externalizing join state into Fluss tables to cut state size and checkpoint durations dramatically.&lt;/p&gt;
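
&lt;p&gt;The "single table spanning both tiers" idea can be sketched in a few lines. This toy is not Fluss code; &lt;code&gt;ToyTieredTable&lt;/code&gt;, the timestamps, and the 60-second freshness setting are hypothetical, and it only shows records moving from hot to cold exactly once while readers keep seeing one continuous table.&lt;/p&gt;

```python
class ToyTieredTable:
    """Hypothetical two-tier table: hot log plus cold (lakehouse) storage."""

    def __init__(self):
        self.hot = []   # recent records, kept in arrival order
        self.cold = []  # tiered records, conceptually Iceberg data files

    def append(self, timestamp, value):
        self.hot.append((timestamp, value))

    def tier(self, now, freshness_seconds):
        # Move records older than the freshness window; nothing is duplicated.
        cutoff = now - freshness_seconds
        moving = [r for r in self.hot if cutoff >= r[0]]
        self.hot = [r for r in self.hot if r[0] > cutoff]
        self.cold.extend(moving)

    def scan(self):
        # Readers see one table: cold history followed by the hot tail.
        return self.cold + self.hot


table = ToyTieredTable()
for ts, value in [(0, "a"), (60, "b"), (120, "c"), (170, "d")]:
    table.append(ts, value)
table.tier(now=180, freshness_seconds=60)
```

&lt;p&gt;After tiering, three records live in the cold tier and one in the hot tier, yet &lt;code&gt;scan()&lt;/code&gt; still returns all four in order and no record exists twice.&lt;/p&gt;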

&lt;p&gt;Where Fluss shines: unifying streaming and lakehouse storage in a Kappa-style architecture, eliminating the dual-write problem, and giving Iceberg a real-time front door. It deliberately optimizes its Iceberg tables for streaming bootstrap efficiency by maintaining stream order in the table.&lt;/p&gt;

&lt;p&gt;Where it falls short: Fluss is young and still incubating, so production maturity and ecosystem support are limited compared to established tools. It is tightly coupled to Flink, and its stream-ordered Iceberg layout is optimized for streaming consumption rather than the value-ordered layout that batch query pruning prefers. It is a specialized storage layer, not a query engine or a catalog. For organizations whose central pain is real-time ingestion into Iceberg, Fluss is worth watching closely, but it complements rather than replaces the query and governance layers of the lakehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bauplan: Python-First Pipelines With Git-for-Data
&lt;/h2&gt;

&lt;p&gt;Bauplan is a serverless, Python-first lakehouse platform that treats data pipelines the way software engineers treat code: isolated, modular, testable functions with clean inputs and outputs. You write pure Python (or SQL), and Bauplan handles execution, isolation, and scaling with no Spark clusters or JVM to manage. Under the hood it stores everything as Apache Iceberg tables in your own S3, uses Arrow Flight to move data efficiently between functions, and integrates DuckDB as an execution engine with a custom Iceberg scan operator.&lt;/p&gt;

&lt;p&gt;Its signature feature is Git-for-data: zero-copy branching, commits, and merges built on Iceberg's metadata model. You can branch your data, run a pipeline on the branch, validate it, and merge only when tests pass, using a small set of APIs (branch, query, run, commit, merge). This is increasingly aimed at AI agents that need to build and test on production data safely.&lt;/p&gt;
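
&lt;p&gt;Why branching can be zero-copy is easiest to see in a toy model. The sketch below is not the Bauplan SDK; &lt;code&gt;ToyVersionedCatalog&lt;/code&gt; and its methods are hypothetical, and the merge is deliberately naive with no conflict detection. It shows that a branch is only a copy of table-to-snapshot pointers, never of data files.&lt;/p&gt;

```python
class ToyVersionedCatalog:
    """Hypothetical Git-style catalog: a branch maps table name to snapshot id."""

    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        # Zero-copy: only the pointer map is copied, never the data files.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Naive merge: the source branch's pointers win. No conflict checks.
        self.branches[target].update(self.branches[source])


catalog = ToyVersionedCatalog()
catalog.commit("main", "orders", "snap-1")

catalog.branch("dev")
catalog.commit("dev", "orders", "snap-2")    # pipeline run on the branch
main_during_test = catalog.branches["main"]["orders"]

tests_passed = True                          # stand-in for real validation
if tests_passed:
    catalog.merge("dev")
main_after_merge = catalog.branches["main"]["orders"]
```

&lt;p&gt;Production readers on &lt;code&gt;main&lt;/code&gt; keep seeing &lt;code&gt;snap-1&lt;/code&gt; until the merge, which is what makes it safe to let a pipeline, or an agent, experiment against real data.&lt;/p&gt;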

&lt;p&gt;The catalog architecture matters for interoperability, and here Bauplan made the right choice. Bauplan's catalog is built on Project Nessie, the open-source Git-semantics catalog. Critically, Bauplan exposes a standards-compliant Iceberg REST catalog endpoint that external engines attach to directly. As Bauplan's documentation states, it "writes tables in Apache Iceberg format and exposes a standards-compliant Iceberg REST catalog. External engines discover tables through that catalog and read files directly from your bucket with their own cloud credentials." Snowflake, Databricks Unity Catalog, BigQuery, Trino, Athena, Spark, and DuckDB can all connect.&lt;/p&gt;

&lt;p&gt;Where Bauplan shines: developer experience for Python-native pipelines, reproducibility by design, and safe agentic workflows on Iceberg with no infrastructure to run. Because the output is open Iceberg tables in your own bucket, there is no lock-in.&lt;/p&gt;

&lt;p&gt;Where it falls short: branching is the wrinkle for interoperability. The standard Iceberg REST spec has no native concept of Git-style branches; Bauplan bridges this by mapping each branch onto a distinct REST catalog URL path, and a standard REST client sees only one branch at a time as ordinary Iceberg tables. The full branch and merge workflow runs through Bauplan's own SDK, and external writes through the REST path are constrained (Snowflake treats externally cataloged tables as read-only). Bauplan is a younger, smaller platform focused on the pipeline-building slice of the lakehouse rather than interactive serving or enterprise governance. It is an excellent producer and transformer that, through its REST endpoint, feeds the broader Iceberg estate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spice.ai: An Acceleration Layer and AI Compute Engine
&lt;/h2&gt;

&lt;p&gt;Spice.ai (Spice.ai OSS) is a portable SQL query, search, and LLM-inference engine written in Rust, built on Apache DataFusion, Apache Arrow, Arrow Flight, DuckDB, and the Iceberg Rust implementation. It ships as a single binary and is designed to run as a sidecar next to your application, federating across data sources and accelerating hot data locally. Spice v2.0-stable shipped on June 5, 2026, adding multi-node distributed query on Apache Ballista and the Spice Cayenne accelerator built on the Vortex columnar format.&lt;/p&gt;

&lt;p&gt;For Iceberg, Spice connects to REST, AWS Glue, or Hadoop catalogs and registers every table for SQL access automatically. Its key capability is acceleration: it can materialize a hot working set from Iceberg into a faster local engine (DuckDB, SQLite, or Cayenne) with configurable refresh, while Iceberg remains the source of truth. Spice supports INSERT with full ACID guarantees via Iceberg's transaction protocol, plus CREATE TABLE, DROP TABLE, and MERGE INTO for Iceberg catalogs, and it exposes a unified Iceberg REST Catalog API for query and write. It runs on iceberg-rust and supports inserting into partitioned tables and handling position and equality delete files.&lt;/p&gt;

&lt;p&gt;Where Spice shines: sub-millisecond reads on frequently queried Iceberg subsets, grounding AI agents and RAG applications in real data through built-in search and an OpenAI-compatible gateway, and edge-to-cloud deployment as a lightweight sidecar. It is an acceleration and AI-serving layer that sits on top of Iceberg, not a replacement for the lakehouse.&lt;/p&gt;

&lt;p&gt;Where it falls short: Spice is application-focused, deployed at the app or agent level rather than as a centralized data platform, so it is not your governance plane or your primary write path for heavy ETL. Its acceleration is a cache that needs refresh logic. For the right use case, grounding an AI application in governed Iceberg data with local speed, it is excellent, and because it speaks the REST catalog, it slots cleanly into a Dremio-centered or Polaris-centered lakehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  StarRocks: Sub-Second OLAP on Iceberg
&lt;/h2&gt;

&lt;p&gt;StarRocks is a high-performance, massively parallel OLAP engine that has become a favorite for low-latency, high-concurrency analytics directly on Iceberg. Through its multi-catalog architecture, StarRocks queries Iceberg tables in S3, HDFS, or other storage without ingestion, using a vectorized, SIMD-optimized execution engine written in C++. It connects to Hive Metastore, AWS Glue, and REST catalogs (including Nessie and Polaris) with credential vending support.&lt;/p&gt;

&lt;p&gt;On reads, StarRocks handles the full Iceberg optimization stack: manifest pruning, file-level data skipping from column statistics, Parquet row-group pruning, column projection, and late materialization, plus merge-on-read of position and equality delete files. Its cost-based optimizer and async materialized views make it especially fast for aggregation-heavy and join-heavy queries. StarRocks 4.0 pushed performance into the data layer, adding a global shuffle mechanism that produces fewer, larger files during writes: in the project's own tests, file count dropped from over 170,000 to 259 for a 100-partition workload, and ingestion latency fell by more than half.&lt;/p&gt;
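
&lt;p&gt;File-level data skipping from column statistics is the piece of that stack with the biggest payoff, and it is simple to illustrate. The sketch below is generic and not StarRocks code; the file names, the &lt;code&gt;min_day&lt;/code&gt; and &lt;code&gt;max_day&lt;/code&gt; fields, and the filter are made up to show how min/max ranges decide which files are opened at all.&lt;/p&gt;

```python
# Per-file column statistics, as an Iceberg manifest might record them.
# File names and values are made up for illustration.
files = [
    {"path": "orders-00.parquet", "min_day": "2026-01-01", "max_day": "2026-01-31"},
    {"path": "orders-01.parquet", "min_day": "2026-02-01", "max_day": "2026-02-28"},
    {"path": "orders-02.parquet", "min_day": "2026-03-01", "max_day": "2026-03-31"},
]


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter [lo, hi]."""
    return [
        f["path"]
        for f in files
        if hi >= f["min_day"] and f["max_day"] >= lo
    ]


# WHERE day BETWEEN '2026-02-10' AND '2026-02-20'
selected = files_to_scan(files, "2026-02-10", "2026-02-20")
```

&lt;p&gt;Only one of the three files survives the check, and the engine never reads a byte of the other two. The same overlap test repeats at the manifest level and again at the Parquet row-group level.&lt;/p&gt;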

&lt;p&gt;Where StarRocks shines: customer-facing dashboards, real-time BI, and API-served analytics where sub-second response and high concurrency matter. As one model architecture, real-time ingestion feeds StarRocks native tables for the hottest dashboards while Iceberg serves as the historical store, with StarRocks federating across both.&lt;/p&gt;

&lt;p&gt;Where it falls short: on Iceberg, StarRocks writes are copy-on-write only (partition overwrite) and never produce equality-delete files, and current versions lack UPDATE, DELETE, and MERGE on Iceberg, so it is primarily a query layer for data written by Spark, Flink, Dremio, or others. It has no native streaming ingestion into Iceberg. Running it well in production means operating frontend (FE) and backend (BE) nodes and tuning caches. As a blazing-fast query engine on shared Iceberg tables it is outstanding, and the REST catalog is what lets it read the same tables your writers produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckDB: The Local Lakehouse Powerhouse
&lt;/h2&gt;

&lt;p&gt;DuckDB is the in-process analytical database often described as SQLite for analytics, and in 2026 it became a genuine Iceberg participant. The DuckDB-Iceberg extension added full read support and initial write support in v1.4.0, then delete and update support for V2 tables in v1.4.2, and by v1.5.3 it added MERGE INTO, ALTER TABLE for schema evolution, partition transforms, and V3 support including binary deletion vectors written as Puffin files and the VARIANT type. The extension attaches to Iceberg REST catalogs such as Apache Polaris and Lakekeeper using OAuth2 secrets, with the catalog as the authority for commits and discovery.&lt;/p&gt;

&lt;p&gt;Most striking, DuckDB-Wasm shipped the Iceberg extension by December 2025, making DuckDB the first end-to-end interface to Iceberg REST catalogs from within a browser tab, reading and writing tables with no backend server. A demo querying Amazon S3 Tables from the browser ran at AWS re:Invent 2025.&lt;/p&gt;

&lt;p&gt;Where DuckDB shines: local and embedded analytics on Iceberg, fast iteration, notebook and laptop workflows, and lightweight serverless patterns. Through the REST catalog it reads and writes the same governed tables as the big engines, then hands results to other Arrow-native tools with zero-copy.&lt;/p&gt;

&lt;p&gt;Where it falls short: DuckDB is single-node and in-process, not a distributed engine for petabyte-scale jobs or high-concurrency serving. Its write path has real limits: UPDATE and DELETE work only on unpartitioned, unsorted tables, it writes positional deletes rather than copy-on-write, and reading from REST catalogs backed by non-S3 storage is not yet supported. Real-world write attempts against some managed REST catalogs have hit rough edges. DuckDB is a superb local client in the Iceberg ecosystem, best paired with a central catalog and a serving engine rather than asked to be the whole platform.&lt;/p&gt;
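
&lt;p&gt;The difference between positional deletes and copy-on-write is worth seeing side by side. The sketch below is a generic illustration, not DuckDB's implementation; the file name, rows, and function names are hypothetical. Both strategies give readers the same rows but put the cost in different places.&lt;/p&gt;

```python
data_file = {"path": "part-0.parquet", "rows": ["a", "b", "c", "d"]}


def delete_merge_on_read(position_deletes, path, positions):
    # Write side: record which row positions are dead; the data file is untouched.
    position_deletes.setdefault(path, set()).update(positions)


def read_merge_on_read(data_file, position_deletes):
    # Read side: skip any position listed in the delete set.
    dead = position_deletes.get(data_file["path"], set())
    return [row for pos, row in enumerate(data_file["rows"]) if pos not in dead]


def delete_copy_on_write(data_file, positions):
    # Write side: rewrite the whole file without the deleted rows.
    kept = [row for pos, row in enumerate(data_file["rows"]) if pos not in positions]
    return {"path": "part-0-rewritten.parquet", "rows": kept}


deletes = {}
delete_merge_on_read(deletes, data_file["path"], {1})
mor_rows = read_merge_on_read(data_file, deletes)
cow_rows = delete_copy_on_write(data_file, {1})["rows"]
```

&lt;p&gt;Positional deletes make the write cheap and push work onto every later read until compaction folds the deletes in, while copy-on-write pays the full rewrite up front. That trade-off is why a delete-heavy local workflow still benefits from a platform that compacts the table afterwards.&lt;/p&gt;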

&lt;h2&gt;
  
  
  Apache DataFusion: The Engine Builder's Engine
&lt;/h2&gt;

&lt;p&gt;Apache DataFusion is not an end-user product; it is a very fast, extensible query engine written in Rust on Apache Arrow, designed to be embedded as the foundation of new data systems. It originated in the Apache Arrow project and offers SQL and DataFrame APIs, vectorized multithreaded execution, and many extension points. Its TableProvider trait is how Iceberg, Delta Lake, and custom sources plug in, and the Iceberg Rust implementation provides IcebergTableProvider (catalog-backed, with metadata refresh for writes) and IcebergStaticTableProvider (read-only snapshot access).&lt;/p&gt;

&lt;p&gt;DataFusion's importance to the Iceberg world is that it is the engine inside the engines. Spice.ai, LakeSail, dbt's Fusion engine, and RisingWave all build on it. When RisingWave replaced its batch engine with DataFusion, the win came partly because DataFusion's native in-memory format is Arrow RecordBatch, the same format the Iceberg Rust SDK produces, eliminating costly row-by-row format conversion at scale. For most new data platform projects in 2026, DataFusion is the default foundation in Rust.&lt;/p&gt;

&lt;p&gt;Where DataFusion shines: as a building block for anyone constructing a new Iceberg-aware tool, offering Arrow-native speed, trait-based extensibility, and Substrait support for passing query plans across engines. Its Iceberg integration through iceberg-rust is improving steadily.&lt;/p&gt;

&lt;p&gt;Where it falls short: it is a library, not a deployable lakehouse. There is no catalog, no governance, no semantic layer, no autonomous maintenance. Iceberg write support through the Rust ecosystem is still maturing. You would not hand DataFusion to a business analyst. For the reader evaluating tools, the practical takeaway is to understand what your tools are built on, because knowing a product uses DataFusion tells you about its extensibility and interoperability ceiling. In an Iceberg lakehouse, DataFusion is the substrate beneath several other entries on this list, not a direct competitor to them.&lt;/p&gt;

&lt;h2&gt;
  
  
  ClickHouse: Fast Analytics, Maturing Iceberg
&lt;/h2&gt;

&lt;p&gt;ClickHouse is famous for sub-second queries on billions of rows, and over 2024 and 2025 it built out genuine Iceberg integration. Its DataLakeCatalog database engine connects to external catalogs and auto-discovers all tables, with support for REST catalogs and Apache Polaris since 24.12, AWS Glue and Databricks Unity Catalog since 25.3, Hive Metastore since 25.5, and Microsoft OneLake since 25.11. It handles schema evolution, partition pruning, and time travel by snapshot or timestamp (since 25.4), and a native Parquet reader added in 25.8 improved Parquet query speed by 1.8x on average across ClickBench. Write support landed with INSERT in 25.7 and expanded through 25.9 to ALTER UPDATE, ALTER DELETE, and distributed writes against REST and Glue catalogs, and the 26.2 release declared Iceberg INSERTs production ready. A new integration with Google's Lakehouse Runtime Catalog via the Iceberg REST API debuted in beta in ClickHouse 26.2.&lt;/p&gt;
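
&lt;p&gt;Time travel by timestamp boils down to a lookup in the table's snapshot log. The sketch below is a generic illustration rather than ClickHouse code; the log entries and the function name &lt;code&gt;snapshot_as_of&lt;/code&gt; are made up, and it assumes the log is sorted by commit time.&lt;/p&gt;

```python
from bisect import bisect_right

# Snapshot log ordered by commit time: (timestamp_ms, snapshot_id). Made-up values.
snapshot_log = [(1000, 11), (2000, 22), (3000, 33)]


def snapshot_as_of(log, timestamp_ms):
    """Return the id of the latest snapshot committed at or before the timestamp."""
    times = [ts for ts, _ in log]
    index = bisect_right(times, timestamp_ms)
    if index == 0:
        raise ValueError("no snapshot exists at or before this timestamp")
    return log[index - 1][1]


as_of_2500 = snapshot_as_of(snapshot_log, 2500)
as_of_exact = snapshot_as_of(snapshot_log, 2000)
```

&lt;p&gt;A query as of timestamp 2500 reads snapshot 22, the last one committed before that moment. Time travel by snapshot id skips the lookup and goes straight to the named snapshot.&lt;/p&gt;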

&lt;p&gt;Where ClickHouse shines: extremely fast queries on log, event, and high-cardinality time-series data, and a flexible, cloud-agnostic, catalog-agnostic engine that can federate across multiple catalogs and JOIN between them in one SQL statement. For hot-and-cold tiering, where ClickHouse native storage serves the hottest queries and Iceberg holds history, it is a strong fit.&lt;/p&gt;

&lt;p&gt;Where it falls short: ClickHouse's deepest performance comes from its own MergeTree format, and queries on Iceberg are slower than on native data because the internal storage and engine are tightly co-optimized. Write support, while advancing fast, is newer and less proven than read support; the Altinity team, building Project Antalya, noted that ClickHouse writes to Iceberg remained immature for many special cases and recommended Java Iceberg libraries for bulk loading. V3 features like deletion vectors were still being worked toward. ClickHouse is an excellent query engine to attach to your Iceberg tables through the REST catalog, especially for event analytics, but it is a specialized speed layer rather than the governed center of the lakehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  LakeSail: A Rust-Native Spark Replacement
&lt;/h2&gt;

&lt;p&gt;LakeSail's Sail is a drop-in Apache Spark replacement written in Rust, compatible with the Spark Connect protocol so existing PySpark and Spark SQL workloads run unchanged. It is built on Apache DataFusion and Apache Arrow, has no JVM overhead, and per the lakehq/sail GitHub README is "~4× faster (up to 8× in specific workloads) than Spark and 94% cheaper on infrastructure costs," measured on a derived TPC-H benchmark. Sail 0.4 added native Apache Iceberg support, built on the iceberg-rust specification and utilities, with native compatibility with the Iceberg REST Catalog supported by services such as Apache Polaris and Cloudflare R2 Data Catalog. It integrates with the Iceberg REST Catalog, AWS Glue, Unity Catalog, Hive Metastore, and Microsoft OneLake.&lt;/p&gt;

&lt;p&gt;Where LakeSail shines: teams running large Spark workloads who want the same API at lower cost and faster startup, with native Iceberg and Delta support and an MCP server for AI agents. Because it keeps the Spark interface but rebuilds the engine in Rust, migration is meant to be a configuration change rather than a rewrite. Its positioning around the composable data stack, Arrow-native and Spark-compatible, fits the open lakehouse philosophy well.&lt;/p&gt;

&lt;p&gt;Where it falls short: Sail is a young project (it celebrated its first birthday in 2026), so production maturity, ecosystem depth, and feature parity with Spark's full surface are still developing. It runs in your AWS account with setup steps, and like Spark it is a processing engine rather than a catalog, governance layer, or interactive serving platform. For the batch and stream processing role, LakeSail is a promising, cost-efficient alternative to Spark that, through the REST catalog, produces and consumes the same Iceberg tables as everything else in your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Assemble These Tools Into One Lakehouse
&lt;/h2&gt;

&lt;p&gt;Read across all thirteen assessments and a pattern emerges. Each tool has a role where it is genuinely the best choice, and almost none of them is a complete lakehouse on its own. Spark and LakeSail write and transform at scale. Flink and Fluss bring real-time streams into Iceberg. Bauplan builds Python-native pipelines with branching. StarRocks and ClickHouse serve fast queries on specific workloads. DuckDB and Spice.ai handle local and application-level analytics and AI grounding. DataFusion is the substrate several of them share. Snowflake and Databricks are powerful engines that also want to be platforms.&lt;/p&gt;

&lt;p&gt;What makes this collection work as one architecture rather than a pile of disconnected tools is the Iceberg REST catalog. It is the standard interface that lets Spark write a table, StarRocks serve it, DuckDB explore it, and an AI agent query it, all on one copy of data with one governance model. Every tool worth adopting in 2026 either speaks REST or is racing to. When you evaluate any of these tools, the first question should be how cleanly it reads and writes through a REST catalog, because that determines whether you can use it where it shines without fragmenting your data.&lt;/p&gt;

&lt;p&gt;And that is the case for putting Dremio at the center. The other twelve tools are specialists. Dremio is the platform built natively on Iceberg, Polaris, and Arrow that provides the catalog (managed Apache Polaris with REST and credential vending), the high-performance Arrow-native SQL engine, the autonomous table maintenance that keeps streaming and batch tables query-ready, the semantic layer that gives both humans and AI agents shared business context, and a pricing model that charges only for Dremio compute so you stay free to use every other engine against the same catalog. You can let Flink stream, Spark batch-load, Bauplan build, and StarRocks serve, and have all of it land in a governed, optimized, openly accessible Iceberg estate that Dremio coordinates. That is what a lakehouse is supposed to be: open at every layer, fast where it counts, and free of the lock-in that the format was designed to eliminate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the best query engine for an Apache Iceberg lakehouse?&lt;/strong&gt;&lt;br&gt;
It depends on the workload, which is the whole point of an open lakehouse. For a governed central engine with native Iceberg processing, autonomous optimization, and a built-in catalog and semantic layer, Dremio is the strongest hub. For sub-second customer-facing dashboards, StarRocks excels. For event and log analytics, ClickHouse is fast. For local and embedded work, DuckDB. The REST catalog lets you use each where it fits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is the Iceberg REST catalog so important?&lt;/strong&gt;&lt;br&gt;
The REST catalog is the standard API that every engine uses to discover tables, resolve metadata, coordinate atomic commits, and enforce access. It is what enables true multi-engine interoperability on a single copy of data. Without it, you fragment your estate across incompatible catalogs and recreate the silos Iceberg was meant to remove. Apache Polaris is the leading open, vendor-neutral REST catalog implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do Snowflake and Databricks really support Iceberg now?&lt;/strong&gt;&lt;br&gt;
Yes. Both made Iceberg V3 generally available in 2026 and expose Iceberg REST Catalog APIs (Snowflake through Horizon Catalog and Open Catalog, Databricks through Unity Catalog) so external engines can read and, in many cases, write. But both engines are optimized for their own formats and pull workloads toward their platforms, so in a multi-engine lakehouse they are best treated as powerful engines connected through the catalog rather than the neutral center.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Iceberg V3 and which tools support it?&lt;/strong&gt;&lt;br&gt;
Iceberg V3 is the third major version of the table format, adding deletion vectors (faster DML), row lineage (native CDC), the VARIANT type for semi-structured data, default column values, geospatial types, and nanosecond timestamps. As of mid-2026 it is generally available on Dremio, Snowflake, Databricks, Spark on EMR, and AWS services, with support in DuckDB, Flink, and others. Always verify that every downstream reader supports V3 before upgrading production tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use open source tools instead of a commercial platform?&lt;/strong&gt;&lt;br&gt;
Absolutely. Apache Spark, Flink, Fluss, DataFusion, DuckDB, StarRocks, Polaris, and others are open source and can compose a full lakehouse. The trade-off is the integration and operational work: catalog, optimization, compaction, governance, and a semantic layer all become your responsibility. A platform like Dremio bundles those into one product so your team builds data products instead of maintaining infrastructure, while still keeping your data in open Iceberg tables you can access with any engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Deeper on the Data Lakehouse and AI
&lt;/h2&gt;

&lt;p&gt;The Apache Iceberg lakehouse is the default architecture for analytics and AI in 2026, and the tools above are how you bring it to life. If you want to understand the full picture, from open table formats and the Iceberg REST catalog to Apache Polaris and agentic AI on the lakehouse, read my books on Apache Iceberg, Apache Polaris, and data lakehouses. Find them all at books.alexmerced.com.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Iceberg v4: The Current State, the Proposals, and Why They Matter</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Tue, 09 Jun 2026 20:30:50 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apache-iceberg-v4-the-current-state-the-proposals-and-why-they-matter-3e07</link>
      <guid>https://dev.to/alexmercedcoder/apache-iceberg-v4-the-current-state-the-proposals-and-why-they-matter-3e07</guid>
      <description>&lt;p&gt;A few years ago the question about Apache Iceberg was whether open table formats could replace proprietary warehouses. That question is closed. Iceberg won. The new question is sharper and more interesting. What do we do with it next?&lt;/p&gt;

&lt;p&gt;That is the question driving Iceberg v4.&lt;/p&gt;

&lt;p&gt;At Iceberg Summit 2026 in San Francisco, more than 600 people gathered for two days and over 70 sessions. Not one talk tried to convince the room to adopt Iceberg. Every session assumed you already run it in production. The energy went somewhere else. It went to the limitations that success created, and to the spec changes that fix them.&lt;/p&gt;

&lt;p&gt;This post walks through the state of v4 as of June 2026. It covers each major proposal, how the proposal works at a technical level, and why it matters for the people who run Iceberg at scale. It also covers the live debates, since v4 is not finished and the arguments on the dev list tell you as much as the design documents do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rf7gtkepbblzkkimjai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rf7gtkepbblzkkimjai.png" alt="Apache Iceberg V4" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where v4 stands today
&lt;/h2&gt;

&lt;p&gt;Start with the honest status. Iceberg v4 is not released. It is not finalized. It exists as design documents, GitHub issues, Iceberg Enhancement Proposals, and long threads on the dev mailing list. The current stable release is 1.10.0 from September 2025, and that release sits firmly in the v3 era.&lt;/p&gt;

&lt;p&gt;The practical guidance has not changed. Treat v3 as the production target. Treat v4 as the horizon worth watching. Build on what is stable and tested rather than waiting on features that have no committed ship date.&lt;/p&gt;

&lt;p&gt;That said, v4 is no longer a vague wish list. The Summit made that clear. The proposals presented there were not academic. They were direct answers to operational pain that real teams hit at scale. And the people shaping them are the people who feel that pain most. Engineers from Google, Apple, Snowflake, Databricks, Microsoft, Netflix, and LinkedIn sit in the same design discussions and review the same pull requests. That is part of why the community trusts the direction even before the vote.&lt;/p&gt;

&lt;p&gt;You can already see fragments of v4 leaking into the official spec text. The spec now describes behavior for "v4 and later" when it talks about how a table location is handled. That is a small detail, but it signals that the spec authors have started writing v4 semantics into the document itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dk8y2zzi2mpmematk7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dk8y2zzi2mpmematk7a.png" alt="Evolution of Apache Iceberg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Iceberg metadata works today, in plain terms
&lt;/h2&gt;

&lt;p&gt;To understand the proposals, you need a quick mental model of how Iceberg tracks a table right now. Skip this section if you already know it cold.&lt;/p&gt;

&lt;p&gt;Iceberg replaced the old Hive approach of tracking data by directory. Hive mapped each partition to a folder and treated every file in that folder as part of the table. That worked on HDFS where directory listings were fast. It broke on object storage like S3, where listing millions of files across nested partitions got slow and expensive, and where request-rate throttling caused real outages.&lt;/p&gt;

&lt;p&gt;Iceberg fixed this by tracking individual files through a tree of metadata. The tree has a few layers.&lt;/p&gt;

&lt;p&gt;Data files hold the actual rows, usually in Parquet. Manifest files list groups of data files along with per-file statistics like row counts and the min and max value of each column. A manifest list collects all the manifests that make up one snapshot. A metadata file, written as JSON, points to the current snapshot and stores table-level details like schema, partition spec, sort orders, and snapshot history.&lt;/p&gt;

&lt;p&gt;Every commit produces a new immutable snapshot. Readers get a consistent point-in-time view. Writers add data through atomic swaps of the metadata pointer. This is what gives Iceberg time travel, rollback, and snapshot isolation on cheap object storage.&lt;/p&gt;
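&lt;p&gt;The atomic swap is easier to see in miniature. The sketch below is a toy, single-threaded model, not Iceberg's catalog API: the &lt;code&gt;Catalog&lt;/code&gt; class, its &lt;code&gt;swap&lt;/code&gt; method, and the file names are all made up for illustration. It only shows the compare-and-swap idea, where a commit succeeds if the pointer still names the metadata file the writer started from.&lt;/p&gt;

```python
class Catalog:
    """Toy catalog holding one pointer to the table's current metadata file."""

    def __init__(self, current):
        self.current = current

    def swap(self, expected, new):
        # The commit succeeds only if nobody else moved the pointer first.
        # A real catalog performs this check-and-set atomically; this toy
        # version is single-threaded and only illustrates the rule.
        if self.current != expected:
            return False
        self.current = new
        return True


catalog = Catalog("metadata/v1.json")

# Two writers both start from v1 and each prepares its own new metadata file.
first = catalog.swap("metadata/v1.json", "metadata/v2-writer-a.json")
second = catalog.swap("metadata/v1.json", "metadata/v2-writer-b.json")

print(first, second, catalog.current)
```

&lt;p&gt;The second writer loses the race and would have to retry against the new snapshot. Readers that loaded v1 earlier keep a consistent view, since the old files are never modified.&lt;/p&gt;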

&lt;p&gt;The payoff of this tree shows up at query time. An engine reads the metadata, checks the per-file statistics, and skips any file whose min and max values cannot match the query filter. It does this without listing directories or opening data files. Scan planning becomes a metadata lookup rather than a full scan of the storage layout. A single table can hold tens of petabytes, and an engine can still plan a query against it quickly, since it reads metadata instead of crawling files. That property is the core architectural advantage of Iceberg, and every v4 proposal is careful to protect it.&lt;/p&gt;
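&lt;p&gt;Here is a minimal sketch of that pruning step, assuming a single integer column and an equality filter. The &lt;code&gt;FileStats&lt;/code&gt; record and the file paths are hypothetical stand-ins for manifest entries, not the real manifest schema.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file statistics as a manifest entry might record them (simplified)."""
    path: str
    lower: int  # minimum value of the filter column in this file
    upper: int  # maximum value of the filter column in this file


def may_contain(stats, value):
    # The file can hold `value` only if it falls inside [lower, upper].
    return value in range(stats.lower, stats.upper + 1)


def plan_scan(manifest, value):
    # Keep only files whose bounds admit the value; no data file is opened.
    return [f.path for f in manifest if may_contain(f, value)]


manifest = [
    FileStats("data/a.parquet", 1, 100),
    FileStats("data/b.parquet", 101, 200),
    FileStats("data/c.parquet", 201, 300),
]

files_to_read = plan_scan(manifest, 150)
print(files_to_read)
```

&lt;p&gt;Two of the three files are ruled out from metadata alone. Real engines apply the same idea to ranges, multiple columns, and partition values.&lt;/p&gt;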

&lt;p&gt;The spec has grown in clear stages. V1 set the foundation with immutable data files, snapshots, hidden partitioning, and safe schema evolution. V2 added delete files, which let engines mark rows for removal without rewriting whole data files. That made row-level updates and merge-on-read practical, and it powered change data capture and GDPR deletions. V3, shipped across the 1.8 through 1.10 releases in 2025, added binary deletion vectors, the variant type for semi-structured data, native geometry and geography types, nanosecond timestamps, row lineage, default column values, multi-argument partition transforms, and table encryption keys.&lt;/p&gt;

&lt;p&gt;Each version solved real problems. And each version exposed the next set of problems. That brings us to v4.&lt;/p&gt;

&lt;p&gt;The pattern behind v4 is consistent. Iceberg was built for large, slow-moving analytical tables. The workloads people run on it now are anything but slow-moving. Streaming pipelines commit every few seconds. Machine learning feature tables carry thousands of columns. Disaster recovery plans demand that a table can move between buckets and regions. The metadata design that served batch analytics well becomes the bottleneck under these new patterns. V4 attacks that bottleneck from several angles at once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy5qbbifqi13iqaknyb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy5qbbifqi13iqaknyb6.png" alt="The Iceberg Metadata" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposal one: adaptive metadata trees and single-file commits
&lt;/h2&gt;

&lt;p&gt;This is the headline proposal, and the most ambitious one.&lt;/p&gt;

&lt;p&gt;Look at what a commit costs today. Even a tiny write produces a new metadata.json, a new manifest list, and one or more new manifest files. The change might touch one data file. The metadata work touches several files anyway. This is write amplification, and it shows up as commit latency.&lt;/p&gt;

&lt;p&gt;For a batch job that runs once an hour, the cost is invisible. For a streaming job that commits every few seconds, the cost is fatal. The metadata writing dominates, the small files pile up, and object storage starts throttling requests against the shared prefix. Delete operations make it worse. Under copy-on-write, a delete can trigger a full manifest rewrite. Caching manifests across commits gets hard, since the files keep getting replaced.&lt;/p&gt;

&lt;p&gt;The v4 answer is a restructured metadata tree built around a Root Manifest. The Root Manifest replaces the old manifest list and serves as the single entry point for each snapshot. The hierarchy collapses into a clean two-level shape.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Root Manifest -&amp;gt; Data Manifests / Delete Manifests / Files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key behavior is that a commit modifies only what changed. Metadata growth becomes proportional to the size of the operation, not the size of the table. A one-file write produces a one-file change. The benefits land immediately. Commits get faster. Rewrites get rarer. Query planning improves too, since the Root Manifest can aggregate file-level metrics from its children, which lets engines prune earlier in planning.&lt;/p&gt;

&lt;p&gt;The word "adaptive" is the important part of the design. The proposal does not force every write to be a single-file commit. Small writes can be inlined directly into the root for low latency. As the root fills up, background maintenance rebalances entries down into leaf manifests. Writers can also choose to pay the rebalancing cost at a moment that suits them. The structure adapts to the workload. A streaming table keeps its hot writes near the root for speed. A batch table behaves more like the classic layout. One spec, two operating modes, chosen by the shape of the work.&lt;/p&gt;
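&lt;p&gt;A toy model makes the two modes easier to picture. This is only a sketch of the idea as described above, since the v4 design is not final: the &lt;code&gt;AdaptiveTree&lt;/code&gt; class and its method names are invented for illustration, and a real rebalance would follow whatever policy the spec settles on.&lt;/p&gt;

```python
class AdaptiveTree:
    """Toy model: small commits are inlined in the root, and a rebalance
    moves the inlined entries down into a leaf manifest."""

    def __init__(self):
        self.root_entries = []    # file entries inlined directly in the root
        self.leaf_manifests = []  # each leaf is a list of file entries

    def commit(self, data_file):
        # A small write appends one entry to the root and touches nothing else.
        self.root_entries.append(data_file)

    def rebalance(self):
        # Background maintenance (or a writer that chooses to pay the cost):
        # push everything inlined so far into one new leaf manifest.
        if self.root_entries:
            self.leaf_manifests.append(list(self.root_entries))
            self.root_entries.clear()

    def all_files(self):
        # A reader sees the inlined entries plus everything in the leaves.
        files = list(self.root_entries)
        for leaf in self.leaf_manifests:
            files.extend(leaf)
        return files


tree = AdaptiveTree()
for i in range(5):
    tree.commit(f"data/file-{i}.parquet")

inlined_before = len(tree.root_entries)
tree.rebalance()
tree.commit("data/file-5.parquet")

print(inlined_before, len(tree.leaf_manifests), tree.root_entries)
```

&lt;p&gt;A streaming table would spend most of its time in the state after the last commit, with a few hot entries in the root. A batch table that rebalances on every write ends up looking like the classic layout.&lt;/p&gt;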

&lt;p&gt;This is the proposal that enables low-latency writes without giving up read performance on huge tables. That combination is the whole point. Streaming wants fast small commits. Analytics wants fast pruning over petabytes. The adaptive tree tries to serve both from one structure.&lt;/p&gt;

&lt;p&gt;Make this concrete. Picture a Flink job pulling from Kafka and committing to an Iceberg table every five seconds. Under v3, each of those commits writes a fresh metadata.json, a fresh manifest list, and at least one new manifest, even when the commit added a single small data file. Over an hour that is 720 commits, each one multiplying a tiny data write into several metadata writes. The small files pile up against one storage prefix and trigger throttling. Teams work around this with frequent compaction jobs that fight the ingestion they are trying to support. Under the v4 adaptive tree, those same 720 commits inline their tiny changes near the root and rebalance in the background. The write path stops multiplying. The compaction pressure drops. The streaming job and the table maintenance stop competing.&lt;/p&gt;
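&lt;p&gt;The arithmetic behind that example is short enough to write down. The figure of three metadata files per commit is the minimum described above (one metadata.json, one manifest list, one manifest); real commits can write more.&lt;/p&gt;

```python
SECONDS_PER_HOUR = 3600
commit_interval_s = 5          # one commit every five seconds
metadata_files_per_commit = 3  # metadata.json + manifest list + one manifest

commits_per_hour = SECONDS_PER_HOUR // commit_interval_s
metadata_writes_per_hour = commits_per_hour * metadata_files_per_commit

print(commits_per_hour)          # 720
print(metadata_writes_per_hour)  # 2160
```

&lt;p&gt;So a job that adds 720 small data files in an hour also writes at least 2,160 metadata files alongside them.&lt;/p&gt;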

&lt;p&gt;Now the honest part. This proposal carries the liveliest debate on the dev list, and the questions are good ones.&lt;/p&gt;

&lt;p&gt;If small commit entries get inlined into the root, then a reader has to scan those inlined entries to plan a query. People asked whether the spec accepts a linear scan cost as the price of write throughput, or whether there is a pre-index mechanism that avoids decoding data pages for every sub-second query. People also asked about the catalog. A REST catalog under high concurrency might have to perform partial Parquet decodes on hundreds of inlined entries per request. That risks turning the catalog into a mini query engine just to do basic partition pruning. And there is a circular risk. If the fix for scan cost is to flush entries to leaf manifests more often, then you reintroduce the frequent-small-file problem and the object storage throttling that single-file commits were meant to solve in the first place.&lt;/p&gt;

&lt;p&gt;None of these questions are fatal. They are the normal tension between write speed and read speed, played out in a new structure. But they are the reason v4 is still a proposal and not a release. The community is working through the amortized cost analysis. How big should the root buffer be? How often should rebalancing run? How do different workloads at different scale factors change the answer? These are the details that get settled over months before a vote.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd5jqr4k2121rtosald2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd5jqr4k2121rtosald2.png" alt="Iceberg Adaptive Metadata Tree" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposal two: storing metadata in Parquet instead of Avro
&lt;/h2&gt;

&lt;p&gt;Since the early versions, Iceberg has stored its manifests and manifest lists in Apache Avro. Avro is row-based. That choice was sensible when manifests were small and engines read them as whole records.&lt;/p&gt;

&lt;p&gt;Tables grew. Manifests grew with them. A wide table can carry hundreds of columns, and each manifest entry then carries hundreds of per-column statistics. The problem is that Avro forces an engine to deserialize an entire record even when it needs only a sliver of it. During query planning, an engine often wants just the file path and the min and max of a single column. With Avro it pays to read everything.&lt;/p&gt;

&lt;p&gt;The v4 proposal moves metadata to a columnar format using Apache Parquet. This is the same format that already stores the data in most Iceberg tables. The win is direct. An engine can read only the columns of metadata it needs. Column pruning and predicate pushdown, the same tricks that make Parquet fast for data, now apply to metadata queries too. Memory use drops. Planning gets faster on wide tables.&lt;/p&gt;
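&lt;p&gt;The difference between the two layouts can be counted directly. The sketch below uses plain Python structures as stand-ins for the two formats, with made-up field names, and simply tallies how many values a planner has to decode to get a file path and the bounds of one column. It is an illustration of projection, not a benchmark of Avro or Parquet.&lt;/p&gt;

```python
# Three manifest entries carrying bounds for two columns (toy data).
row_layout = [
    {"path": "a.parquet", "min_id": 1, "max_id": 100, "min_ts": 10, "max_ts": 19},
    {"path": "b.parquet", "min_id": 101, "max_id": 200, "min_ts": 20, "max_ts": 29},
    {"path": "c.parquet", "min_id": 201, "max_id": 300, "min_ts": 30, "max_ts": 39},
]


def to_columnar(rows):
    # Pivot records into one list per field, the way a columnar file groups values.
    return {key: [row[key] for row in rows] for key in rows[0]}


def values_decoded_row_based(rows):
    # A row format decodes every field of every record before projecting.
    return sum(len(row) for row in rows)


def values_decoded_columnar(columns, wanted):
    # A columnar format reads only the requested columns.
    return sum(len(columns[name]) for name in wanted)


wanted = ["path", "min_id", "max_id"]
column_layout = to_columnar(row_layout)

row_cost = values_decoded_row_based(row_layout)
column_cost = values_decoded_columnar(column_layout, wanted)
print(row_cost, column_cost)
```

&lt;p&gt;With two columns of statistics the gap is small. On a table with hundreds of columns, the row-based cost grows with table width while the columnar cost stays tied to the columns the query filters on.&lt;/p&gt;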

&lt;p&gt;There is a pleasing symmetry here. Metadata storage starts to look like data storage. The same engine machinery that scans Parquet data files can scan Parquet metadata files. And this proposal pairs naturally with the adaptive metadata tree. As the metadata gets richer and more expressive, columnar reads keep planning fast. You get more detail in the metadata without paying to read all of it on every query.&lt;/p&gt;

&lt;p&gt;The change does raise a compatibility question that the community has to handle with care. Every existing engine reads Avro metadata today. A move to Parquet metadata means every reader and writer needs to learn the new format, and tables written under v4 with Parquet metadata will not open in an engine that only knows the older layout. This is the normal cost of a format version bump, and it is why v4 is a new spec version rather than a patch. Engines will add v4 support over a period of months, the same way v3 support rolled out across the ecosystem during 2025. The reward is worth the transition. Metadata reads stop being a tax that grows with table width.&lt;/p&gt;

&lt;p&gt;The dependency runs both ways with the column statistics rework, which is the next proposal. Columnar metadata is the container. Better-typed statistics are part of what fills it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6e4bzn93hu8u5jj8e7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6e4bzn93hu8u5jj8e7w.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposal three: reworking column statistics into first-class data
&lt;/h2&gt;

&lt;p&gt;This proposal sounds small. It is not. It quietly opens the door to a class of workloads Iceberg was never designed for.&lt;/p&gt;

&lt;p&gt;Look at how stats work today. For each column, Iceberg stores lower and upper bounds, null counts, value counts, and column sizes, each as a generic map from a field ID to a value. The maps are flexible, and they function, but they have three weaknesses. They are inefficient for wide tables, since you carry big maps per file. They lose type information during serialization, so an engine cannot always trust the physical and logical type of a bound. And they make it hard to project only the specific stats you want, since each map is opaque.&lt;/p&gt;

&lt;p&gt;The v4 proposal introduces a typed, structured representation of column statistics. Each field's stats get stored with their logical and physical types preserved. That makes them reliable across schema evolution, where types and IDs shift over a table's life. Engines can read individual stats, like just the lower bounds for three columns, without loading the whole stats payload into memory.&lt;/p&gt;
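&lt;p&gt;As a rough picture of what "typed and structured" could mean, here is a sketch in which each field's stats travel with their logical type. The &lt;code&gt;FieldStats&lt;/code&gt; record, its attribute names, and the sample values are assumptions made for illustration and are not taken from the proposal's schema.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical typed stats record for one field of one data file."""
    field_id: int
    logical_type: str  # kept next to the bounds so readers can interpret them
    lower: object
    upper: object
    null_count: int


file_stats = [
    FieldStats(1, "long", 1, 100, 0),
    FieldStats(2, "date", "2026-01-01", "2026-06-01", 3),
    FieldStats(3, "string", "alpha", "zulu", 0),
]


def lower_bounds(stats, field_ids):
    # Project one statistic for a chosen set of fields, with its type attached.
    return {s.field_id: (s.logical_type, s.lower) for s in stats if s.field_id in field_ids}


selected = lower_bounds(file_stats, {1, 2})
print(selected)
```

&lt;p&gt;The point of the structure is that a reader asks for exactly the statistic it needs and gets the type along with the value, instead of decoding an opaque map and guessing.&lt;/p&gt;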

&lt;p&gt;The part that matters most is extensibility. A typed, structured stats model lets developers attach richer per-field metrics. For a variant column you might attach stats that describe its nested fields. For a geometry column you might attach a bounding box. And the structure can hold entirely new kinds of metrics. This is where vector search enters the conversation.&lt;/p&gt;

&lt;p&gt;Approximate nearest neighbor search, the operation at the heart of vector databases and retrieval for AI, needs index structures that the current stats map simply cannot express. By rebuilding column statistics for flexibility, v4 opens the door to new index types that support these queries. An Iceberg table could carry the metadata needed to prune candidates for a similarity search the same way it prunes files for a range filter today. That turns Iceberg into a more serious home for the feature and embedding tables that AI workloads generate.&lt;/p&gt;

&lt;p&gt;The chain of dependencies is now visible. Columnar Parquet metadata gives you a container that supports column pruning. Typed statistics give you the structured, extensible content to put in that container. The adaptive tree keeps commits cheap so you can write and update all of this without write amplification. The three proposals are not independent. They are one coordinated redesign of the metadata layer, split into pieces that can be reviewed and voted on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboper3hm5hrkmi3z3h7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboper3hm5hrkmi3z3h7c.png" alt="column stats" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposal four: relative paths and relocatable tables
&lt;/h2&gt;

&lt;p&gt;This proposal fixes an operational headache that has annoyed teams for years.&lt;/p&gt;

&lt;p&gt;Iceberg stores file references as absolute URIs. Every manifest and metadata file embeds the full path to the files it points at, including the bucket and region. That was a deliberate early decision. Absolute paths solved real consistency problems on eventually-consistent object stores, where a stale or ambiguous reference could corrupt a read.&lt;/p&gt;

&lt;p&gt;The cost shows up the moment you need to move a table. Copy a table to a new bucket, a new region, or a different storage system, and every embedded path is now wrong. You have to rewrite the metadata to point at the new location. For a large table with deep metadata, that rewrite is slow and expensive. It turns routine operations into projects. Replication, disaster recovery backups, and multi-region deployments all run into this wall.&lt;/p&gt;

&lt;p&gt;The v4 proposal adds support for relative paths inside table metadata. References get stored relative to the table root rather than as absolute URIs. Move the table root, and the internal relationships between metadata and data files stay valid without a rewrite. Copy the whole directory tree somewhere else, and it just works. Absolute paths remain available where you still need them, such as references to external data that lives outside the table root.&lt;/p&gt;
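&lt;p&gt;The resolution rule is simple to sketch. The function below is an illustration under assumed conventions, not the spec's algorithm: it treats any reference containing a URI scheme as absolute and joins everything else onto the table root. The bucket names and paths are invented.&lt;/p&gt;

```python
def resolve(table_root, reference):
    # Absolute URIs (anything with a scheme) pass through untouched,
    # which covers references to data outside the table root.
    if "://" in reference:
        return reference
    # Relative references are joined onto whatever root the catalog reports.
    return f"{table_root.rstrip('/')}/{reference}"


relative_ref = "data/part-00000.parquet"
external_ref = "s3://shared-bucket/reference/lookup.parquet"

before_move = resolve("s3://bucket-us-east/warehouse/orders", relative_ref)
after_move = resolve("s3://bucket-eu-west/dr/orders/", relative_ref)
unchanged = resolve("s3://bucket-eu-west/dr/orders/", external_ref)

print(before_move)
print(after_move)
print(unchanged)
```

&lt;p&gt;The stored reference never changes. Only the root supplied at read time does, which is why a copied table needs no metadata rewrite.&lt;/p&gt;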

&lt;p&gt;The payoff is portability. A table becomes a self-contained, relocatable unit. You can replicate it to another region for disaster recovery and not pay a metadata rewrite tax. You can clone it for testing. You can migrate it between storage systems during a cloud transition. The Summit framing put it plainly. Relative paths eliminate entire categories of expensive metadata rewrites.&lt;/p&gt;

&lt;p&gt;This is the proposal that is furthest along in the spec text. The spec already describes how table location works for "v4 and later," and the model assumes a catalog will provide the table's location rather than baking it into every file reference. That is a clean separation. The catalog knows where the table lives. The metadata describes the table's internal structure in terms relative to that location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi37zzknmgwczzaecbqeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi37zzknmgwczzaecbqeu.png" alt="relative paths" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposal five: column families and efficient column updates for wide AI tables
&lt;/h2&gt;

&lt;p&gt;This proposal targets a workload that did not exist when Iceberg was designed: wide tables for machine learning.&lt;/p&gt;

&lt;p&gt;Picture a feature table with 200 columns, or an embedding table where each row holds a large vector. Now picture a daily job that recomputes 5 or 10 of those features and leaves the rest untouched. Or a job that refreshes prediction scores after a model retrains. Or one that regenerates vector embeddings after a new embedding model ships.&lt;/p&gt;

&lt;p&gt;In Iceberg today, all of these jobs pay the same brutal price. Updating any column means rewriting the entire row. A small update to a handful of features forces a full rewrite of files that hold all 200 columns. At petabyte scale this is cost-prohibitive. The write amplification is enormous. You touch 5 percent of the data and rewrite 100 percent of the files.&lt;/p&gt;

&lt;p&gt;The proposal, tracked in GitHub issue 15146 as "Efficient column updates in Iceberg," attacks this directly. The idea is to write only the updated columns to separate column files and leave the unchanged columns sitting in the original base files. At read time, the engine stitches the column files together with the base files to materialize complete rows. You update the embedding column by writing a new embedding column file. The other 199 columns never move.&lt;/p&gt;
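&lt;p&gt;The read-time stitch can be sketched with plain dictionaries standing in for files. This is a toy under stated assumptions, not the proposed file layout: it assumes the column file covers every row of the base file in the same order, and the column names and values are made up.&lt;/p&gt;

```python
# Base file: all columns as originally written.
base_file = {
    "row_id": [0, 1, 2],
    "user": ["ann", "bob", "cy"],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

# Column file: only the refreshed embedding column, one value per row.
column_file = {
    "embedding": [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
}


def stitch(base, updates):
    row_count = len(base["row_id"])
    for name, values in updates.items():
        if len(values) != row_count:
            raise ValueError(f"column {name} does not cover every row")
    # Updated columns shadow the base columns; the rest come from the base file.
    merged = {**base, **updates}
    return [dict(zip(merged, row)) for row in zip(*merged.values())]


rows = stitch(base_file, column_file)
print(rows[0])
```

&lt;p&gt;The base file is never rewritten. The write cost is the size of the one column that changed, and the read pays a merge step to put rows back together.&lt;/p&gt;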

&lt;p&gt;This is the column families pattern, and the Summit described it as first-class support for wide tables. Column groups get stored and evolved independently. New features can be backfilled into a table without touching the rest of it. A team can add a column family of fresh features and write only that family.&lt;/p&gt;

&lt;p&gt;The use cases the proposal calls out map exactly to AI pipelines. Model score updates after retraining. Embedding refresh, which today triggers a full row rewrite. Incremental feature computation, where a daily batch touches a tiny fraction of a wide table's columns. These are not edge cases for AI teams. They are the daily routine.&lt;/p&gt;

&lt;p&gt;This proposal leans hard on the others. It builds on single-file commits and on the column statistics rework. The design notes that explicitly. You need cheap commits to write column updates without amplification, and you need good per-column stats to keep reads fast once the data is split across base files and column files. The current draft scopes itself to updates that touch a column across all rows. Partial updates that touch a subset of rows are left for later work.&lt;/p&gt;

&lt;p&gt;The design debate here is genuinely interesting, and it is not settled. Several contributors asked whether this belongs in Iceberg at all, or whether the right fix lives in Parquet. Parquet has a long-running effort to make its footer cheaper to read, including a proposal to replace the footer with FlatBuffers for dramatically faster reads. Parquet could introduce a concept of logical and physical files to manage a column-to-file mapping. The counterargument is that a column-to-file mapping inside Parquet starts to look like another manifest, which duplicates the job Iceberg already does. Other contributors pointed at how Lance, Hudi, and Paimon handle partial updates and column groups, and asked what Iceberg should borrow. One useful observation from the thread is that splitting a wide table into independently updated column families also reduces commit conflicts, since separate writers update separate families instead of serializing writes against one table.&lt;/p&gt;

&lt;p&gt;This is the proposal that most clearly signals where Iceberg is heading. The format is being shaped to treat AI and machine learning data as a first-class workload, not a batch analytics afterthought.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrv27osp1t7pqyt1z9b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrv27osp1t7pqyt1z9b5.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Other proposals in the conversation
&lt;/h2&gt;

&lt;p&gt;The five proposals above carry the most momentum, but they are not the whole v4 conversation. Several other ideas show up in the design documents and the dev list, and they are worth knowing about even if they are earlier in the process.&lt;/p&gt;

&lt;p&gt;Multi-table transactions and catalog-level semantics come up often. Today an Iceberg commit is atomic for a single table. A pipeline that writes to several tables and needs all of them to commit together, or none of them, has to build that coordination itself. Many teams want a way to commit across tables atomically, so that a fact table and its related dimension tables move as one unit. This kind of catalog-level transaction would be transformative for complex pipelines, and it has been flagged as one of the most-watched horizon features. It is also one of the hardest to design, since it pushes transactional guarantees up from the table into the catalog, and the REST catalog spec would have to carry the new semantics. Expect this one to take time.&lt;/p&gt;

&lt;p&gt;Refinements to the v3 types also continue. The variant type, added in v3 for semi-structured data, has room for richer operations and better statistics, and the column statistics rework feeds directly into making variant queries faster. The geospatial types added in v3 invite extended capabilities for spatial indexing and filtering. Row lineage, the feature that gives each row a persistent identity across commits, has open discussion about making incremental processing even cheaper. None of these are headline rewrites of the format. They are the steady tightening that happens once a feature ships and real workloads reveal the rough edges.&lt;/p&gt;

&lt;p&gt;There is also ongoing work at the file-format layer that v4 depends on, even though it lives outside the Iceberg spec. The Parquet community is working to make the footer cheaper to read, including a proposal to replace its Thrift encoding with FlatBuffers for faster metadata access. Parquet and Arrow are evolving for the AI era in parallel with Iceberg. The Summit paired the Iceberg metadata talks with sessions on evolving Parquet and Arrow for what comes next, since the table format and the file format have to move together. A faster Parquet footer makes columnar Iceberg metadata faster to read. Better Parquet support for column-level updates makes the column families proposal cleaner. The layers are coupled, and the communities coordinate.&lt;/p&gt;

&lt;p&gt;Keep the maturity levels straight when you read about these. Single-file commits, Parquet metadata, typed statistics, relative paths, and column families have concrete design documents and active pull requests. Multi-table transactions and the type refinements are real conversations with less settled design. Treat the first group as the likely core of v4 and the second group as candidates that may land in v4 or may slip to a later version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0smycq0fqrx1qevyouh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0smycq0fqrx1qevyouh.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The convergence question: Iceberg v4 and Delta 5.0
&lt;/h2&gt;

&lt;p&gt;No discussion of v4 is complete without the Databricks angle, since it reframes the whole conversation.&lt;/p&gt;

&lt;p&gt;In the run-up to the Summit, Databricks announced that Iceberg v4 will rethink the core metadata structure with an adaptive metadata tree, and that Databricks is proposing Delta Lake 5.0 adopt the same structure. The pitch is convergence. One metadata layout that both Delta and Iceberg read and write directly. No translation layer like UniForm. No conversion tools like XTable. The two formats would sit on a shared on-disk foundation.&lt;/p&gt;

&lt;p&gt;The technical claim is that Delta and Iceberg have already converged on the same ideas. Both moved to columnar metadata for efficient pruning. Both use manifest-style trees for scalability. Both adopted deletion vectors for fast updates. Yet today each maintains its own separate metadata structure, which duplicates effort and forces translation when you want to read one format from the other format's engine. Databricks proposes that Delta 5.0 adopt the Iceberg v4 metadata tree as its native content metadata. The result would be a single structure that clients of either format read and write with no conversion overhead.&lt;/p&gt;

&lt;p&gt;If this lands, the practical effect is large. The word you pick, Delta or Iceberg, would describe history rather than architecture. Switching formats would cost nothing at the metadata layer. That changes the competitive picture for every vendor that built a business on format choice.&lt;/p&gt;

&lt;p&gt;The context behind this matters. In June 2024, Databricks paid more than a billion dollars for Tabular, the company founded by the original creators of Iceberg. The revenue multiple was indefensible on paper. The strategic logic was exact. The acquisition brought the architects of Iceberg inside Databricks. Two years later, the people who built the open format that was positioned as the alternative to Databricks now help steer how Databricks governs that format. The firm shaping the convergence narrative is the firm that bought the right to shape it.&lt;/p&gt;

&lt;p&gt;Here is the part to hold onto. Convergence is a proposal, not a decision. The Iceberg community has to accept the direction, and that acceptance is an open conversation. It is the kind of debate that plays out over months on the dev list, the same way the single-file commit details are being argued. A proposal from one large vendor, even a vendor that employs the format's creators, still has to win the community vote. The governance model is the whole point of Iceberg. No single vendor can unilaterally change the spec in ways that disadvantage the others. That is what makes the format trustworthy for long-term architecture decisions. The convergence idea will be tested against that model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf2evh7yq4ng5w5fv5ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf2evh7yq4ng5w5fv5ax.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is happening now: streaming, AI, and a maturing ecosystem
&lt;/h2&gt;

&lt;p&gt;Step back and the pattern across all five proposals is one story. Iceberg outgrew its original design assumptions, and v4 is the format catching up to its own success.&lt;/p&gt;

&lt;p&gt;The workloads tell the story. Streaming pipelines commit every few seconds, and the old metadata tree cannot sustain that commit rate. The adaptive tree and single-file commits answer streaming. Machine learning produces tables with thousands of columns and constant small updates, and the old layout forces full rewrites. Column families and efficient column updates answer ML. AI retrieval needs index structures the old stats map cannot hold, and the column statistics rework answers vector search. Disaster recovery and cloud migration need portable tables, and relative paths answer portability. Each proposal maps to a workload that was rare or nonexistent when v1 shipped.&lt;/p&gt;

&lt;p&gt;The ecosystem reached the maturity to support this push. A spec is only as useful as the tools that implement it, and Iceberg's tooling crossed a threshold. The REST catalog turned from a convenience into the connective tissue of the open lakehouse. Any engine, JVM-based or not, can work with Iceberg tables through one common interface. Apache Polaris graduated to an Apache top-level project on February 18, 2026, after incubating for 18 months with contributions from Google, Microsoft, Confluent, and many others. The catalog is becoming the control plane for governance, security, and multi-tenant access.&lt;/p&gt;

&lt;p&gt;Iceberg is also no longer a JVM-only project. The Rust implementation now powers the native scan operator in DataFusion-Comet, bypassing Spark's JVM overhead. A C++ implementation is emerging for engines that need predictable memory and SIMD-optimized execution. PyIceberg crossed 500,000 daily downloads on PyPI, and teams run it in production without ever spinning up Spark. These are production-grade implementations, and they widen who can build on Iceberg and where it can run.&lt;/p&gt;

&lt;p&gt;Multi-engine access became routine rather than aspirational. Spark handles ingestion while Snowflake, Trino, DuckDB, or Flink serve queries, and teams describe this as established architecture. The interoperability promise Iceberg made years ago is now operational reality across cloud boundaries. The net effect is that adopting Iceberg no longer demands a single monolithic technology choice. You pick the catalog that fits your governance model, the engine that fits your latency needs, and the language that fits your team, and the spec keeps them composable.&lt;/p&gt;

&lt;p&gt;V4 is the format growing to match that reality. The proposals support AI and streaming workloads as first-class citizens, not as workarounds bolted onto a batch design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bz9it01iaeattd90ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bz9it01iaeattd90ju.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What practitioners should do about v4 right now
&lt;/h2&gt;

&lt;p&gt;The temptation with a horizon spec is either to ignore it or to over-anticipate it. Both are mistakes. Here is a grounded way to think about it.&lt;/p&gt;

&lt;p&gt;Run v3 in production. It is the current standard, and it carries the features most teams actually need today, including deletion vectors, the variant type, geospatial types, and row lineage. Build new tables on v3 and get comfortable with its capabilities. Do not wait on v4 features that have no committed timeline.&lt;/p&gt;

&lt;p&gt;Watch the proposals that map to your pain. If you run streaming ingestion and you fight commit latency and small files, the adaptive metadata tree is the proposal to track. If you run wide ML feature tables and you burn money rewriting rows to update a few columns, follow the efficient column updates work in issue 15146. If you operate across regions and you dread table migrations, relative paths will change your operational life. Knowing which proposal solves your specific problem tells you which dev list threads to read.&lt;/p&gt;

&lt;p&gt;Pay attention to the catalog decision now, since it does not wait on v4. The catalog has become a Tier-1 architecture choice. It is the control plane for governance and the thing that determines whether your data can be governed, optimized, and shared consistently across engines. Pick the catalog that fits your governance model and keep your governance boundary clear. That decision compounds over years, and a wrong choice creates operational debt that grows with every table you add.&lt;/p&gt;

&lt;p&gt;Follow the source, not the summaries. The authoritative status of any v4 feature lives in the Apache Iceberg GitHub repository, the design documents linked from the issues, and the dev mailing list. Blog posts and conference recaps, including this one, are a starting point. The vote happens in the open, and the spec is the final word.&lt;/p&gt;

&lt;p&gt;Keep perspective on convergence. The Iceberg v4 and Delta 5.0 shared metadata story is real and worth understanding, but it is a proposal under community review, not a settled fact. Treat it as a direction to watch rather than a plan to build on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3enlcv0yhur53porwz85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3enlcv0yhur53porwz85.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of what comes next
&lt;/h2&gt;

&lt;p&gt;Iceberg v4 is not one feature. It is a coordinated redesign of the metadata layer, broken into proposals that each solve a concrete operational problem. The adaptive metadata tree makes commits cheap and fast. Parquet metadata makes planning fast as metadata gets richer. Typed statistics make stats reliable and extensible, and they open the door to vector search. Relative paths make tables portable. Column families make wide AI tables practical to update. The Delta convergence proposal asks whether two formats can share one foundation.&lt;/p&gt;

&lt;p&gt;These proposals reinforce each other. Cheap commits enable column updates. Columnar metadata holds typed stats. The pieces fit because they came from the same insight. Iceberg succeeded so completely that people now push it far past its original design, and the format has to evolve to hold that weight.&lt;/p&gt;

&lt;p&gt;The debates are not noise. They are the system working. The questions about scan cost in the adaptive tree, about whether column families belong in Parquet or Iceberg, and about whether the community accepts convergence are the conversations that turn a good proposal into a durable spec. V4 will arrive after those arguments resolve, not before. That is slower than a single vendor shipping a feature, and it is exactly why the result will be worth building on.&lt;/p&gt;

&lt;p&gt;For now, the practical advice holds. Run v3. Watch v4. Choose your catalog with care. And follow the work in the open, since the people building it are doing it where everyone can see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go deeper
&lt;/h2&gt;

&lt;p&gt;If you want to understand the data lakehouse and the AI workloads reshaping it at the level this post only gestures at, the best next step is to read the books that cover it end to end. Alex Merced has written multiple hands-on books on Apache Iceberg, the agentic lakehouse, modern data architecture, and AI-assisted data work. They take you from the metadata internals through to building and operating real systems.&lt;/p&gt;

&lt;p&gt;Pick them up at &lt;a href="https://books.alexmerced.com" rel="noopener noreferrer"&gt;books.alexmerced.com&lt;/a&gt; and turn the concepts in this post into working knowledge.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The State of Apache Iceberg Catalogs in June 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Mon, 08 Jun 2026 09:00:18 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/the-state-of-apache-iceberg-catalogs-in-june-2026-265e</link>
      <guid>https://dev.to/alexmercedcoder/the-state-of-apache-iceberg-catalogs-in-june-2026-265e</guid>
      <description>&lt;p&gt;The table format question is settled. Apache Iceberg won. Snowflake, Databricks, AWS, Google, and Microsoft all read and write it, and the open source engines treat it as the default. The interesting fight moved up one layer. The catalog is now the part of the stack that decides whether your lakehouse is governed, interoperable, and ready for the wave of AI agents that want to query it without a human in the loop.&lt;/p&gt;

&lt;p&gt;This is not a small detail. The catalog resolves metadata, controls access, vends credentials, sequences commits, and acts as the single API boundary between every engine and every byte of data you own. Pick the wrong one and you inherit operational debt that grows with each table. Pick well and you get engine freedom, one governance model, and a clean path as the spec evolves.&lt;/p&gt;

&lt;p&gt;June 2026 is a useful moment to take stock. Apache Polaris graduated to a top-level project in February. Snowflake Summit just wrapped with Iceberg v3 going generally available and a Polaris-powered governance layer at the center of the keynote. Databricks set the table for its own summit with a blunt claim that Unity Catalog is the most interoperable Iceberg catalog on the market. A two-year-old Iceberg operations startup got acquired by a security company valued at nine billion dollars. The pieces are moving fast, so here is a clear-eyed map of where every catalog stands, what it does well, where it falls short, and what shipped recently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an Iceberg Catalog Actually Does
&lt;/h2&gt;

&lt;p&gt;An Iceberg table is a pile of Parquet files, metadata files, and manifest lists sitting in object storage. On its own it is inert. The catalog answers the one question that makes it queryable: where is the current &lt;code&gt;metadata.json&lt;/code&gt; for this table? Without that pointer, no engine reads or writes anything.&lt;/p&gt;

&lt;p&gt;Modern catalogs do far more than resolve pointers. They enforce who can read, write, or administer each table and namespace. They vend short-lived, table-scoped storage tokens so engines never hold long-lived cloud keys. They sequence concurrent writers with server-side deconflicting instead of fragile client-side locking. They organize tables into namespaces, track view definitions, and serve as the single point for lineage and audit. The catalog is where governance lives. Everything an engine does passes through it.&lt;/p&gt;
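
&lt;p&gt;The pointer lookup and the sequencing of concurrent writers can be pictured with a toy model. The sketch below is a minimal in-memory illustration, not an Iceberg or REST catalog API: &lt;code&gt;ToyCatalog&lt;/code&gt;, &lt;code&gt;CommitConflict&lt;/code&gt;, and the storage paths are hypothetical names chosen for the example. It rejects a stale writer outright, whereas a real catalog with server-side deconflicting would try to reconcile the change before giving up.&lt;/p&gt;

```python
import threading


class CommitConflict(Exception):
    """Raised when a writer's expected pointer is stale."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table name to current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def create_table(self, name, metadata_location):
        with self._lock:
            if name in self._pointers:
                raise ValueError("table already exists: " + name)
            self._pointers[name] = metadata_location

    def load_table(self, name):
        # The one question a catalog must answer: where is the current metadata file?
        with self._lock:
            return self._pointers[name]

    def commit(self, name, expected_location, new_location):
        # Compare-and-swap: the pointer moves only if nobody committed in between.
        with self._lock:
            if self._pointers[name] != expected_location:
                raise CommitConflict(name)
            self._pointers[name] = new_location


catalog = ToyCatalog()
catalog.create_table("sales.orders", "s3://bucket/orders/metadata/v1.metadata.json")

base = catalog.load_table("sales.orders")
catalog.commit("sales.orders", base, "s3://bucket/orders/metadata/v2.metadata.json")

# A second writer that still holds the old pointer loses the race.
try:
    catalog.commit("sales.orders", base, "s3://bucket/orders/metadata/v2b.metadata.json")
    stale_writer_succeeded = True
except CommitConflict:
    stale_writer_succeeded = False
```

&lt;p&gt;The design point is that the swap happens in one place. Every engine reads and moves the same pointer through the catalog, so two writers can never both believe they own the current table state.&lt;/p&gt;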

&lt;p&gt;The reason this got interesting in 2026 is the Iceberg REST Catalog specification. Before REST, every engine needed a dedicated connector for every catalog. Spark talked to Hive Metastore one way, Trino talked to Glue another way, and custom tooling talked to an internal catalog a third way. Adding an engine or a catalog meant writing integration code for every pairing. REST collapses that. Implement the REST client once per engine, implement the REST server once per catalog, and the whole thing interoperates over plain HTTP.&lt;/p&gt;

&lt;p&gt;The protocol also opened the door to server-side capabilities the old Thrift-based approach made impossible. Credential vending limits the damage from a leaked token to one table for a few minutes. Remote signing goes further, so the engine never touches credentials at all and the catalog pre-signs each file access. Server-side commit deconflicting retries conflicts on the server. Multi-table commits give atomic visibility across several tables at once. The newest addition is scan planning. The Iceberg 1.11 release added a REST scan planning client, which lets the catalog plan a scan on the server and hand back a filtered plan. That single feature is the foundation for cross-engine access control, because the catalog can apply row filters and column masks during planning and return only the rows an engine is allowed to see.&lt;/p&gt;

&lt;p&gt;Scan planning is the feature to watch this year, so it is worth slowing down on. In the old model, an engine asked the catalog for a table’s metadata, then planned the scan itself by reading manifest files and deciding which data files to touch. The engine saw everything. Server-side scan planning flips that. The engine asks the catalog to plan the scan, and the catalog reads the metadata, applies whatever row filters and column masks the policy says this caller is allowed, and returns a plan that points only at authorized data. The engine never sees what it is not permitted to see, because the filtering happened before the plan existed. That is how a single set of policies, defined once in the catalog, gets enforced across Spark, Trino, DuckDB, and anything else that implements the client. It also offloads expensive planning work from the engine to the catalog, which caches it. Gravitino, Databricks, and Snowflake all built features on this in the last few months, and it is the technical backbone of cross-engine governance.&lt;/p&gt;
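
&lt;p&gt;A small sketch can make the flip concrete. The code below is a simplified illustration under stated assumptions, not the Iceberg REST scan planning API: the policy table, the &lt;code&gt;plan_scan&lt;/code&gt; function, and the file list are all hypothetical. It assumes the row filter lines up with partition boundaries, so filtering rows reduces to dropping files, and it treats a column mask as removing the column from the projection, where a real catalog might rewrite the values instead.&lt;/p&gt;

```python
# Hypothetical table state held by the catalog.
DATA_FILES = [
    {"path": "s3://bucket/t/region=eu/f1.parquet", "region": "eu"},
    {"path": "s3://bucket/t/region=us/f2.parquet", "region": "us"},
    {"path": "s3://bucket/t/region=us/f3.parquet", "region": "us"},
]
TABLE_COLUMNS = ["order_id", "region", "email", "amount"]

# Policies defined once in the catalog, keyed by caller (hypothetical shape).
POLICIES = {
    "analyst_eu": {"row_filter": {"region": "eu"}, "masked_columns": {"email"}},
    "admin": {"row_filter": {}, "masked_columns": set()},
}


def plan_scan(caller, requested_columns):
    """Return only the files and columns this caller is allowed to read."""
    policy = POLICIES[caller]
    # Row filter: keep files whose partition values match every filter entry.
    files = [
        f["path"]
        for f in DATA_FILES
        if all(f.get(key) == value for key, value in policy["row_filter"].items())
    ]
    # Column mask: masked columns are removed from the projection.
    columns = [
        c
        for c in requested_columns
        if c in TABLE_COLUMNS and c not in policy["masked_columns"]
    ]
    return {"files": files, "columns": columns}


eu_plan = plan_scan("analyst_eu", ["order_id", "email", "amount"])
admin_plan = plan_scan("admin", ["order_id", "email", "amount"])
```

&lt;p&gt;In this toy, the EU analyst's plan never mentions the US files or the &lt;code&gt;email&lt;/code&gt; column, so the engine has nothing to leak. The same function serves every engine, which is the sense in which the policy is defined once.&lt;/p&gt;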

&lt;p&gt;Remote signing deserves the same attention for sensitive data. With credential vending, the catalog hands the engine a short-lived token scoped to a table. With remote signing, the engine gets no token at all. Every individual file read is pre-signed by the catalog, scoped to one file and one operation. For regulated data where even a few minutes of broad access is unacceptable, that difference matters, and the catalogs that support it, Polaris, Lakekeeper, and others, are starting to align on the Iceberg 1.11 signer endpoint properties so engines configure it the same way everywhere.&lt;/p&gt;
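
&lt;p&gt;The difference in blast radius can be sketched with the Python standard library. This is an illustration only: the token shape, the shared secret, and the function names are hypothetical, and real deployments use cloud-provider credentials and request-signing schemes instead of a bare HMAC. In particular, the toy lets the storage side verify with the catalog's own secret, which is a shortcut for the example.&lt;/p&gt;

```python
import hashlib
import hmac

# Hypothetical secret known only to the catalog (and, in this toy, the storage check).
CATALOG_SECRET = b"demo-secret-held-only-by-the-catalog"


def vend_table_token(table_prefix, expires_at):
    # Credential vending: one short-lived token covers every file under the table.
    return {"scope": table_prefix, "expires_at": expires_at}


def token_allows(token, file_path):
    # Scope check only; expiry handling is omitted to keep the sketch short.
    return file_path.startswith(token["scope"])


def sign_request(method, file_path):
    # Remote signing: the signature binds one operation to one file.
    message = (method + "|" + file_path).encode()
    return hmac.new(CATALOG_SECRET, message, hashlib.sha256).hexdigest()


def storage_accepts(method, file_path, signature):
    # Recompute the signature and compare in constant time.
    expected = sign_request(method, file_path)
    return hmac.compare_digest(expected, signature)


table = "s3://bucket/warehouse/orders/"
token = vend_table_token(table, expires_at=1_800_000_000)
signature = sign_request("GET", table + "data/f1.parquet")
```

&lt;p&gt;A leaked &lt;code&gt;token&lt;/code&gt; opens every file under the table prefix until it expires. A leaked &lt;code&gt;signature&lt;/code&gt; opens exactly one file for exactly one operation, which is the property that matters for regulated data.&lt;/p&gt;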

&lt;p&gt;Every catalog released after 2023 either speaks REST or is racing to add it. The question is no longer whether to use the protocol. The question is which REST implementation fits your stack, and that is what the rest of this piece works through.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iceberg v3 Lands, and v4 Is Already on the Whiteboard
&lt;/h2&gt;

&lt;p&gt;Two format milestones frame the catalog story this year.&lt;/p&gt;

&lt;p&gt;Iceberg v3 reached general availability across the major platforms in the first half of 2026. It adds deletion vectors, which speed up updates, merges, and deletes by marking deleted rows instead of rewriting files. It adds row tracking, which makes incremental processing far cheaper. It adds the VARIANT type, a standard way to store semi-structured data so JSON-shaped payloads stop forcing awkward workarounds. Snowflake, Databricks, and Amazon S3 Tables all confirmed v3 support as generally available, and the catalogs that store the metadata followed. This matters for catalogs because v3 features ride through the catalog API, and not every catalog supports creating v3 tables yet. AWS Glue, for example, still cannot create v3 tables through its REST &lt;code&gt;CreateTable&lt;/code&gt; path even though EMR and Glue ETL can work with them.&lt;/p&gt;

&lt;p&gt;The next frontier is already public. Databricks used its pre-summit blog to announce that Iceberg v4 will rethink the core metadata structure with an adaptive metadata tree, and that it is proposing Delta 5.0 adopt the same structure. The pitch is convergence: one metadata layout that both Delta and Iceberg share, ending the long trade-off between interoperability and production-grade performance. Whether the Iceberg community accepts that direction is an open conversation, and it is the kind of debate that plays out over months on the dev list. For now, treat v3 as the production target and v4 as the horizon worth watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake Summit 2026: Horizon Catalog, Powered by Polaris
&lt;/h2&gt;

&lt;p&gt;Snowflake Summit 2026 ran the first week of June, and the catalog news sat at the center of the keynote rather than buried in a breakout.&lt;/p&gt;

&lt;p&gt;The headline is that Horizon Catalog, Snowflake’s governance and discovery layer, now runs its interoperability on Apache Polaris and enables bi-directional read and write access to Snowflake-managed Iceberg tables from outside engines. That is a real shift. For years, “open” often meant external engines could read Snowflake data but not write it. The bi-directional write path closes that gap. An external Spark or Trino job can now write to a Snowflake-managed Iceberg table through Polaris-implemented open APIs, with Snowflake’s governance applied through the Iceberg REST Scan Plan API so fine-grained protections travel across compatible engines.&lt;/p&gt;

&lt;p&gt;It helps to keep two Snowflake products straight, because the naming confuses people. Snowflake Open Catalog is the managed Apache Polaris service for externally managed Iceberg tables, aimed at cross-engine interoperability with zero self-hosting. Snowflake Horizon Catalog is the governance and discovery layer for Snowflake-managed assets, and its interoperability layer is now built on the same Polaris engine. Snowflake has been explicit that it runs the same Polaris backbone the community downloads, not a stripped-down fork. That is a meaningful commitment in a space where “open” has been used loosely.&lt;/p&gt;

&lt;p&gt;Around the catalog, Snowflake added Horizon Context for an AI and BI context layer, Semantic Studio and Semantic View Autopilot for building shared business logic, and Adaptive Compute for matching resources to AI workloads. It also folded its Natoma acquisition into a set of agent identity and security features. The analyst read from Constellation Research was sharp: Iceberg v3 is table stakes, and the real story is read and write interoperability plus governance, trust, and context for agents. The format war is over, so the platforms are competing on meaning and control instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks Sets the Stage for Its Own Summit
&lt;/h2&gt;

&lt;p&gt;Databricks holds its Data + AI Summit from June 15 to 18, so the biggest stage-show announcements land the week after this writing. The company did not wait, though. It published a detailed Unity Catalog and Iceberg post on May 28 that reads like a marker planted firmly in the ground.&lt;/p&gt;

&lt;p&gt;The claim is direct: Unity Catalog is the most complete and interoperable Iceberg catalog available, and the proof is a batch of capabilities moving to general availability. Managed Iceberg is GA, so you create, read, write, optimize, govern, and share Iceberg tables directly in Unity Catalog with Predictive Optimization and Liquid Clustering handling the tuning. Iceberg v3 is GA, with deletion vectors, row tracking, and VARIANT across managed, foreign, and UniForm-enabled tables. Foreign Iceberg is GA, along with credential vending for foreign Iceberg, so Unity governs and securely queries tables that live in other catalogs. External sharing to Iceberg clients is GA through the open Delta Sharing protocol, with foreign Iceberg sharing in public preview.&lt;/p&gt;

&lt;p&gt;Databricks framed the pitch around five requirements it says define a real Iceberg catalog: open APIs with credential vending, federation across external estates, cross-engine governance, secure and open sharing, and continuous performance and format innovation. The cross-engine governance piece is the technically interesting one. Cross-engine attribute-based access control is in beta, and it works by enforcing column masks and row filters during server-side scan planning through the Iceberg REST scan APIs. Any engine that implements the scan planning client from Iceberg 1.11, such as Spark or DuckDB, gets the same policies applied without a Databricks runtime. New federation connectors in preview extend Unity beyond Glue, Snowflake Horizon, Hive Metastore, and Salesforce Data Cloud to include Google Cloud Lakehouse and Palantir.&lt;/p&gt;

&lt;p&gt;The honest read on Databricks is the same as it has been. The managed Unity Catalog is excellent and deeply tied to the Databricks platform. The open source Unity Catalog under Linux Foundation governance is a separate, slower-moving project with a real feature gap, and you should not assume parity between the two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris: The Community Standard Comes of Age
&lt;/h2&gt;

&lt;p&gt;Apache Polaris is the catalog that gained the most ground in the last year, and the trajectory is worth laying out.&lt;/p&gt;

&lt;p&gt;Snowflake and Dremio co-created Polaris and donated it to the Apache Software Foundation in August 2024. It incubated for 18 months with contributions from Google, Microsoft, Confluent, and dozens of other organizations, and it graduated to an Apache top-level project on February 18, 2026. The 1.0 release shipped in October 2025 with external identity provider support for Okta and Google, a persistent policy store for things like compaction and snapshot expiration, and a downloadable binary plus Helm chart. The 1.4 release in April 2026 was the first post-graduation drop, and it pushed hard on production hardening: storage-scoped AWS credentials, AWS STS session tags so CloudTrail can correlate access, S3 KMS encryption support, CockroachDB as a persistence backend, and Iceberg metrics persistence to the database.&lt;/p&gt;

&lt;p&gt;What Polaris does well is the core a vendor-neutral catalog needs. It implements the Iceberg REST spec fully, including credential vending, server-side deconflicting, multi-table commits, and OAuth2. Its access model uses a clean hierarchy of principals, principal roles, and catalog roles, which decouples identity from permissions and enforces security at the catalog layer no matter which engine runs the query. A single Polaris server manages many logical catalogs, each with its own storage and keys. Catalog federation lets one Polaris instance route to Hive Metastore, Glue, and other Iceberg REST endpoints, so you adopt it incrementally instead of doing a big-bang metadata migration. Generic Tables register non-Iceberg assets like Delta and Hudi alongside Iceberg tables in the same namespace, and the same feature opens a path to storing semantic assets like metric definitions in the catalog itself. Open Policy Agent integration is maturing for teams that want external authorization.&lt;/p&gt;
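
&lt;p&gt;The principal, principal role, and catalog role hierarchy is easier to see as data. The sketch below is a hedged illustration of that three-level lookup, not the Polaris API or its privilege names: every identifier, role, and action string is hypothetical.&lt;/p&gt;

```python
# Identity side: which principal roles each principal holds (hypothetical names).
PRINCIPAL_ROLES = {
    "spark_etl": {"data_engineer"},
    "bi_dashboard": {"analyst"},
}

# Bridge: which catalog roles each principal role is granted.
CATALOG_ROLES = {
    "data_engineer": {"sales_writer"},
    "analyst": {"sales_reader"},
}

# Permission side: which (table, action) pairs each catalog role carries.
GRANTS = {
    "sales_writer": {("sales.orders", "read"), ("sales.orders", "write")},
    "sales_reader": {("sales.orders", "read")},
}


def is_allowed(principal, table, action):
    """Walk principal to principal role to catalog role to grant."""
    for principal_role in PRINCIPAL_ROLES.get(principal, set()):
        for catalog_role in CATALOG_ROLES.get(principal_role, set()):
            if (table, action) in GRANTS.get(catalog_role, set()):
                return True
    return False


etl_can_write = is_allowed("spark_etl", "sales.orders", "write")
dashboard_can_write = is_allowed("bi_dashboard", "sales.orders", "write")
dashboard_can_read = is_allowed("bi_dashboard", "sales.orders", "read")
```

&lt;p&gt;Because identity and permissions meet only in the middle mapping, you can change what a role is allowed to do without touching any principal, and the check gives the same answer whichever engine issued the request.&lt;/p&gt;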

&lt;p&gt;The recent pull request activity shows where the project is putting its energy. In early June the community merged a credential vending refactor in core, added support for access delegation in &lt;code&gt;registerTable&lt;/code&gt;, and moved event listeners onto a dedicated thread pool so the audit and change-event path does not block commits. There was also cleanup that says a lot on its own: a fix removing the &lt;code&gt;incubator&lt;/code&gt; path segment from binary distribution URLs, one of the small chores that follow graduation. The forward work the community keeps discussing is the Table Sources proposal, which aims to turn Polaris into a registry for every lakehouse asset, not just tables and views but functions, metrics, and models. If that ships, the catalog becomes the single place every team and every agent looks for governed, semantically rich data.&lt;/p&gt;

&lt;p&gt;The honest limits are real. Polaris is a Quarkus-based JVM service, so the open source path means you run and scale it yourself along with a PostgreSQL, MySQL, or CockroachDB backend. It has no Git-style branching the way Nessie does. And the line between the Apache project and Snowflake’s commercial Open Catalog can blur, so feature parity between the two is not guaranteed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Nessie: Git for Your Catalog
&lt;/h2&gt;

&lt;p&gt;Project Nessie, created by Dremio, takes a different angle that nothing else on this list matches. It brings Git-like semantics to catalog metadata. You create branches, tags, and commits over the entire catalog state, which lets you run isolated experiments, build CI/CD workflows for data, and roll the whole catalog back to a previous commit.&lt;/p&gt;

&lt;p&gt;The branching is the point. You spin up &lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, and &lt;code&gt;feature&lt;/code&gt; branches of your catalog, write to a branch in isolation, then merge when the work is ready. That is genuinely useful for testing schema changes, validating a backfill, or doing feature engineering against production data without touching live tables. Catalog-level time travel gives you a global undo across every table at once, not just per-table snapshots. Merges provide atomic visibility, and cherry-pick works exactly like it does in Git. Nessie implements the Iceberg REST interface, so engines connect over the standard protocol, and the 0.107.5 release in April 2026 added Spark SQL 4.0 extensions for branch and tag management.&lt;/p&gt;
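
&lt;p&gt;Catalog-level branching can be sketched as a map of branch names to table pointers. This is a toy model under simplifying assumptions, not the Nessie API: &lt;code&gt;ToyVersionedCatalog&lt;/code&gt; and its methods are hypothetical, and the merge simply lets the source branch win, where a real implementation tracks commit history and detects conflicts.&lt;/p&gt;

```python
class ToyVersionedCatalog:
    """Hypothetical catalog where each branch maps table name to metadata file."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A branch starts as a copy of the source branch's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, metadata_location):
        self.branches[branch][table] = metadata_location

    def merge(self, source, target="main"):
        # Build the merged state first, then swap it in with one assignment,
        # so readers of the target see all changed tables together.
        merged = dict(self.branches[target])
        merged.update(self.branches[source])
        self.branches[target] = merged


catalog = ToyVersionedCatalog()
catalog.commit("main", "orders", "orders/v1.metadata.json")
catalog.commit("main", "customers", "customers/v1.metadata.json")

# Work on both tables in isolation on a dev branch.
catalog.create_branch("dev")
catalog.commit("dev", "orders", "orders/v2.metadata.json")
catalog.commit("dev", "customers", "customers/v2.metadata.json")

main_before_merge = dict(catalog.branches["main"])
catalog.merge("dev")
main_after_merge = dict(catalog.branches["main"])
```

&lt;p&gt;Until the merge, &lt;code&gt;main&lt;/code&gt; still points at the old metadata for both tables. After it, both pointers move together, which is the atomic visibility the branching model provides across tables.&lt;/p&gt;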

&lt;p&gt;The limits keep Nessie in a specialist role rather than a default. It has no built-in fine-grained access control, so production deployments pair it with Polaris, an OPA layer, or a custom authorization service. It does not vend credentials, so engines bring their own storage access. And the branching itself is only worth the operational overhead if your workflows actually benefit from data CI/CD. For a team that just needs metadata resolution and access control, branch management is complexity without payoff. The merges also provide atomic visibility rather than true multi-statement ACID, which is a distinction worth understanding before you design around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Gravitino: The Federated Metadata Lake
&lt;/h2&gt;

&lt;p&gt;Apache Gravitino is the most ambitious project in this group, and it frames itself as more than an Iceberg catalog. It calls itself a federated metadata lake, a single layer for tables, files, models, Kafka topics, and UDFs across many backend systems. It graduated to an Apache top-level project in June 2025, shipped 1.0, and reached 1.2.0 on March 13, 2026.&lt;/p&gt;

&lt;p&gt;The breadth is the selling point. Gravitino connects to Hive, MySQL, PostgreSQL, HDFS, S3, Iceberg, Hudi, Paimon, ClickHouse, StarRocks, OceanBase, and more through one API, with changes reflected through direct connectors instead of ETL-based metadata sync. It runs a native Iceberg REST endpoint so any REST-compatible engine treats it as an Iceberg catalog. The 1.2.0 release added a Table Maintenance Service that schedules table health work proactively, a ClickHouse catalog for governing real-time analytics next to the lakehouse, end-to-end UDF management, authorization for Iceberg view operations, a redesigned web UI, and scan planning offload so engines like DuckDB and Spark delegate planning to Gravitino’s IRC server. The project also leaned into AI-native metadata in 2025 with a Model Catalog, an MCP server to connect agents to data context, and a Lance REST service for vector data.&lt;/p&gt;

&lt;p&gt;The recent pull requests reinforce the federation-first identity. In early June the community merged Flink connector view support for Iceberg and Paimon catalogs, a Glue catalog UI in the new web console, support for complex types in Iceberg tables managed through Glue, and REST catalog backend HTTP timeout configs. These are the connector and integration fixes a project ships when its job is to sit in front of many systems at once.&lt;/p&gt;

&lt;p&gt;The limits follow from the ambition. Documentation lags the feature set, especially around production hardening. Running Gravitino means operating a JVM server, its connector layer, and the federation topology, which is a large configuration surface. Engine integration is most mature for Trino, with Spark and Flink progressing but not at parity. And if you only need an Iceberg catalog, Gravitino is more machine than the job requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lakekeeper: The Lightweight Rust Option
&lt;/h2&gt;

&lt;p&gt;Lakekeeper is the youngest catalog here and the most opinionated about staying small. It is written entirely in Rust and ships as a single binary with no JVM and no Python. Point it at a PostgreSQL database and it serves REST requests in milliseconds, which makes it a natural fit for containers and Kubernetes.&lt;/p&gt;

&lt;p&gt;It implements the full Iceberg REST spec, including multi-table commits, server-side deconflicting, and table and view statistics. Storage access uses vended credentials and remote signing across S3, GCS, ADLS, and on-premise S3-compatible stores. Authorization runs on OpenFGA by default with an OPA bridge for Trino, and authentication accepts any OIDC provider plus native Kubernetes service account auth. A single deployment serves many isolated projects and warehouses, and built-in CloudEvents emission lets you react to table changes by triggering compaction or feeding a CDC pipeline. The 0.12.0 release in April 2026 concentrated on authorization, adding an audit event handler with exactly-once guarantees, OPA batch optimization, Trino custom rule extensions, configurable admin users, and better role lifecycle management.&lt;/p&gt;

&lt;p&gt;The recent pull requests show the same focus sharpening. In early June the project added a role-membership backend with role-in-role nesting and bounded nesting depth at write time, published support for Cedar policies including a &lt;code&gt;global_role_ids&lt;/code&gt; requirement, and started emitting the Iceberg 1.11 &lt;code&gt;signer.uri&lt;/code&gt; and &lt;code&gt;signer.endpoint&lt;/code&gt; properties so remote signing lines up with the latest spec. There was also a fix to retry transient failures when acquiring storage OAuth tokens, the kind of reliability work that matters at scale.&lt;/p&gt;

&lt;p&gt;The limits are mostly about maturity and scope. It is a young project with a smaller community, so production deployment stories are still accumulating. It has no branching. PostgreSQL is the backing store unless you implement the storage trait yourself. And it has been validated most with Spark, PyIceberg, Trino, and StarRocks, with Flink and Hive less proven. For teams that want a fast, dependency-light catalog with strong authorization, though, it is a strong pick. A commercial Lakekeeper Plus edition from Vakamo adds enterprise maintenance and snapshot management, and Red Hat certified it for OpenShift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unity Catalog Open Source: The Other Half of the Story
&lt;/h2&gt;

&lt;p&gt;The managed Unity Catalog is a Databricks product, but the open source Unity Catalog is its own project under Linux Foundation governance, and it deserves a separate look because the two move at different speeds.&lt;/p&gt;

&lt;p&gt;The open source pull request activity in late May and early June tells you the project is converging on Delta-first managed tables while keeping the Iceberg REST path. Recent merges made the Delta REST API enabled by default, enabled managed tables by default with &lt;code&gt;server.managed-table.enabled=true&lt;/code&gt;, added support for column default values, enforced case-insensitive Delta column names, and turned on credential-scoped filesystem access by default in the Spark connector. A run of changes renamed and tightened the Delta API contract. The direction is a more opinionated, batteries-included server that works out of the box rather than requiring deep configuration.&lt;/p&gt;

&lt;p&gt;The takeaway holds steady. If you run Databricks, the managed Unity Catalog is the natural and often mandatory choice, with Predictive Optimization, Liquid Clustering, and AI asset governance you do not get elsewhere. If you run the open source version off-platform, expect a real feature gap and plan around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Managed and Cloud-Native Catalogs
&lt;/h2&gt;

&lt;p&gt;Self-hosting is not the only path, and for many teams it is the wrong one. The cloud providers all ship managed catalog services that trade portability for zero operations.&lt;/p&gt;

&lt;p&gt;Snowflake Open Catalog is the managed Apache Polaris service. You get the same REST API, RBAC, and credential vending as the open source project with nothing to host. It is generally available and free today, with pay-per-request billing planned for later in 2026. For teams that want Polaris without operating a JVM service, it is the path of least friction, and it stays vendor-neutral because the underlying project is.&lt;/p&gt;

&lt;p&gt;AWS gives you two related options. The AWS Glue Data Catalog is the long-standing managed, serverless metadata service, deeply tied to IAM, Lake Formation, Athena, EMR, and Redshift. It added an Iceberg REST endpoint in late 2024, so external engines connect without Glue-specific SDKs. The limits are well known: it is AWS-only with no built-in cross-cloud federation, it supports a single level of namespace nesting, it has no branching or multi-table commits, and its REST surface has gaps. &lt;code&gt;UpdateTable&lt;/code&gt; is not supported for Iceberg tables through the REST API, v3 tables cannot be created through the REST &lt;code&gt;CreateTable&lt;/code&gt; path, and the REST endpoint does not vend credentials. The newer option is Amazon S3 Tables, which are first-class AWS resources that expose the Iceberg REST Catalog API and deliver up to ten times higher transactions per second than Iceberg tables in general-purpose buckets. S3 Tables now support Iceberg v3, include table-level access control and built-in maintenance, and integrate with SageMaker Lakehouse for unified governance and fine-grained access control. The open source S3 Tables Catalog client library bridges the control-plane operations to engines like Spark.&lt;/p&gt;

&lt;p&gt;Google BigLake Metastore is a serverless, managed Iceberg REST catalog on GCP. It supports interoperability between Spark, Trino, and BigQuery on the same tables in Cloud Storage, and it includes BigQuery federation so a table created in Spark is queryable in BigQuery without a copy. Microsoft Fabric OneLake Catalog manages metadata for tables across Fabric workspaces with Delta and Iceberg support, tightly bound to the Fabric platform.&lt;/p&gt;

&lt;p&gt;Streaming sources are part of this picture too, and they are easy to forget. Confluent’s Tableflow materializes Kafka topics directly as Iceberg tables and registers them in a catalog, so the data an application produces lands in the lakehouse as a governed Iceberg table without a separate batch pipeline. Confluent was one of the original Polaris contributors, and the pattern matters because it means the catalog is no longer fed only by batch ETL. Real-time data writes straight into it. Any catalog you choose has to handle a write path that includes streaming ingestion, not just nightly jobs, and the ones with server-side commit deconflicting handle the concurrent writes that streaming produces far better than the ones without it.&lt;/p&gt;

&lt;p&gt;Dremio also offers a managed Polaris-based catalog as part of its platform, called Open Catalog. It gets its own section below, because the changes there over the last six months are substantial enough to treat on their own.&lt;/p&gt;

&lt;p&gt;For completeness, the Iceberg project also ships a JDBC catalog that stores metadata pointers in any JDBC-compatible database. A SQLite-backed JDBC catalog is excellent for local development, unit tests, and CI because it needs no cloud services. A PostgreSQL-backed one works for single-writer or moderate-concurrency production. It is not a REST catalog, though, so engines need JDBC drivers on the classpath, and you get no credential vending, no server-side deconflicting, and no multi-table commits. Treat it as a stepping stone, not a destination.&lt;/p&gt;
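
&lt;p&gt;To make "stores metadata pointers" concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;table_pointers&lt;/code&gt; schema and the function names are assumptions made for illustration, not the Iceberg project's actual DDL or API. The idea it shows is that a commit is a compare-and-swap on one row: the pointer moves only if no other writer moved it first.&lt;/p&gt;

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open a toy pointer catalog (hypothetical schema, one row per table)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_pointers ("
        " namespace TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " metadata_location TEXT NOT NULL,"
        " PRIMARY KEY (namespace, name))"
    )
    return conn


def register(conn, namespace, name, location):
    """Create the pointer row for a new table."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO table_pointers VALUES (?, ?, ?)",
            (namespace, name, location),
        )


def current_location(conn, namespace, name):
    """Return the current metadata file location, or None if unknown."""
    row = conn.execute(
        "SELECT metadata_location FROM table_pointers"
        " WHERE namespace = ? AND name = ?",
        (namespace, name),
    ).fetchone()
    return row[0] if row else None


def commit(conn, namespace, name, expected, new):
    """Swap the pointer only if it still equals `expected`."""
    with conn:
        cur = conn.execute(
            "UPDATE table_pointers SET metadata_location = ?"
            " WHERE namespace = ? AND name = ? AND metadata_location = ?",
            (new, namespace, name, expected),
        )
    return cur.rowcount == 1  # False means another writer committed first
```

&lt;p&gt;A writer that loses the race gets &lt;code&gt;False&lt;/code&gt; back and has to re-read the pointer and retry on its own. In this sketch nothing on the server reconciles the conflict for it, which is the gap the paragraph above describes.&lt;/p&gt;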

&lt;h2&gt;
  
  
  Dremio: The Agentic Lakehouse Built on Polaris
&lt;/h2&gt;

&lt;p&gt;Dremio sits in an unusual spot in this map. It co-created Apache Polaris and Apache Arrow, it is one of the most active Polaris contributors, and its Open Catalog uses Polaris at the core rather than a separate fork. So when you adopt Dremio’s catalog, you adopt the same open standard the community governs, with Dremio’s platform built around it. That framing matters for what changed over the last six months, because Dremio spent the period turning its catalog from a managed metadata service into the center of an autonomous, agent-first platform.&lt;/p&gt;

&lt;p&gt;The repositioning came at the Subsurface conference in November 2025, when Dremio relaunched Dremio Cloud as “the Agentic Lakehouse,” described as built for agents and managed by agents. The pitch casts AI agents as first-class operators of the platform rather than copilots bolted onto the side, and the catalog is the foundation it all sits on. Through the first half of 2026 the company shipped the pieces that back the claim.&lt;/p&gt;

&lt;p&gt;Start with the catalog itself. Open Catalog is managed Polaris, provisioned the moment you start, so you get RBAC, credential vending, and the Iceberg REST spec without operating a JVM service. Dremio extends it with fine-grained access control through UDFs, which adds row-level security and column masking that travel with the data across every access path, not just inside one engine. Its query federation engine connects databases, warehouses, and external catalogs such as PostgreSQL, Snowflake, BigQuery, Glue, and Unity Catalog into the same governed namespace, so the catalog governs more than Iceberg tables. On top, the AI Semantic Layer lets teams build curated SQL views in Bronze, Silver, and Gold tiers with wikis, tags, and AI-generated metadata, which is the business context an agent needs to turn a vague question into a correct query.&lt;/p&gt;

&lt;p&gt;The autonomous side is where the last six months added the most. Dremio Cloud now runs an active metadata system that watches query patterns, data relationships, and usage trends to make optimization decisions on its own. It automatically builds performance materializations through Reflections and rewrites incoming SQL in real time to hit sub-second response. It reorganizes physical data layouts through automated clustering based on access patterns. And it runs compaction and table maintenance on the Iceberg tables in the catalog without a human scheduling the jobs. This is the same operational layer the rest of this piece keeps pointing at, the work catalogs historically do not do, folded directly into the platform.&lt;/p&gt;

&lt;p&gt;Two open-standard milestones in the window reinforced the position. Polaris graduated to a top-level Apache project in February 2026, which hardened the open core under Dremio’s Open Catalog, and Dremio used the moment to highlight new community appointments and its continued contribution pace. In April 2026, Dremio brought Iceberg v3 support to general availability in Dremio Cloud, putting deletion vectors, row tracking, and VARIANT in reach for its users at the same time the other major platforms shipped v3. The company also leaned on its own research, a 2026 State of the Data Lakehouse and AI report, where 65 percent of organizations named agentic analytics a top priority for the year and 70 percent pointed to siloed data and weak governance as the main obstacles to getting value from AI. That data is the argument for the whole agentic pitch.&lt;/p&gt;

&lt;p&gt;The agent connectivity story is worth calling out on its own. Dremio Cloud natively supports the Model Context Protocol, so any MCP-enabled agent from Anthropic, OpenAI, or Google connects to the catalog and semantic layer through a standard interface. It also ships its own AI Agent for business users and analysts to ask questions and get answers and visualizations directly. Both paths read the same governed catalog and the same semantic definitions, which is the point of putting meaning in the catalog rather than in each tool.&lt;/p&gt;

&lt;p&gt;The honest framing is the same one that applies to every managed platform here. Dremio’s value is the integration: catalog, federation, semantic layer, autonomous optimization, and agent access in one place, so you do not assemble five tools and wire them together. The trade is platform coupling. The mitigating factor specific to Dremio is that the catalog core is open Polaris and the tables are open Iceberg, so the lock-in is lighter than a proprietary catalog and you can point other engines at the same data. For teams that want the autonomous and agentic capabilities without building them, that integration is the draw. For teams that want only a bare catalog, Open Catalog is more platform than the job needs, and self-hosted Polaris is the leaner path.&lt;/p&gt;

&lt;p&gt;Here is the thing almost every catalog comparison skips. None of these catalogs tell you whether a table is healthy. They resolve pointers and enforce access. They do not track orphan files piling up, manifests that need consolidation, snapshot history eating storage, or a compaction schedule falling behind ingestion. A catalog tells you where the data is and who can touch it. It does not keep the data fast.&lt;/p&gt;

&lt;p&gt;That gap is closing from two directions, and watching how is one of the clearest signals about where the market is going.&lt;/p&gt;

&lt;p&gt;The first direction is catalogs absorbing maintenance. Gravitino 1.2.0 shipped a Table Maintenance Service. Databricks built Predictive Optimization and Liquid Clustering into Unity Catalog so maintenance runs based on access patterns. AWS S3 Tables include automatic compaction. Polaris added a policy store for compaction and snapshot expiration in 1.0. The catalog is slowly becoming the place where table health gets managed, not just where metadata lives.&lt;/p&gt;

&lt;p&gt;The second direction is a dedicated operational tier that sits next to the catalog. This is where the year’s most telling acquisition comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Ryft Acquisition Signals
&lt;/h2&gt;

&lt;p&gt;On April 23, 2026, Cyera acquired Ryft. The price was not disclosed, but Israeli press put it between 100 and 130 million dollars, a strong return on Ryft’s eight million dollar seed round and a notable outcome for a company founded only in 2024.&lt;/p&gt;

&lt;p&gt;Ryft built an automated Iceberg management platform. It monitored an entire Iceberg lakehouse, detected tables with too many small files or partition schemes that forced wasteful scans, and ran compaction and layout optimization based on actual usage patterns, with claims of cutting query times and costs by up to ten times. It also handled snapshot lifecycle policies, automated data retention, and GDPR-style compliance deletion, the operational chores that keep a lake healthy and audit-ready. In early 2026 it added a Lakehouse Context Layer that turned the signals it already collected, schema, query patterns across engines, freshness, and statistics, into agent-readable context for every table.&lt;/p&gt;

&lt;p&gt;Cyera is not a data analytics company. It is an AI security platform valued at nine billion dollars after a recent 400 million dollar Series F, focused on data security posture management for the age of autonomous agents. It bought Ryft to extend its control plane into the data lake layer, where agents increasingly operate, and Ryft’s CEO is now leading AI security efforts at Cyera. Read that again. A security vendor paid nine figures for an Iceberg operations startup so it could give AI agents traceable, governed, secure access to lakehouse data.&lt;/p&gt;

&lt;p&gt;That tells you two things. Iceberg table operations, the compaction and lifecycle work catalogs do not handle, are now valuable enough that a security giant pays a premium for them. And the reason is agents. The lakehouse is becoming the place agents read and write, and whoever controls the operational and security layer around the catalog controls how safely that happens. Independent operational vendors like LakeOps make the same bet from a different angle, connecting to existing catalogs and adding autonomous maintenance on top. The catalog resolves metadata and access. Something else has to keep the tables healthy and keep the agents honest. That layer is now contested ground.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Catalog Is Becoming the AI Control Plane
&lt;/h2&gt;

&lt;p&gt;Step back from the individual products and a pattern is obvious. Every catalog roadmap in 2026 is bending toward AI agents, and the bending is reshaping what a catalog is.&lt;/p&gt;

&lt;p&gt;Start with what agents need. A human analyst who writes a wrong query notices the result looks off and fixes it. An agent querying tables at high frequency, without review, does not. It needs the catalog to carry enough context that a generic question produces a correct answer: what a metric means, how a table is joined, which rows a caller is allowed to read. That pushes three things into the catalog that used to live elsewhere.&lt;/p&gt;

&lt;p&gt;The first is semantics. Polaris stores Iceberg SQL view definitions, so the meaning of “active customer” lives in the catalog and every engine reads the same definition. Its Generic Tables feature lets teams register metric definitions, ownership, and lineage as governed assets next to the data. The Table Sources proposal aims to extend that to functions, metrics, and models. Snowflake added Horizon Context and Semantic Studio for the same reason. The catalog is turning into the place business meaning is stored, not just table locations.&lt;/p&gt;

&lt;p&gt;The second is machine-readable access. Gravitino shipped an MCP server in 2025 so agents connect to data context through the Model Context Protocol, and a Model Catalog and Lance REST service for vector data. The acquired Ryft platform built a Lakehouse Context Layer that turned table usage signals into agent-readable context. The direction is the same across vendors: the catalog should expose itself to an agent the way it exposes itself to a query engine, through a standard interface that carries context, not just metadata.&lt;/p&gt;

&lt;p&gt;The third is governance that holds when the caller is not a person. Cross-engine attribute-based access control through scan planning is the clearest example. When an agent shifts identity based on the task and the chain of delegation, as Cyera described when it bought Ryft, the old model of trusting the engine breaks down. Enforcing row filters and column masks during server-side planning means the policy holds no matter which agent or engine asks. That is why a security company paid nine figures for an Iceberg operations startup. The catalog and the layer around it are becoming the control plane for how agents touch enterprise data, and whoever owns that owns a lot.&lt;/p&gt;
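
&lt;p&gt;A toy sketch of that idea follows. It is not the Iceberg REST scan planning API. The &lt;code&gt;plan_scan&lt;/code&gt; function, the policy structure, and the caller attributes are all hypothetical names chosen for illustration. What it shows is the ordering: the row filter and column masks run before any row is handed back, so the result depends on the caller's attributes and not on which engine or agent asked.&lt;/p&gt;

```python
def plan_scan(rows, policy, caller):
    """Return only the rows and values this caller may see.

    `policy` is a hypothetical structure: a row-filter predicate plus a
    mapping of column name to masking function.
    """
    visible = []
    for row in rows:
        if not policy["row_filter"](row, caller):
            continue  # dropped before the engine ever sees the row
        out = dict(row)
        for column, mask in policy["column_masks"].items():
            if column in out:
                out[column] = mask(out[column], caller)
        visible.append(out)
    return visible


# Hypothetical policy: callers see only their own region, and emails are
# masked unless the caller carries the "pii" entitlement.
POLICY = {
    "row_filter": lambda row, caller: row["region"] == caller["region"],
    "column_masks": {
        "email": lambda value, caller: (
            value if "pii" in caller["entitlements"] else "***"
        ),
    },
}

ROWS = [
    {"id": 1, "region": "eu", "email": "a@example.com"},
    {"id": 2, "region": "us", "email": "b@example.com"},
]

agent = {"region": "eu", "entitlements": []}
result = plan_scan(ROWS, POLICY, agent)
# result holds one row: id 1, region "eu", email masked as "***"
```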

&lt;p&gt;This is the real reason the catalog question got urgent. A catalog used to be plumbing. In an agent-driven lakehouse it is the place trust, meaning, and access all converge, and the products are racing to become that convergence point.&lt;/p&gt;

&lt;p&gt;For all the progress, three hard problems sit unsolved across the field.&lt;/p&gt;

&lt;p&gt;The first is governance portability. Access control policies live in the catalog, and there is no industry standard for sharing them across catalogs. Set up row-level security in Unity Catalog and that policy does not transfer to Polaris. Define namespace grants in Polaris and they do not apply when the same table is read through Glue. The practical answer most architects reach is to pick one catalog as the governance boundary and route every engine through it, rather than running several catalogs with duplicated and inevitably inconsistent rules. Federation features in Polaris, Unity, and Gravitino help by centralizing the access layer even when metadata lives in distributed backends, and the Iceberg REST scan planning APIs are starting to make cross-engine policy enforcement real. But there is still no portable policy format, and until there is, multi-catalog governance stays a manual, error-prone job.&lt;/p&gt;

&lt;p&gt;The second is the gap between open and managed. Every major vendor now ships an open source catalog and a managed one, and the managed version is consistently more capable. Unity Catalog open source trails the Databricks version. Snowflake’s and Dremio’s Open Catalogs track Apache Polaris closely, which is the healthiest case, but the surrounding Horizon Catalog features are Snowflake’s own. The word “open” carries weight in this space, and the careful move is to check whether the open project is the same code the vendor runs in production or a slower sibling. Polaris graduating to a top-level project with Snowflake stating it runs the same backbone is the strongest version of that promise so far. It is also the exception worth holding others against.&lt;/p&gt;

&lt;p&gt;The third is operational reliability, and it is the one teams underestimate until it bites. The catalog is a Tier-1 dependency. If it goes down, no engine resolves metadata, and every read and write across the lake stops at once. That is a different blast radius than a single failed query. The catalogs vary widely in how ready they are for this. The managed services handle availability for you, which is most of why teams pick them. The self-hosted options put it on you: run the JVM service or the Rust binary with replication, back up the persistence layer, monitor P99 latency with a target under half a second, and plan failover before you need it. The newer projects have fewer battle-tested deployment stories, which is a real consideration for a service this central. Whatever you choose, treat the catalog with the same seriousness you treat a production database, because functionally that is what it is.&lt;/p&gt;
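
&lt;p&gt;As a small worked example of the latency check, the sketch below computes a nearest-rank P99 over a window of request latencies and compares it with the half-second target. The function names and the sample numbers are made up for illustration. A real deployment would more likely read this from its metrics system than compute it by hand.&lt;/p&gt;

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.99 * n), which avoids float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_target(samples_ms, target_ms=500):
    """True when the window's P99 is at or under the target."""
    return target_ms >= p99(samples_ms)


# Hypothetical windows of 100 catalog requests each, in milliseconds.
healthy = [20] * 98 + [450, 900]        # one slow outlier: P99 is 450
degraded = [20] * 97 + [450, 900, 1200]  # two slow outliers: P99 is 900
```

&lt;p&gt;With these samples, &lt;code&gt;within_target(healthy)&lt;/code&gt; is true and &lt;code&gt;within_target(degraded)&lt;/code&gt; is false, even though the typical request in both windows takes 20 ms. That is why the tail, not the average, is the number to alert on for a dependency every engine waits on.&lt;/p&gt;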

&lt;h2&gt;
  
  
  How to Choose in 2026
&lt;/h2&gt;

&lt;p&gt;There is no single right answer, and anyone who tells you otherwise is selling something. The choice comes down to your constraints, your existing stack, and which trade-offs you accept.&lt;/p&gt;

&lt;p&gt;If you live entirely on AWS and want zero operations, Glue or S3 Tables is the path of least resistance, and you accept the cloud coupling. If you want a vendor-neutral, multi-engine, multi-cloud catalog and you are willing to run a JVM service or use a managed Polaris offering, Apache Polaris is the community standard, available self-hosted, through Snowflake Open Catalog, or as the core of Dremio’s Open Catalog. If your workflows need branch isolation and data CI/CD, Nessie is the only option for Git-style version control, and you pair it with a policy layer for production security. If you are a Databricks shop, Unity Catalog is the natural and usually mandatory choice. If you have a heterogeneous platform with Hive here, PostgreSQL there, and Kafka somewhere else, Gravitino unifies the metadata under one API. If you want a fast, dependency-light catalog on Kubernetes with strong authorization, Lakekeeper is the cleanest pick. On GCP, BigLake Metastore is the managed default. And for local development, the SQLite JDBC catalog costs nothing and runs anywhere.&lt;/p&gt;

&lt;p&gt;For most organizations the realistic path is not one catalog forever. You run Glue for existing AWS workloads, add Polaris for multi-engine access, and use Nessie for a development environment that needs branch isolation. The REST protocol makes that coexistence practical, and federation in Polaris, Unity, and Gravitino makes it manageable.&lt;/p&gt;

&lt;p&gt;If there is one position worth holding firmly, it is this: bet on a REST-compatible implementation. Start with REST and you can swap catalog backends later without touching engine configuration. Start with the old Thrift-based Hive Metastore and you inherit a migration the day you outgrow it. That flexibility is worth more than any single feature on any single vendor’s slide.&lt;/p&gt;
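
&lt;p&gt;A rough sketch of what that swap looks like from a Spark engine's side follows. The property keys are the standard Iceberg Spark catalog settings. The catalog name &lt;code&gt;lake&lt;/code&gt;, the two endpoint URLs, and the warehouse name are placeholders invented for this example, and authentication settings are left out.&lt;/p&gt;

```python
def rest_catalog_conf(catalog_name, uri, warehouse):
    """Spark properties that point one named catalog at an Iceberg REST endpoint."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


def changed_keys(before, after):
    """Property keys whose values differ between two configurations."""
    return sorted(key for key in before if before[key] != after.get(key))


# Placeholder endpoints for two different REST catalog backends.
old = rest_catalog_conf("lake", "https://catalog-a.example.com/api", "analytics")
new = rest_catalog_conf("lake", "https://catalog-b.example.com/api", "analytics")

diff = changed_keys(old, new)
# diff == ["spark.sql.catalog.lake.uri"]
```

&lt;p&gt;The catalog implementation class, the catalog type, and every query that references &lt;code&gt;lake&lt;/code&gt; stay as they were. In this sketch only the endpoint moves, though in practice credentials and the warehouse identifier usually change with it.&lt;/p&gt;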

&lt;p&gt;The format war ended. The catalog war is just getting good. By the time Databricks finishes its summit on June 18, the v3 wave will be fully GA, the v4 and Delta 5.0 convergence debate will be in full swing, and agents will be querying more of these tables than people are. The teams that win the next two years are the ones who treat the catalog as the Tier-1 decision it has become, keep their governance boundary clear, and remember that resolving metadata is only half the job. Keeping the tables healthy and the agents accountable is the other half, and that half is still up for grabs.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Why Dremio's Value Is Unique to Apache Iceberg Lakehouses and Agentic Analytics</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 04 Jun 2026 19:16:34 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/why-dremios-value-is-unique-to-apache-iceberg-lakehouses-and-agentic-analytics-3oln</link>
      <guid>https://dev.to/alexmercedcoder/why-dremios-value-is-unique-to-apache-iceberg-lakehouses-and-agentic-analytics-3oln</guid>
      <description>&lt;p&gt;Most data teams have already made two decisions, even if they haven't written them down yet. The first is that Apache Iceberg will be the table format their analytical data lives in. The second is that AI agents will be querying that data, not just dashboards and analysts. The Apache Iceberg lakehouse and agentic analytics aren't separate initiatives. They're two halves of the same architecture, and the teams that treat them that way will get to trusted AI years ahead of the teams that don't.&lt;/p&gt;

&lt;p&gt;Here's the problem. The path between "we run a warehouse and some databases" and "agents answer business questions against governed Iceberg tables" is full of blockers. Migration risk. Table maintenance. Semantic context for AI. Mountains of unstructured documents. Most vendors solve one of these and leave you to stitch together the rest from three or four other products.&lt;/p&gt;

&lt;p&gt;Dremio is built to take you through all four. Its federated query engine lets you start before you migrate anything. Its autonomous management runs the Iceberg lakehouse for you. Its AI Semantic Layer, built-in AI Agent, MCP server, and CLI give agents governed access with real business meaning. And its AI Functions turn PDFs sitting in object storage into Iceberg tables with a single SQL statement.&lt;/p&gt;

&lt;p&gt;This post walks through why the Iceberg lakehouse and agentic analytics matter, what blocks teams from getting there, and how Dremio removes each blocker in order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filisik7ol10bqu22w6fr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filisik7ol10bqu22w6fr.png" alt="This post walks through why the Iceberg lakehouse and agentic analytics matter, what blocks teams from getting there, and how Dremio removes each blocker in order." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Want an Apache Iceberg Lakehouse and Agentic Analytics
&lt;/h2&gt;

&lt;p&gt;Start with the lakehouse half. The argument for storing your analytical data in Apache Iceberg tables on your own object storage comes down to three things: interoperability, cost, and control.&lt;/p&gt;

&lt;p&gt;Interoperability is the big one. Iceberg is an open table format with a published spec and a REST catalog standard. When your tables live in Iceberg, any compliant engine can read and write them. Dremio, Spark, Flink, Trino, Snowflake, and dozens of other tools all speak Iceberg now. That means you pick the best engine for each workload instead of the engine your storage vendor forces on you. Your streaming pipeline can write with Flink while your BI layer queries with Dremio, and both see the same consistent snapshots. No exports. No copies. No format conversion tax.&lt;/p&gt;

&lt;p&gt;Cost follows directly from that. Object storage like S3, ADLS, or GCS costs a fraction of proprietary warehouse storage, and you only pay for it once. The traditional pattern of copying the same data into a warehouse, a BI extract, and three departmental marts multiplies your storage bill and your governance surface at the same time. One Iceberg copy on cheap object storage, queried in place by whatever engine needs it, collapses that sprawl. You also escape the lock-in math where leaving a vendor means re-platforming years of accumulated tables.&lt;/p&gt;

&lt;p&gt;Control is the quieter benefit. Iceberg gives you warehouse-grade features (ACID transactions, schema evolution, partition evolution, time travel) on files you own, in buckets you control, governed by catalogs built on open standards like Apache Polaris. Your data stays in your storage. That's not a slogan. It's a negotiating position.&lt;/p&gt;

&lt;p&gt;Now the agentic half. Agentic analytics is what happens when AI agents query and act on enterprise data directly instead of waiting for a human to build a dashboard. The payoff is a quicker and far more democratized path to insight. A product manager asks a question in plain language and gets a chart in seconds. An agent monitors revenue anomalies overnight and files a summary before anyone logs in. Amazon's SCOT Finance Analytics team saw what this direction looks like in practice with Dremio, cutting query times from 60 seconds to between 4 and 6 seconds and eliminating 60 hours of work per project across more than 1,000 users. When the interface to data becomes a question instead of a ticket queue, the number of people who can get answers grows by an order of magnitude.&lt;/p&gt;

&lt;p&gt;Iceberg is what makes agentic analytics safe to run at that scale. Agents generate far more queries than humans do, with far more variety. They need a substrate that's consistent (so two agents never see two versions of the truth), cheap to scan (because exploratory query volume explodes), and rich in metadata (because snapshot and partition statistics are what let engines and optimizers answer fast without rescanning everything). Iceberg's snapshot isolation, metadata tree, and open access model check every box. Proprietary formats check none of them, because every new agent framework needs a new integration into the walled garden.&lt;/p&gt;

&lt;p&gt;It's worth being specific about which Iceberg features carry the load, because "open table format" undersells what the spec actually provides. Snapshot isolation means every query, human or agent, reads a consistent point-in-time view of a table even while writers commit. Hidden partitioning means consumers write natural predicates like &lt;code&gt;WHERE order_date &amp;gt; '2026-01-01'&lt;/code&gt; and the format handles partition pruning, so agents don't need tribal knowledge about physical layout to write fast queries. Schema and partition evolution mean tables adapt to changing business needs without rewrites or broken readers. Time travel means an agent's answer from last Tuesday can be reproduced exactly, which turns out to matter enormously when an AI-generated number ends up in a board deck and someone asks where it came from. And the Iceberg REST catalog specification means catalogs and engines interoperate through a standard API rather than one-off connectors.&lt;/p&gt;
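
&lt;p&gt;Hidden partitioning is the easiest of these to show in miniature. The sketch below assumes a table partitioned by &lt;code&gt;day(order_date)&lt;/code&gt; and models the file listing as a plain dictionary. The file names and helper functions are invented for illustration and do not reflect how any engine stores its manifests. The point is that the reader supplies only a date predicate, and the partition value is derived from it.&lt;/p&gt;

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def day_transform(d):
    """Days since the Unix epoch, the value a day partition transform yields."""
    return (d - EPOCH).days


# Hypothetical file listing: partition value mapped to data files.
FILES_BY_DAY = {
    day_transform(date(2025, 12, 31)): ["f1.parquet"],
    day_transform(date(2026, 1, 1)): ["f2.parquet"],
    day_transform(date(2026, 1, 2)): ["f3.parquet", "f4.parquet"],
}


def prune_after(files_by_day, cutoff):
    """Files that can contain rows with order_date > cutoff."""
    bound = day_transform(cutoff)
    kept = []
    for day, files in sorted(files_by_day.items()):
        if day > bound:  # earlier partitions cannot match the predicate
            kept.extend(files)
    return kept


to_scan = prune_after(FILES_BY_DAY, date(2026, 1, 1))
# to_scan == ["f3.parquet", "f4.parquet"]
```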

&lt;p&gt;None of these are exotic features. They're the table stakes of a trustworthy analytical substrate. The difference is that Iceberg delivers them in the open, on your storage, for every engine at once, where warehouses deliver them inside one vendor's walls.&lt;/p&gt;

&lt;p&gt;So the destination is clear: data in Iceberg, agents on top. The question is how you get there without a two-year replatforming project. That's where most teams stall, and it's where Dremio's design choices start to matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Blockers Between You and the Agentic Lakehouse
&lt;/h2&gt;

&lt;p&gt;Talk to any team that's attempted this move and the same four problems come up.&lt;/p&gt;

&lt;p&gt;First, migration itself. Your data lives in a warehouse, a handful of operational databases, and a pile of Parquet folders. Moving it all to Iceberg means rewriting pipelines while hundreds of dashboards and downstream consumers keep depending on the old locations. Big-bang cutovers fail often enough that most architects won't sign off on them, and for good reason.&lt;/p&gt;

&lt;p&gt;Second, ongoing management. An Iceberg lakehouse isn't a set-it-and-forget-it system. Streaming and frequent writes create thousands of small files. Metadata bloats. Old snapshots pile up. Someone has to schedule compaction, clustering, and vacuum jobs, and someone has to build and babysit the materialized views that keep dashboards fast.&lt;/p&gt;

&lt;p&gt;Third, business meaning for AI. An agent pointed at raw tables named &lt;code&gt;tbl_cust_ord_v3&lt;/code&gt; will hallucinate joins and invent metric definitions. Agents need a semantic layer with documented, governed definitions, plus tooling to query it. Buying a separate semantic layer product and building custom agent tooling on top is a six-month project before the first useful answer.&lt;/p&gt;

&lt;p&gt;Fourth, unstructured data. Contracts, invoices, support tickets, and scanned documents hold answers your agents need, but they're not rows in a table. The traditional fix is a separate OCR and extraction pipeline with its own infrastructure, its own failure modes, and its own team.&lt;/p&gt;

&lt;p&gt;Dremio addresses each of these in sequence. Let's take them one at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: Migrating Your Data to the Lakehouse Without Breaking Anything
&lt;/h2&gt;

&lt;p&gt;The standard migration playbook is brutal. Stand up the new platform, rebuild every pipeline, repoint every dashboard, run both systems in parallel for months, and pray the numbers match. Conventional modernization projects routinely run 6 to 18 months before users see any value, and the riskiest moment is the cutover itself.&lt;/p&gt;

&lt;p&gt;Dremio replaces that playbook with two capabilities working together: Zero-ETL Federation and the semantic layer.&lt;/p&gt;

&lt;p&gt;Zero-ETL Federation means Dremio queries data where it currently lives. Connect your existing PostgreSQL, SQL Server, Oracle, Snowflake, MongoDB, S3 buckets, and 35+ other source types, and Dremio presents them all behind one SQL interface. A single query can join a customer table still sitting in your warehouse with clickstream events already landed in Iceberg, and the person running it never knows the difference. Dremio pushes predicates and partial work down to each source so federated queries stay efficient rather than dragging full tables across the network.&lt;/p&gt;

&lt;p&gt;The semantic layer is where the migration strategy actually lives. On top of those federated sources, you build virtual views in Dremio that model every one of your use cases: a raw layer of views that standardize each source, a business layer that applies logic and joins, and an application layer that serves specific dashboards, reports, and agents. Your BI tools, notebooks, and AI agents all connect to these views, never to the physical sources underneath.&lt;/p&gt;
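
&lt;p&gt;A rough sketch of that layering, assuming a raw view named &lt;code&gt;raw.orders&lt;/code&gt; with &lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, and &lt;code&gt;order_date&lt;/code&gt; columns. The &lt;code&gt;business&lt;/code&gt; and &lt;code&gt;application&lt;/code&gt; folder names and both view names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Business layer: apply shared logic once, on top of the raw view.
CREATE OR REPLACE VIEW business.daily_revenue AS
SELECT order_date,
       SUM(amount) AS revenue,
       COUNT(order_id) AS order_count
FROM raw.orders
GROUP BY order_date;

-- Application layer: shape the business view for one consumer.
CREATE OR REPLACE VIEW application.revenue_dashboard AS
SELECT order_date, revenue, order_count
FROM business.daily_revenue
WHERE order_date &amp;gt;= DATE '2026-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A dashboard or agent would then query only &lt;code&gt;application.revenue_dashboard&lt;/code&gt;, never the source underneath it.&lt;/p&gt;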

&lt;p&gt;That indirection is the whole trick. Once every consumer reads from views, the physical location of the data becomes an implementation detail you can change whenever you want. The migration pattern looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Point a raw view at the legacy source (say, &lt;code&gt;raw.orders&lt;/code&gt; reading from PostgreSQL) and build your business views on top of it.&lt;/li&gt;
&lt;li&gt;Migrate that one dataset to an Apache Iceberg table on object storage on your own schedule, validating row counts and values while the legacy path keeps serving production.&lt;/li&gt;
&lt;li&gt;Update the SQL definition of &lt;code&gt;raw.orders&lt;/code&gt; to select from the new Iceberg table instead of PostgreSQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every subsequent query, from every dashboard and every agent, now runs against Apache Iceberg. No consumer changed a connection string. No downtime window was negotiated. No end user noticed anything except that queries got faster. Then you move to the next dataset. Week one might be orders, week three might be customers, and the warehouse drains incrementally while production never blinks.&lt;/p&gt;

&lt;p&gt;In SQL terms the swap is almost anticlimactic. Before the migration, the raw view reads from the legacy source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;postgres_prod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you've landed and validated the Iceberg copy, you redefine the same view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lakehouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same name, same columns, same downstream views, new physical home. The next query against any view built on &lt;code&gt;raw.orders&lt;/code&gt; resolves to the Iceberg table. If validation later turns up a discrepancy, rollback is the same one statement pointed back at PostgreSQL. Compare that to a traditional cutover, where rollback means a war room.&lt;/p&gt;
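
&lt;p&gt;The validation step can itself be a federated query. A minimal sketch, using the two table paths from the view definitions above, compares a row count and a simple amount total. A real check would likely add per-day or per-key comparisons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per system; the counts and totals should match before the swap.
SELECT 'postgres' AS source_system,
       COUNT(*) AS row_count,
       SUM(amount) AS total_amount
FROM postgres_prod.public.orders
UNION ALL
SELECT 'iceberg' AS source_system,
       COUNT(*) AS row_count,
       SUM(amount) AS total_amount
FROM lakehouse.sales.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;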

&lt;p&gt;During the transition, federation also means you're never stuck half-migrated. A query can join the already-migrated &lt;code&gt;lakehouse.sales.orders&lt;/code&gt; Iceberg table against a &lt;code&gt;payments&lt;/code&gt; table still in PostgreSQL, and it works exactly like a join between two Iceberg tables. The mixed state that kills most migrations is just another Tuesday for a federated engine.&lt;/p&gt;
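
&lt;p&gt;A sketch of such a mixed-state join follows. The &lt;code&gt;postgres_prod.public.payments&lt;/code&gt; path and its &lt;code&gt;order_id&lt;/code&gt; and &lt;code&gt;paid_amount&lt;/code&gt; columns are assumptions made for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Orders already live in Iceberg; payments are still in PostgreSQL.
-- payments.order_id and payments.paid_amount are assumed column names.
SELECT o.order_id, o.amount, p.paid_amount
FROM lakehouse.sales.orders AS o
JOIN postgres_prod.public.payments AS p
  ON p.order_id = o.order_id
WHERE o.order_date &amp;gt;= DATE '2026-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;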

&lt;p&gt;Reflections make this migration phase faster than it has any right to be. A Reflection is a precomputed, optimized materialization that Dremio's optimizer substitutes into queries automatically, with no SQL changes from the user. Here's the detail most people miss: Dremio stores Reflections as Apache Iceberg tables on your data lake, even when the anchor dataset is a federated source like PostgreSQL or MongoDB. So during migration, a Reflection on a slow legacy source gives your users Iceberg-backed performance before you've migrated a single byte of that source. Dremio uses Iceberg to speed up your non-Iceberg data. The rest of the industry uses proprietary formats to speed up Iceberg. That inversion tells you a lot about where Dremio's focus sits.&lt;/p&gt;

&lt;p&gt;There's a useful side effect, too. Those Reflections are themselves Iceberg tables built from your legacy sources, which means your acceleration layer doubles as a dress rehearsal for the migration. You learn how your data behaves in Iceberg while the source of truth is still the old system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h1s3vnyq1ni4vemg5v1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h1s3vnyq1ni4vemg5v1.png" alt="Apache Iceberg Migration with Dremio" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 2: Managing the Lakehouse So It Doesn't Manage You
&lt;/h2&gt;

&lt;p&gt;Migration gets you to Iceberg. Staying fast on Iceberg is a different job, and historically it's been a thankless one. Tables fragment into small files as writes accumulate. Partition layouts drift away from query patterns. Snapshots and orphan files inflate storage. And acceleration turns into a part-time career: deciding which materialized views to build, scheduling their refreshes, rewriting queries to hit them, and tearing them down when workloads shift.&lt;/p&gt;

&lt;p&gt;Dremio's answer is to make the lakehouse autonomous. The platform watches activity through its Active Metadata system, which continuously analyzes query patterns, data relationships, and usage trends, and then it acts on what it learns without waiting for a human.&lt;/p&gt;

&lt;p&gt;On the storage side, Dremio runs Automated Table Optimization for Iceberg tables in its Open Catalog: compaction to merge small files into well-sized ones, clustering to physically reorganize data layouts around real access patterns, and vacuum to expire old snapshots and remove orphan files. These run as background maintenance jobs. You don't size them, and you don't get paged when a streaming table quietly accumulates 40,000 tiny files, because Dremio already merged them.&lt;/p&gt;
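
&lt;p&gt;For a sense of what those background jobs replace, Dremio's SQL dialect also exposes manual maintenance commands. The sketch below shows their rough shape against the &lt;code&gt;lakehouse.sales.orders&lt;/code&gt; table from the migration example. The timestamp and retention count are placeholders, and the exact option spelling is an assumption to check against the SQL reference for your version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compaction: rewrite small data files into larger ones.
OPTIMIZE TABLE lakehouse.sales.orders;

-- Vacuum: expire snapshots older than a cutoff, keeping the most recent few.
VACUUM TABLE lakehouse.sales.orders
  EXPIRE SNAPSHOTS older_than '2026-06-01 00:00:00.000' retain_last 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;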

&lt;p&gt;On the acceleration side, the Reflections you used during migration get a serious upgrade once your data is in Iceberg:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous Reflections&lt;/strong&gt; remove the design work entirely. Dremio analyzes your query workload over a rolling seven-day window, figures out which materializations would help, then creates, refreshes, and drops Reflections on its own. It targets queries that take at least a second and skips ones already served by cache, so it spends compute exactly where users feel pain. No one on your team decides what to materialize anymore. The platform does, and it revises that decision as workloads change, which suits a world where agent query patterns shift faster than manual acceleration can keep up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live Reflections&lt;/strong&gt; kill the staleness problem. Because Iceberg exposes table changes through snapshots, Dremio detects when an anchor table changes (polling as often as every 10 seconds) and triggers a refresh immediately. Scheduled refreshes against unchanged data get recognized as redundant and skipped, so you stop burning compute to rebuild things that didn't change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental Refresh&lt;/strong&gt; makes those updates cheap. Dremio reads Iceberg's snapshot metadata to identify exactly which records were added, modified, or deleted since the last refresh, and processes only that delta instead of rebuilding the whole materialization. On a 10-billion-row table where last night's load touched 0.2% of rows (about 20 million), that's the difference between minutes and hours of compute.&lt;/p&gt;

&lt;p&gt;Then there's the caching stack underneath. The query plan cache stores the physical plan of executed queries, so repeated queries (the lifeblood of BI dashboards) skip compilation and go straight to execution. The results cache goes further: deterministic queries on unchanged Iceberg data return prior results instantly, spooled as Arrow files to distributed storage and shared across coordinators and clients, whether the query arrives over the console, JDBC, ODBC, REST, or Arrow Flight. And the Columnar Cloud Cache (C3) keeps frequently accessed columnar data on local NVMe at the executor nodes, cutting up to 90% of object storage I/O costs and turning cloud-storage latency into local-disk speed.&lt;/p&gt;

&lt;p&gt;Stack it up and the operational picture changes shape. Compaction, clustering, vacuum, materialization design, refresh scheduling, and cache management all move from your team's backlog to the platform's job description. Your engineers stop juggling materialized views and start shipping data products. Dremio's claim of 10x data engineering productivity is aggressive, but the mechanism behind it is concrete: the platform absorbed an entire category of recurring work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd1oyjnqkaj51wapdin0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd1oyjnqkaj51wapdin0.png" alt="Dremio's claim of 10x data engineering productivity is aggressive, but the mechanism behind it is concrete: the platform absorbed an entire category of recurring work." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Warehouse Speed on Iceberg, Because Dremio Is Iceberg-Native
&lt;/h2&gt;

&lt;p&gt;A reasonable skeptic asks: can an engine reading open files on object storage really match a warehouse that controls its own proprietary format? With Dremio the answer is yes, and the reason is architectural rather than a bag of tricks. Apache Iceberg is the engine's first-class format. Dremio is Iceberg-native top to bottom.&lt;/p&gt;

&lt;p&gt;That phrase gets thrown around loosely, so let's be precise about what it means here. Most platforms bolted Iceberg support onto an engine designed for something else. They read Iceberg by converting it, mirroring it, or treating it as an external table with reduced features, and you pay a performance tax at the boundary. Dremio took the opposite path. Its query engine reads Iceberg's metadata tree directly for planning, prunes partitions and files from Iceberg statistics before touching any data, executes on Apache Arrow's columnar in-memory format (which Dremio's founders and founding engineers co-created, along with projects like Apache Drill, Apache Parquet, and Apache Calcite) with LLVM code generation, and writes its own acceleration structures, the Reflections, as Iceberg tables. There is no translation layer because there's nothing to translate. Iceberg in, Arrow through, Iceberg out.&lt;/p&gt;

&lt;p&gt;The numbers Dremio puts behind this: 20x performance on Iceberg tables at the lowest cost, up to 100x faster queries with Reflections, and sub-second response for interactive workloads. Shell processes 6 to 8 billion records in minutes for production forecasting on this stack, with more than 100 concurrent forecasting models running at enterprise scale.&lt;/p&gt;

&lt;p&gt;The strategic point matters more than any single benchmark. Because Dremio's speed comes from Iceberg plus Arrow plus caching rather than from a proprietary format, every performance investment you make stays portable. Your fast tables are still just Iceberg tables that Spark, Flink, or any future engine can read. You never face the choice between performance and openness, which is exactly the choice proprietary-first platforms are designed to force.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: Building AI Agents With Solid Business Meaning
&lt;/h2&gt;

&lt;p&gt;Performance and migration are solvable engineering problems. The harder blocker for agentic analytics is meaning. An LLM agent handed raw schema names will guess, and it will guess confidently. Ask it for "monthly active customers" against undocumented tables and you'll get an answer. You just won't get the same answer twice, and neither will the agent your finance team runs.&lt;/p&gt;

&lt;p&gt;The fix is a semantic layer: governed views, documented definitions, consistent metrics, lineage, and business vocabulary that both humans and agents read from the same place. And here's where the typical buying pattern goes wrong. Teams assemble a catalog from one vendor, a semantic layer from another, an agent framework from a third, then spend two quarters writing glue code so the agent can actually use the other two. Every integration is a seam where context leaks and governance breaks.&lt;/p&gt;

&lt;p&gt;Dremio's position is that none of that should be a separate purchase. The AI Semantic Layer, the AI Agent, the MCP server, and the CLI are all parts of the same platform, sharing the same definitions and the same access controls.&lt;/p&gt;

&lt;p&gt;Start with the AI Semantic Layer itself. It's virtual, built from SQL views rather than copies, which means it spans every source Dremio federates. That's worth pausing on, because it breaks a boundary every other semantic layer respects. A semantic layer tied to one warehouse can only give meaning to data inside that warehouse. Dremio's semantic layer gives one consistent set of definitions across your warehouse, your operational databases, your Iceberg lakehouse, and your object storage at the same time. "Monthly Revenue" means one thing whether the underlying bytes sit in Snowflake, PostgreSQL, or an Iceberg table on S3. Wikis document datasets and columns. Labels group related objects. Lineage tracks how every view derives from its sources. And Dremio uses generative AI to help maintain all of it, sampling tables to draft wiki descriptions and labels so the catalog becomes a living encyclopedia for the business rather than a documentation graveyard.&lt;/p&gt;
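
&lt;p&gt;As a minimal sketch of what one governed definition can look like, the view below pins down "monthly active customers" once, so every human and agent reads the same logic. The definition itself (distinct customers with at least one order in a calendar month) and the &lt;code&gt;business.monthly_active_customers&lt;/code&gt; name are assumptions for illustration, built on the &lt;code&gt;raw.orders&lt;/code&gt; view from the migration example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Assumed definition: a customer is active in a month if they placed
-- at least one order in that calendar month.
CREATE OR REPLACE VIEW business.monthly_active_customers AS
SELECT DATE_TRUNC('MONTH', order_date) AS activity_month,
       COUNT(DISTINCT customer_id) AS active_customers
FROM raw.orders
GROUP BY DATE_TRUNC('MONTH', order_date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;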

&lt;p&gt;On top of that context sits the embedded Dremio AI Agent, built into the console and ready out of the box. It's a conversational interface that does real analytical work: it runs semantic search across the catalog (names, wikis, labels, metadata) to find the right datasets, writes and executes SQL grounded in the semantic layer's definitions, generates visualizations you can catalog and revisit, detects patterns and returns narrative insights alongside the charts, explains and optimizes existing SQL, and diagnoses slow jobs. It also helps with the unglamorous work that makes data teams effective: drafting documentation for datasets and working out the SQL to capture the data models you describe in plain language. Every action respects the privileges of the logged-in user, every tool call is auditable in the chat window, and none of it required you to integrate anything.&lt;/p&gt;

&lt;p&gt;The same capabilities extend to agents you build or already use. The Dremio MCP Server exposes the platform through the Model Context Protocol, the open standard for connecting LLMs to tools. Each Dremio Cloud project includes its own built-in MCP server, so Claude, ChatGPT, Gemini, LangChain agents, or your custom agentic application can discover datasets, search the semantic layer for context, fetch schemas, and run governed SQL through tools like RunSqlQuery, GetSchemaOfTable, and RunSemanticSearch. You don't host a connector or design custom tooling. The agent inherits the user's identity and access controls automatically, so an agent can never see data its human couldn't.&lt;/p&gt;

&lt;p&gt;For locally running and terminal-based agents, there's the Dremio CLI, an AI-agent-first command line interface built for coding agents like Claude Code and Codex, and equally at home with local agent runtimes like Claude Cowork, OpenClaw, or Hermes. The CLI covers queries, catalog operations, schemas, Reflections, jobs, and access management, with input validation designed for the reality that an AI will be constructing the commands. Pair it with Dremio's published agent skills and your coding agent becomes a competent lakehouse operator in an afternoon.&lt;/p&gt;

&lt;p&gt;Two more pieces complete the agentic picture, and both are easy to underestimate until an agent program scales.&lt;/p&gt;

&lt;p&gt;The first is governance that travels with the agent. Every path into Dremio (the embedded agent, MCP, the CLI, plain SQL) enforces the same fine-grained and role-based access controls, with OAuth tokens flowing through credential vending all the way to the underlying sources. An agent acting for a regional manager sees that region's rows and nothing else, not because someone wrote agent-specific policy, but because the agent literally is that user from the platform's perspective. When the compliance team asks how you govern AI access to customer data, the answer is one sentence: the same way you govern human access, in the same system, with audit trails on every query.&lt;/p&gt;

&lt;p&gt;The second is performance under agent-scale load. Agents probe. They run schema discovery, sample data, try a query, refine it, and try again, generating a long tail of similar-but-not-identical queries that would flatten a manually tuned acceleration layer. This is precisely the workload Autonomous Reflections were built for: Dremio's analysis targets clusters of similar queries with slight variations, exactly the shape agent traffic takes, and the results cache absorbs the identical repeats. Sub-second answers aren't a luxury for agents. An agent that waits 40 seconds per query takes minutes per reasoning loop, and the experience dies. The acceleration stack from Problem 2 is what makes the agent experience from Problem 3 feel instant.&lt;/p&gt;

&lt;p&gt;Now connect this back to the migration story, because this is the part that changes project plans. Dremio's semantic layer abstracts where data is physically stored. The AI Agent, the MCP server, and the CLI all operate on the semantic layer, not on storage. Which means agentic analytics works on day one, against your federated sources, before you've migrated anything to Iceberg. Your agents answer questions over data still sitting in PostgreSQL and Snowflake using the same governed definitions they'll use after the move. The Iceberg migration stops being a prerequisite for agentic analytics and becomes a performance and cost upgrade that happens underneath agents already in production. Most platforms make you finish the boring project before starting the exciting one. Dremio lets you run them in parallel, and the early agent wins are usually what get the migration funded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdgym84n9c6j2h5akvy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdgym84n9c6j2h5akvy7.png" alt="Dremio's Agentic Analytics Feature Set" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 4: Unstructured Data Without a Separate OCR Pipeline
&lt;/h2&gt;

&lt;p&gt;Somewhere in your object storage right now there's a folder of PDFs that matters more than half your tables. Invoices. Contracts. Inspection reports. Resumes. Support transcripts. Industry estimates put 80 to 90% of enterprise data in unstructured form, and almost none of it participates in analytics, because getting it into rows traditionally requires a separate extraction stack: OCR services, document parsers, orchestration, error handling, and a pipeline team to keep it all running.&lt;/p&gt;

&lt;p&gt;Dremio's answer is to make documents queryable with SQL. The platform embeds LLM calls directly into the engine as AI Functions: AI_GENERATE, AI_CLASSIFY, AI_COMPLETE, and the table function LIST_FILES. No Python service, no external orchestration, no data leaving your governed environment.&lt;/p&gt;

&lt;p&gt;LIST_FILES is the bridge. Point it at a directory in connected storage (S3, ADLS, GCS) and it returns the files as rows, each with metadata plus a &lt;code&gt;file&lt;/code&gt; struct you can hand to the other functions. It handles PDFs, images, Word documents, text files, and scanned documents through multimodal vision models. AI_GENERATE then extracts whatever you ask for, and its &lt;code&gt;WITH SCHEMA&lt;/code&gt; clause forces the LLM to return typed, named fields rather than a blob of prose.&lt;/p&gt;

&lt;p&gt;Put them together and an extraction pipeline collapses into one statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'path'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;source_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;invoice_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;invoice_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;invoice_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AI_GENERATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Extract vendor name, invoice number, and total amount from this invoice.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;invoice_number&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;invoice_data&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIST_FILES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'@company_s3/invoices/2025'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'path'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.pdf'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read what that statement actually does. It scans a folder of invoice PDFs in S3, extracts three typed fields from each document, and materializes the results as a governed Apache Iceberg table. The documents become rows. The rows become part of the semantic layer. The semantic layer feeds your agents and dashboards. A workload that used to mean standing up a document-processing service now ships in a SQL Runner tab before lunch.&lt;/p&gt;

&lt;p&gt;The other functions round out the toolkit. AI_CLASSIFY constrains the model to one value from a list you supply, which makes it reliable for sentiment labeling, document triage, and routing. AI_COMPLETE handles free-form generation like summaries and descriptions. Model providers are pluggable (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, or Dremio's hosted model), and neither Dremio nor the providers train on your data.&lt;/p&gt;
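
&lt;p&gt;A hedged sketch of the classification case follows. The &lt;code&gt;support.tickets&lt;/code&gt; table, its columns, and the label list are hypothetical, and the argument order shown (a prompt followed by an array of allowed labels) is an assumption to confirm against the AI Functions reference before relying on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Label each ticket with exactly one value from a fixed list.
-- Assumed signature: AI_CLASSIFY(prompt, array_of_labels).
SELECT ticket_id,
       AI_CLASSIFY(
         'Classify the sentiment of this support ticket: ' || body,
         ARRAY['positive', 'neutral', 'negative']
       ) AS sentiment
FROM support.tickets;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;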

&lt;p&gt;A few production habits make this scale well. Materialize extraction results with CTAS so you pay for each LLM call once instead of on every dashboard refresh. Layer Reflections on the output tables so downstream queries run at interactive speed with zero additional LLM cost. And use workload management rules to route AI-function queries to a dedicated engine so a big extraction job never slows your BI traffic. All three are configuration, not architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawaleipt8qiu71rl58xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawaleipt8qiu71rl58xg.png" alt="Dremio working with Unstructured Data" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  One Platform Instead of a Stack of Point Products
&lt;/h2&gt;

&lt;p&gt;Step back and notice what you didn't have to buy in any of the four solutions above.&lt;/p&gt;

&lt;p&gt;You didn't buy a separate virtualization product for migration, then a separate semantic layer to give the data meaning, then a separate catalog to govern Iceberg, then a separate table-maintenance service, then an agent framework, then a text-to-SQL vendor, then a document AI platform. That seven-product stack is a real architecture being sold to real companies right now, and every seam in it is a place where definitions drift, permissions diverge, and projects die in integration.&lt;/p&gt;

&lt;p&gt;Dremio bundles the whole path into one platform with one security model. The federated query engine, the Iceberg-native lakehouse with its Open Catalog powered by Apache Polaris, the AI Semantic Layer, the embedded AI Agent, the MCP server, the CLI, and the AI Functions all share the same views, the same wikis and labels, and the same fine-grained access controls. When the agent answers a question, it's reading the same governed definition your BI dashboard reads. When AI_GENERATE writes an Iceberg table, that table lands in the same catalog the rest of your data lives in, with the same lineage and the same permissions.&lt;/p&gt;

&lt;p&gt;The consolidation shows up on the invoice, too. Every point product in that stack carries its own license, its own infrastructure, and its own specialist headcount, and the integration work between them is paid for in engineering quarters. A single platform on open storage flips the cost structure: one Iceberg copy on object storage instead of duplicated marts, C3 trimming up to 90% of I/O costs, autonomous features replacing manual tuning labor, and a 99.97% uptime SLA on the managed service so reliability isn't another thing your team builds. Lowest cost is part of Dremio's stated value proposition, and the architecture is why the claim holds: you're not paying anyone to store your data twice or to glue your own products together.&lt;/p&gt;

&lt;p&gt;There's also a credibility angle that matters for anything built on open standards. Dremio co-created Apache Arrow and Apache Polaris and is a key contributor to Apache Iceberg. The claim "the only lakehouse built natively on Apache Iceberg, Polaris, and Arrow" isn't marketing applied after the fact. The company helped write the standards the platform runs on, which is the strongest assurance you can get that "open" won't quietly become "open, but" three renewals from now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value, End to End
&lt;/h2&gt;

&lt;p&gt;Run the whole arc back through the lens of a team that starts today.&lt;/p&gt;

&lt;p&gt;Day one, you connect Dremio to your existing sources and build views. Analysts get one SQL interface across everything, and the built-in AI Agent starts answering natural-language questions against data that hasn't moved an inch. Agentic analytics is live before any migration begins, because the semantic layer abstracts storage.&lt;/p&gt;

&lt;p&gt;Over the following months, you migrate dataset by dataset to Apache Iceberg using the view swap pattern. You update a view definition, every downstream query silently shifts to Iceberg, and no consumer experiences downtime. Reflections (stored as Iceberg tables even for legacy sources) keep everything fast through the transition.&lt;/p&gt;

&lt;p&gt;As tables land in Iceberg, the platform takes over the operations work. Automated compaction, clustering, and vacuum keep storage healthy. Autonomous Reflections design and manage your acceleration layer from observed query patterns. Live and incremental refresh keep materializations current for pennies. The plan cache, results cache, and C3 squeeze latency and I/O cost out of every repeated workload. Your engine runs Iceberg as its first-class format, so you get up to 20x performance on Iceberg tables and up to 100x with Reflections without surrendering openness, and any other Iceberg engine can still read every table.&lt;/p&gt;

&lt;p&gt;Meanwhile your agents multiply. The embedded AI Agent serves analysts in the console. The MCP server plugs Claude, ChatGPT, Gemini, and your custom applications into the same governed context. The CLI puts the lakehouse in reach of coding agents and local runtimes. And AI Functions keep folding the unstructured world (the PDFs, the scans, the contracts) into Iceberg tables those agents can query.&lt;/p&gt;

&lt;p&gt;That's the agentic lakehouse: open Iceberg storage you own, a platform that manages itself, and AI agents with real business meaning, reached incrementally instead of through a leap of faith. Each of the four classic blockers (migration risk, maintenance burden, missing context, unstructured data) turns out to be a feature of fragmented architectures rather than a law of nature. Put the engine, the lakehouse, and the agent layer in one platform and the blockers mostly dissolve.&lt;/p&gt;

&lt;p&gt;The honest caveat is that no platform removes the need for judgment. You still decide what your business metrics mean, which datasets deserve curation first, and where federation should give way to migrated Iceberg storage for heavy workloads. What Dremio removes is everything between those decisions and their execution.&lt;/p&gt;

&lt;p&gt;Here's a concrete way to test the argument. Connect a database and an S3 bucket, build one view that joins them, and ask the AI Agent a business question about the result. That single exercise demonstrates federation, the semantic layer, and agentic analytics in under an hour, on data you haven't migrated. If you want to see what an Apache Iceberg lakehouse with built-in agentic analytics feels like before committing to a migration plan, start a free Dremio trial at &lt;a href="https://dremio.com/get-started" rel="noopener noreferrer"&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If the test holds up, the rollout sequence writes itself. Curate wikis and labels on your ten most-asked-about datasets first, because an hour spent curating semantic context is the most valuable hour an agent program can spend. Hand the MCP connection to one team that already lives in Claude or ChatGPT and let their usage teach you what context is missing. Pick the slowest, most expensive workload in your warehouse as the first view-swap migration candidate, since that's where Iceberg plus Reflections pays back fastest. Then let Autonomous Reflections and Automated Table Optimization run for two weeks and compare your engineering backlog before and after. Each step is reversible, each one delivers value on its own, and none of them requires the others to finish first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nqmbtn5b5s7uhmr15f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nqmbtn5b5s7uhmr15f7.png" alt="Dremio end-to-end" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.dremio.com/blog/autonomous-reflections-technical-blog/" rel="noopener noreferrer"&gt;Autonomous Reflections technical deep dive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dremio.com/blog/query-results-caching-on-iceberg-tables/" rel="noopener noreferrer"&gt;Query results caching on Iceberg tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dremio.com/blog/using-dremios-mcp-server-with-agentic-ai-frameworks/" rel="noopener noreferrer"&gt;Dremio MCP server and agentic frameworks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dremio.com/blog/agentic-analytics-semantic-layer/" rel="noopener noreferrer"&gt;How Dremio's semantic layer powers agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;Apache Polaris project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>analytics</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: May 28 - June 4, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:44:30 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-may-28-june-4-2026-3df3</link>
      <guid>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-may-28-june-4-2026-3df3</guid>
      <description>&lt;p&gt;The lakehouse projects spent this week doing two things at once. They lined up a remarkable stack of releases, with DataFusion 54.0.0 in active vote, Polaris 1.6.0 scheduled, Parquet weighing both a format release and a Java release, Arrow preparing Java 20.0.0, and Iceberg's C++ and Rust implementations both planning their next versions. At the same time, the foundation's own infrastructure pushed its way onto two dev lists, as the ASF warned both Iceberg and DataFusion that the shared pool of GitHub-hosted CI runners is running out of headroom. Add a wave of post-1.11 spec design in Iceberg, a major proposal landing in Polaris, and AI agents quietly showing up inside project workflows, and you get one of the busiest weeks of the quarter. This issue also marks a milestone for the newsletter itself: Apache DataFusion joins the regular rotation alongside Iceberg, Polaris, Arrow, and Parquet. The query engine layer has become too central to the lakehouse story to cover only in passing, and as this week makes clear, the DataFusion dev list moves fast enough to earn its own section every week. As always, every claim below links to the source thread on lists.apache.org, so you can follow any discussion straight to the people having it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;The defining story on the Iceberg list right now is what happens after a major release ships. With 1.11.0 out the door in mid-May, the community pivoted hard into spec design, and the column updates work is the center of gravity. Anurag Mantripragada's thread on the &lt;a href="https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0" rel="noopener noreferrer"&gt;column update file representation&lt;/a&gt; drew eight participants and eighteen replies, making it the most active design discussion of the week. The work follows the column updates design document and the original discussion thread, and it tackles the question of how engines write a change to one column without rewriting everything around it. Gábor Kaszab then opened a focused companion thread on the &lt;a href="https://lists.apache.org/thread/7jryw9dfvc02s411twn4o7s5gjrybfxg" rel="noopener noreferrer"&gt;column update metadata representation&lt;/a&gt;, deliberately splitting the metadata question from the file question so each can move at its own pace. Together the two threads show the community doing spec work the right way, with separate surfaces for separate decisions and named owners for each.&lt;/p&gt;

&lt;p&gt;For practitioners, the stakes of this work are easy to state. Today, updating a single column value in Iceberg means writing deletes plus new data files, and engines pay for that in write amplification, especially on wide tables where one changed column drags hundreds of untouched ones through a rewrite. A native column update representation changes the cost model for slowly changing dimensions, privacy corrections, and the feature-backfill patterns machine learning teams run constantly. Getting the file format and the metadata format right, separately and deliberately, is how a change this deep lands without breaking the ecosystem of readers.&lt;/p&gt;
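
&lt;p&gt;A back-of-the-envelope calculation shows why the cost model matters on wide tables. The sketch below is illustrative arithmetic only, not a measurement of any engine: the function name bytes_written and every number in it are hypothetical, and it assumes equal-sized values while ignoring delete files, compression, and metadata overhead.&lt;/p&gt;

```python
def bytes_written(rows, columns, bytes_per_value, changed_columns):
    """Compare bytes written by a full-row rewrite and a column-only write.

    Simplifying assumptions: every value has the same size, and delete files,
    compression, and metadata overhead are ignored.
    """
    full_rewrite = rows * columns * bytes_per_value
    column_only = rows * changed_columns * bytes_per_value
    return full_rewrite, column_only


# Hypothetical wide table: 1 million affected rows, 300 columns, 8-byte values,
# and a change that touches exactly one column.
full, narrow = bytes_written(1_000_000, 300, 8, 1)

# Ratio of bytes rewritten today to bytes that actually changed.
amplification = full // narrow
```

&lt;p&gt;Under these assumptions the ratio equals the column count, which is why the penalty grows with table width rather than with the size of the change.&lt;/p&gt;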

&lt;p&gt;The design wave did not stop there. Xiening Dai opened a discussion on &lt;a href="https://lists.apache.org/thread/7n2rk76tmz9c9596l7pv0cr91c1kojwm" rel="noopener noreferrer"&gt;global snapshot consistency for Iceberg tables&lt;/a&gt;, starting from the isolation levels the spec defines today through the write.delete, write.update, and write.merge isolation-level properties, which accept either snapshot or serializable. His thread pushes the conversation beyond a single table's guarantees, and it is worth watching as engines lean harder on Iceberg for transactional workloads. Ankit Kumar opened a thread on &lt;a href="https://lists.apache.org/thread/781cyhoq5x7rc1k9074pdcf873bjofog" rel="noopener noreferrer"&gt;efficient CDC upserts&lt;/a&gt;, linking back to the existing CDC design document and the prior discussion thread. Change data capture keeps resurfacing on this list because it sits at the junction of everything else, touching row-level deletes, snapshots, and engine integration all at once. And Shekhar Rajak raised a precise spec question on &lt;a href="https://lists.apache.org/thread/jncwontk4xkmt7n5ml0pbgk4x54cwzgo" rel="noopener noreferrer"&gt;Avro encoding for non-zone timestamp types&lt;/a&gt;, tied to PR #16577 and issue #12751, about how timestamp and timestamp_ns values carry the adjust-to-utc=false property in Avro.&lt;/p&gt;

&lt;p&gt;The REST catalog spec kept moving too. Huaxin Gao &lt;a href="https://lists.apache.org/thread/63xpzdp10nd3wgv43z2phz5o9zwwx818" rel="noopener noreferrer"&gt;bumped the vote on adding list and load function endpoints&lt;/a&gt; to the REST spec, reporting that the FunctionIdentifier versus CatalogObjectIdentifier debate is now resolved. The merge of PR #16144 settled on CatalogObjectIdentifier, clearing the naming question that had stalled the vote. Functions in the REST catalog continue the pattern we saw through May, where the REST surface keeps absorbing capabilities that used to live engine-side.&lt;/p&gt;

&lt;p&gt;A second theme this week was engine version policy, and it shows the cost of living downstream of fast-moving compute engines. Anurag Mantripragada opened a discussion on &lt;a href="https://lists.apache.org/thread/6kmh92wl6qkw08dpgv04bl51v590phbl" rel="noopener noreferrer"&gt;Spark versioning strategy with accelerated Spark releases&lt;/a&gt;. Spark 3.4 support is now removed following the 1.11 release, and the Spark community is proposing a faster release cadence, which forces Iceberg to decide how many Spark versions it can carry at once. Steven Wu opened the matching conversation for &lt;a href="https://lists.apache.org/thread/kd183vz2v2y69v4kwbz5wbjfxvx3gf1f" rel="noopener noreferrer"&gt;Flink version support after Iceberg 1.11.0&lt;/a&gt;, anchored on PR #16517, and it drew eight participants and ten replies. The shape of both threads is the same. Every engine version Iceberg supports costs CI time, reviewer attention, and release testing, and the budget for all three is finite. The tradeoff is the classic one. Support fewer engine versions and the project ships faster, tests less, and strands users on older clusters. Support more and the matrix swallows CI minutes and reviewer hours. Both threads are converging on explicit written policy rather than case-by-case calls, which is the right instinct, because a published support window lets platform teams plan upgrades instead of discovering them in release notes.&lt;/p&gt;

&lt;p&gt;That budget question turned literal in the most active community thread of the week. Robert Thomson wrote to the Iceberg PMC about the project's &lt;a href="https://lists.apache.org/thread/9gorr3b1c18f8yk2fys16knjmnrbkjff" rel="noopener noreferrer"&gt;consumption of ASF shared GitHub-hosted runners&lt;/a&gt;, noting that the foundation introduced its GitHub Actions usage policy in 2024 and that the shared runner pool has been at or near its limit. Twelve participants and fourteen replies later, this is a real planning problem, not a courtesy notice. Iceberg runs one of the largest CI matrices in the data space, across Java, Python, Rust, C++, and Go, and the foundation's capacity ceiling now sits underneath all of it. The likely outcomes are the ones other large projects have already reached for: trimming redundant jobs, gating expensive suites behind labels, leaning on self-hosted or donated runners, and being more deliberate about which platforms get tested on every commit. None of those choices is free, and the fourteen replies show contributors weighing each against the project's reliability bar.&lt;/p&gt;

&lt;p&gt;The language subprojects had a productive week of their own. Junwang Zhao started the &lt;a href="https://lists.apache.org/thread/3vdtx3m4xbcw5htj246yhjt6wrc5rjo8" rel="noopener noreferrer"&gt;release discussion for Apache Iceberg C++ 0.3.0&lt;/a&gt;, working from the roadmap in iceberg-cpp issue #523 and noting that not every roadmap item needs to block the release. The thread gathered four participants and eight replies, and it follows directly from the 0.3.0 conversation that started in late May. On the Rust side, Danny Jones opened the &lt;a href="https://lists.apache.org/thread/y5bvps8cmjv00kntbq0bwd2xrttjo6nt" rel="noopener noreferrer"&gt;tracking issue for iceberg-rust v0.10.0&lt;/a&gt; as an action item from the Rust community sync, pointing contributors at issue #2527 as an open invitation. Jordan Epstein raised the harder structural question in a thread on &lt;a href="https://lists.apache.org/thread/hb12b05y2rxv98641dwdvdjh58po0h8r" rel="noopener noreferrer"&gt;reviewer bandwidth in iceberg-rust&lt;/a&gt;, which drew six participants. The Rust implementation has more contributor energy than reviewer capacity right now, and the thread is an honest attempt to fix that imbalance before it becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;Two more threads round out the week. Noritaka Sekiyama proposed &lt;a href="https://lists.apache.org/thread/vn4gglocg2g40p69mfrrh86qzkn1rr4b" rel="noopener noreferrer"&gt;adding an OpenTelemetry MetricsReporter to iceberg-core&lt;/a&gt;, which exports ScanReport and CommitReport data to any OTLP-compatible backend. The proposal drew seven participants and eleven replies, a strong signal that observability is an underserved need. Iceberg ships built-in reporters today, but OTLP export plugs table metrics into the monitoring stacks teams already run. The idea also fits the broader pattern of the post-1.11 cycle, where the project is investing in operational maturity alongside spec features. Scan and commit metrics that flow to Prometheus, Datadog, or any OTLP backend turn table health from a quarterly audit into a live dashboard, and that is the kind of capability that makes platform teams comfortable betting on Iceberg for their most critical workloads. Samuel Pacheco Cantu asked about &lt;a href="https://lists.apache.org/thread/5tcmqx4fcz3zgd799bqj7krtjf6lrbkp" rel="noopener noreferrer"&gt;relative paths and location resolution&lt;/a&gt; for multi-region replication, where data files live in different storage locations depending on the region, and got six replies of practical guidance. And Kevin Liu, who topped the month's activity with 43 messages, raised a flag on &lt;a href="https://lists.apache.org/thread/vsxpk5glhf38nnm7yx8llh5p1wn5yc20" rel="noopener noreferrer"&gt;Iceberg Summit 2027&lt;/a&gt;. His event coordinator reports that most San Francisco conference venues are already booked for 2027, so the community needs to decide on timing and location now. Ten participants jumped in, which says something about how central the summit has become to the project's annual rhythm. 
Anyone who watched the 2026 summit's session recordings roll out over the past month knows the event now functions as the community's design checkpoint as much as its showcase, so a venue decision made this summer shapes the project calendar a full eighteen months out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;Polaris had the most concentrated burst of activity of any project this week, with 71 messages in the first four days of June alone. The headline is a proposal that has been months in the making. Jean-Baptiste Onofré published the &lt;a href="https://lists.apache.org/thread/c833wkc52v4rt7htnj5jh15y7ktvfjfd" rel="noopener noreferrer"&gt;Polaris Directories proposal&lt;/a&gt;, drafted as PR #4613, after a long arc of discussion that ran through earlier ideas like Table Sources. Directories give Polaris a way to reason about locations and the things that live in them, and the proposal arriving as a reviewable pull request rather than an external document is itself notable. JB explained why in a parallel update to the &lt;a href="https://lists.apache.org/thread/ryl6f9v98nj86gbdvz5nc884hjgcs2pb" rel="noopener noreferrer"&gt;proposal docs as markdown&lt;/a&gt; thread. Given recent advances in AI tools, he is experimenting with writing the Directories proposal as markdown in the repository, where both humans and AI tooling can read, diff, and review it. The process change and the proposal are shipping together as one experiment. It is worth dwelling on why the venue matters. A proposal in an external document lives outside the project's history, with comments that vanish and versions nobody can diff. A proposal as markdown in the repository gets pull request review, line-level comments, a permanent record, and now an audience of AI tools that read repositories natively. If the Directories review goes well, expect this to become the default for how Polaris designs in the open.&lt;/p&gt;

&lt;p&gt;The Polaris Console generated the longest thread of the week. JB is preparing the first release of the Console and asked the community whether the &lt;a href="https://lists.apache.org/thread/4pg0f79hhlobzrp810yx0s4cpxg47y6r" rel="noopener noreferrer"&gt;Console belongs in the main Polaris repository&lt;/a&gt;. Nineteen replies from seven participants worked through the tradeoffs, which mirror every monorepo debate you have ever seen, with release coupling and shared CI on one side and contributor focus on the other. The Console is clearly past the toy stage, because users are already filing real operational reports. Yong Zheng described how the &lt;a href="https://lists.apache.org/thread/v4jqvfdw0myxkgfzb01t2787nz7oyzbw" rel="noopener noreferrer"&gt;Console's single-page-app architecture makes the Kubernetes port-forward workflow fail silently&lt;/a&gt; when the server and console run in the same namespace behind nginx, and the thread collected four replies of diagnosis. Bug reports like this one are a healthy sign. People only find port-forward edge cases when they are actually deploying the thing.&lt;/p&gt;

&lt;p&gt;Release planning settled quickly. EJ Wang &lt;a href="https://lists.apache.org/thread/7gff85orxmyhnky8vh5b4gcv1vlb1f0w" rel="noopener noreferrer"&gt;volunteered as release manager for Apache Polaris 1.6.0&lt;/a&gt; and proposed targeting Friday, June 26. With 1.5.0 having shipped on May 18, that keeps Polaris on the steady monthly-ish cadence it has held all year, and nobody on the thread pushed back on the date. EJ also posted the &lt;a href="https://lists.apache.org/thread/7kp36t8fykg2hvzkr96r5m7cdnnyopj9" rel="noopener noreferrer"&gt;notes from the metrics architecture sync&lt;/a&gt; in the long-running thread on REST endpoints for table metrics and events, keeping that design moving between calls.&lt;/p&gt;

&lt;p&gt;The API design work this week clustered around correctness and interoperability. Huaxin Gao, active on both the Iceberg and Polaris lists this week, asked for wider review of the &lt;a href="https://lists.apache.org/thread/4c8xtj85hvj1w1mxtknj0gk6t09q7mqj" rel="noopener noreferrer"&gt;Idempotency-Key design, converging on Model B&lt;/a&gt;. The contract is simple to state and hard to implement: a retry with the same key must not produce additional side effects. Fifteen replies from five participants dug into the simplified design, and this is exactly the kind of plumbing that makes a catalog trustworthy under flaky networks and aggressive client retries. To see why it matters, picture a create-table call that times out at the client after succeeding on the server. Without idempotency keys, the retry fails with an already-exists error or, worse, triggers duplicate side effects downstream. With them, the catalog recognizes the repeated key and returns the original result. Agents make this urgent, because automated clients retry far more aggressively than humans do, and the word cloud on the Polaris list this month, where agentic sits right next to idempotency, suggests the community sees the same connection. Dennis Huo opened a discussion on &lt;a href="https://lists.apache.org/thread/bsh4m3hvob1z21l14rd81r602mmmw2qz" rel="noopener noreferrer"&gt;adding support for new Open Sharing APIs in Polaris&lt;/a&gt;, motivated by the data sharing use case that keeps coming up as enterprises consolidate lakehouses and catalogs and need to grant access across organizational boundaries. And Adam Szita's thread on &lt;a href="https://lists.apache.org/thread/fdcwd7bl7fopfxxsk0mx964sbcjwnmhn" rel="noopener noreferrer"&gt;Iceberg table encryption support&lt;/a&gt; stayed active into this week, now at six participants and seven replies. 
Iceberg 1.11 shipped the base table encryption implementation with KMS-based key wrapping, and this thread is working out the catalog's half of that story, since encrypted tables only work end to end when the catalog can manage keys. The thread deserves a close read from anyone running regulated workloads. The split of responsibilities is taking shape, with Iceberg defining how data, delete, manifest, and manifest-list files are encrypted and the catalog deciding how keys are issued, rotated, and scoped to principals. When this lands, an encrypted lakehouse stops being a design exercise and becomes a configuration choice, and Polaris is positioning itself as the place where that choice gets made.&lt;/p&gt;
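
&lt;p&gt;The idempotency contract discussed above is easier to reason about with a toy model. What follows is a minimal sketch, not the Polaris design or its API: ToyCatalog, create_table, and the idempotency_key parameter are hypothetical names invented for illustration. The only property it demonstrates is the one stated in the thread, that a retry with the same key returns the original result without additional side effects.&lt;/p&gt;

```python
class TableAlreadyExists(Exception):
    """Raised when a create call targets a name that is already taken."""


class ToyCatalog:
    """Hypothetical in-memory catalog, for illustration only (not Polaris)."""

    def __init__(self):
        self.tables = {}        # table name mapped to a metadata dict
        self.seen_keys = {}     # idempotency key mapped to the original response
        self.side_effects = 0   # counts real creations (events, audit rows, ...)

    def create_table(self, name, idempotency_key=None):
        # A repeated key short-circuits: return the stored response untouched.
        if idempotency_key is not None and idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        if name in self.tables:
            raise TableAlreadyExists(name)
        self.tables[name] = {"name": name, "version": 1}
        self.side_effects += 1
        response = {"status": "created", "table": name}
        if idempotency_key is not None:
            self.seen_keys[idempotency_key] = response
        return response


catalog = ToyCatalog()

# First attempt succeeds on the server, but imagine the client timed out.
first = catalog.create_table("sales.orders", idempotency_key="req-123")

# The retry carries the same key, so it gets the original result back.
retry = catalog.create_table("sales.orders", idempotency_key="req-123")

# Without a key, the same retry surfaces as an already-exists error.
try:
    catalog.create_table("sales.orders")
    unkeyed_retry_failed = False
except TableAlreadyExists:
    unkeyed_retry_failed = True
```

&lt;p&gt;A production design has harder questions to answer that this sketch ignores, such as how long keys are retained, what happens when the same key arrives with a different payload, and how concurrent retries are serialized.&lt;/p&gt;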

&lt;p&gt;Under the hood, the SPI work continued. Tornike Gurgenidze opened a focused discussion on &lt;a href="https://lists.apache.org/thread/mdtj1g7nvsq0txdf3gt9n1x2hpx39bld" rel="noopener noreferrer"&gt;storage credential-vending SPI changes&lt;/a&gt; attached to PR #3699, following the broader SPI-surface thread, and Dmitri Bourlatchkov &lt;a href="https://lists.apache.org/thread/97k198k5smsy491n6lns1p6yh9frlvbg" rel="noopener noreferrer"&gt;approved the PR&lt;/a&gt;, confirming it matches the direction the earlier discussions set. Credential vending is the mechanism that lets Polaris hand engines short-lived, scoped storage credentials, so getting this SPI right matters to every deployment. Robert Stupp restarted the &lt;a href="https://lists.apache.org/thread/sp0f0p9qgyfg8qzcq45v5hq2kwq2frvy" rel="noopener noreferrer"&gt;object-storage mock testing&lt;/a&gt; discussion to settle the test-infrastructure question explicitly after the PR review went in several directions at once. Dmitri, who led the month's activity with 52 messages, opened two more maintainability threads out of the community sync: one on the &lt;a href="https://lists.apache.org/thread/4bx31cfbcqfxzgpsddvc9kcfbn9l093y" rel="noopener noreferrer"&gt;future of the regtests code&lt;/a&gt; and one on &lt;a href="https://lists.apache.org/thread/qtmor9cwfmyojjg4hmn3l4msf63twco1" rel="noopener noreferrer"&gt;code organization for Spark 3.x and 4.x&lt;/a&gt;, after the Spark 4 support work in PR #4535 produced a substantial amount of copied code. Adnan Hemani also &lt;a href="https://lists.apache.org/thread/wonydo5hfpxsoym9m4ws1llz9rlshdtt" rel="noopener noreferrer"&gt;resurfaced the OpenLineage proposal&lt;/a&gt; in its own thread so reviewers can find it, and eleven replies suggest the lineage integration has real momentum.&lt;/p&gt;

&lt;p&gt;Step back and the Polaris picture is striking. In four days the project advanced a flagship proposal, debated its UI strategy, scheduled a release, designed idempotency semantics, opened a data sharing track, progressed encryption, and refactored its credential SPI. This is what a catalog community looks like when it is racing to match the pace of the format underneath it. It is also a reminder of how far Polaris has traveled in two years, from incubating project to the coordination point for encryption, lineage, sharing, idempotency, and a console, all moving in parallel under a monthly release rhythm. The catalog used to be the boring part of the stack. The threads above argue it has become the most interesting one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow closed out a donation and opened a debate about its own protocol surface. Sutou Kouhei, the month's most active poster with 16 messages, announced the &lt;a href="https://lists.apache.org/thread/875r822dqjxlrc90q7lgczmv3jbk9btd" rel="noopener noreferrer"&gt;result of the vote to donate Apache Arrow Erlang&lt;/a&gt;. The vote carried with four binding +1s from Sutou Kouhei, Curt Hagenlocher, Matt Topol, and David Li, with no zeros and no vetoes. The Erlang implementation is built on bindings to the Rust implementation, and the next step is formal IP clearance through a vote on the incubator general list. Once that completes, Arrow adds another language community to a roster that already spans most of the ecosystem, and the BEAM world gets first-class columnar data. The donation also says something about how Arrow grows now. New language communities arrive by wrapping the Rust implementation rather than reimplementing the columnar format from scratch, which keeps behavior consistent across languages and concentrates performance work in one codebase. Erlang and Elixir shops run some of the most demanding soft-real-time systems in production, and giving that platform zero-copy columnar data opens analytics patterns it has never had natively.&lt;/p&gt;

&lt;p&gt;The most interesting technical cluster of the week was Flight SQL, where three protocol changes moved in parallel. Tornike Gurgenidze, the same contributor driving the Polaris credential vending work, proposed &lt;a href="https://lists.apache.org/thread/31b23z92vmd5vpp9p9z17941v5lg90zd" rel="noopener noreferrer"&gt;adding four dialect-related SqlInfo codes to FlightSql.proto&lt;/a&gt;, including SQL_SUPPORTED_LIMIT_OFFSET at code 577. The motivation is practical. Clients that compile SQL for many different backends need dialect metadata the protocol does not expose today, and four small codes close real gaps. Meanwhile, the &lt;a href="https://lists.apache.org/thread/sg1d3hwt1hlgzgh16wzbkrb0pzgqsf3n" rel="noopener noreferrer"&gt;vote on adding an is_update field to ActionCreatePreparedStatementResult&lt;/a&gt; collected feedback from four participants, with Jean-Baptiste Onofré adding a non-binding +1 and suggesting the vote run one extra week to give more reviewers time. And Richie Black opened a &lt;a href="https://lists.apache.org/thread/7o09mtxs02h79vcbg7gv9fcdkjs5n3z6" rel="noopener noreferrer"&gt;vote on adding column default value support to JDBC connections through Arrow Flight&lt;/a&gt;, implemented in arrow-java PR #1139, which also touches the FlightSql contract. Three concurrent protocol refinements tell one story: Flight SQL is carrying enough production traffic that the gaps between it and traditional database connectivity are getting filled one field at a time.&lt;/p&gt;

&lt;p&gt;For readers newer to this corner of Arrow, SqlInfo is the mechanism a Flight SQL server uses to describe itself to clients, covering everything from supported SQL features to type behavior. A JDBC driver or a query tool reads those codes before it compiles SQL, so every missing code forces client-side guesswork. Dialect metadata is the difference between a tool that generates correct pagination syntax for each backend and one that ships per-database hacks. Small protocol changes like these are unglamorous, and they are exactly what turns a wire protocol into a platform.&lt;/p&gt;
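
&lt;p&gt;To illustrate the pagination point, here is a minimal sketch of a client choosing syntax from dialect metadata. It is an assumption-heavy illustration, not the Flight SQL API: the code number 577 comes from the proposal described above, but treating its value as a boolean is a guess, server_info is a plain dict standing in for a real GetSqlInfo response, and build_page_query and its fallback to an OFFSET/FETCH clause are hypothetical choices.&lt;/p&gt;

```python
# Code number taken from the proposal described above. Its value type is an
# assumption here (treated as a simple boolean).
SQL_SUPPORTED_LIMIT_OFFSET = 577


def build_page_query(base_sql, page_size, page_number, server_info):
    """Return a paginated query based on what the server reports.

    server_info is a plain dict standing in for a Flight SQL GetSqlInfo
    response (hypothetical shape: info code mapped to value).
    """
    offset = page_size * page_number
    supported = server_info.get(SQL_SUPPORTED_LIMIT_OFFSET)
    if supported is True:
        return f"{base_sql} LIMIT {page_size} OFFSET {offset}"
    if supported is False:
        # Assumed fallback: the OFFSET/FETCH spelling of the same request.
        return (f"{base_sql} OFFSET {offset} ROWS "
                f"FETCH NEXT {page_size} ROWS ONLY")
    # Code missing: the client cannot know, so it must guess or refuse.
    raise LookupError("server did not report pagination support")


query = "SELECT id FROM t ORDER BY id"
with_limit = build_page_query(query, 10, 2, {SQL_SUPPORTED_LIMIT_OFFSET: True})
with_fetch = build_page_query(query, 10, 2, {SQL_SUPPORTED_LIMIT_OFFSET: False})

# A server that omits the code forces the client into the error branch.
try:
    build_page_query(query, 10, 2, {})
    missing_code_raised = False
except LookupError:
    missing_code_raised = True
```

&lt;p&gt;The last branch is the situation clients are in today for any capability the protocol does not expose: with no code to read, the only options are a per-database special case or a refusal.&lt;/p&gt;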

&lt;p&gt;The format side picked up a fresh proposal as well. Florian R. Hölzlwimmer, following a suggestion from Rok Mihevc on GitHub, opened a discussion on &lt;a href="https://lists.apache.org/thread/tcv35l8o8d33n176kb3qv4y45obcgjbn" rel="noopener noreferrer"&gt;adding an arrow.range canonical extension type for bounded ranges&lt;/a&gt;. Arrow has no canonical representation for ranges today, and the thread drew five participants and six replies working through the design space. Canonical extension types are how Arrow grows its type system without touching the core spec, and ranges are a frequent request from the scientific and genomics communities where bounded intervals are everywhere.&lt;/p&gt;

&lt;p&gt;On the release front, Jean-Baptiste Onofré posted a &lt;a href="https://lists.apache.org/thread/0ohkoh2d69c74gsfsm90d9znmhj45lh0" rel="noopener noreferrer"&gt;heads up that Arrow Java 20.0.0 preparation is underway&lt;/a&gt;. He is triaging GitHub issues and invited anyone with candidates for inclusion in the release to speak up now. The word cloud on the list this month also shows the steady drumbeat of Rust point releases, with 56.2.1, 57.3.1, and 58.3.0 all in circulation.&lt;/p&gt;

&lt;p&gt;Then there is the thread that best captures where open source is heading. Wes McKinney, the project's co-creator, wrote in about the &lt;a href="https://lists.apache.org/thread/sfld6f1k8n3n3th36j97bvmj61pv6230" rel="noopener noreferrer"&gt;status of Arrow Conbench data and the Conbench OSS project&lt;/a&gt;. He noticed that conbench.ursa.dev has been down, said he needs continuous project benchmarks again, and said he is interested in picking up Conbench development, with his AI agents doing the work. Conbench is the continuous benchmarking framework the Arrow community built to catch performance regressions commit by commit, and a hosted instance going dark means the project loses one of its early-warning systems. For a library whose entire value proposition is speed, continuous benchmarks are not a nice-to-have, so reviving the tooling matters beyond nostalgia.&lt;/p&gt;

&lt;p&gt;Read the agent mention twice, though. The person who started Arrow is now describing agent-driven contribution as a casual aside in an infrastructure email. Combined with the auto Copilot review discussion Sutou Kouhei opened in late May, Arrow is becoming the clearest case study of an Apache project absorbing AI into its daily workflow.&lt;/p&gt;

&lt;p&gt;Community logistics filled out the week. Ian Cook announced &lt;a href="https://lists.apache.org/thread/f5ro9howvjyj9qqljf9hdth074zrrjd5" rel="noopener noreferrer"&gt;ADBC Office Hours on June 11&lt;/a&gt;, hosted with Columnar and featuring David Li, Curt Hagenlocher, and Felipe Oliveira Carvalho, and reminded everyone of the &lt;a href="https://lists.apache.org/thread/8vdyq8pchq02lyvp2b4mqnds6kdxh788" rel="noopener noreferrer"&gt;biweekly community meeting on June 3&lt;/a&gt;. Rich Bowen shared &lt;a href="https://lists.apache.org/thread/gvx0pfcvz58bl8xf07n89154byf7sqtl" rel="noopener noreferrer"&gt;next steps for the Community over Code Glasgow 2026 hackathon&lt;/a&gt;, confirming Arrow's participation in the October event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's week was about governance in the deepest sense, with the community asking how the format itself should version and evolve. Daniel Weeks opened the big one, a discussion on the &lt;a href="https://lists.apache.org/thread/st8l40z5n4cx5c22rcog50ws4pkzdc3s" rel="noopener noreferrer"&gt;future of Parquet versioning&lt;/a&gt; that he promised at a recent community sync. Twelve replies from seven participants worked through how the format signals capability to readers and writers, a question that has circled the project for years and gains urgency every time a new feature like geometry types or variant lands. Versioning sounds dry until you remember that every engine, every language implementation, and every stored file on earth has to agree on what a version number means. The hard part is that Parquet's installed base is effectively permanent. Files written a decade ago still get read every day, and no version scheme can assume writers and readers upgrade together. The discussion has to balance a reader's need to know whether it can safely consume a file against a writer's need to adopt new encodings without waiting years for the ecosystem to catch up. How the community answers will shape how fast recent additions like new types and the footer work reach production deployments.&lt;/p&gt;

&lt;p&gt;The path_in_schema work crossed a threshold this week. Ed Seidl posted an &lt;a href="https://lists.apache.org/thread/xyjxhwldjc3d0k2r39ls650zd8lr572c" rel="noopener noreferrer"&gt;update on making ColumnMetaData.path_in_schema optional&lt;/a&gt;, reporting that a third proof-of-concept implementation now exists in arrow-cpp and that a test file written without the field, created with arrow-rs, has been submitted to parquet-testing. With three implementations proving the change works, he then opened the &lt;a href="https://lists.apache.org/thread/gm7btrgvprdbrh8c5nv061f402txo0vt" rel="noopener noreferrer"&gt;formal vote on GH-563&lt;/a&gt;, drawing seven participants and six replies. The mechanics explain the payoff. path_in_schema carries the full column path inside every column chunk's metadata, information the footer schema already holds once. In a table with thousands of columns across many row groups, that duplication adds real bytes to every footer and real time to every metadata parse, so making the field optional trims fat from footers in wide tables. The careful PoC-first process is a model for how format changes should land: three independent implementations and a shared test file in parquet-testing are the diligence that makes a format-level change safe.&lt;/p&gt;
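
&lt;p&gt;A rough back-of-the-envelope calculation shows the scale of that duplication. This is a sketch under stated assumptions: the column count, row group count, and average path size below are illustrative numbers, not measurements, and the estimate ignores Thrift encoding overhead.&lt;/p&gt;

```python
# Rough estimate of footer bytes spent repeating column paths.
# All inputs are illustrative assumptions, not measured values.


def path_in_schema_bytes(num_columns, num_row_groups, avg_path_bytes):
    """Each column chunk in each row group repeats its column path once."""
    column_chunks = num_columns * num_row_groups
    return column_chunks * avg_path_bytes


overhead = path_in_schema_bytes(num_columns=5000, num_row_groups=100, avg_path_bytes=24)
print(f"{overhead / 1_000_000:.1f} MB of repeated paths")  # prints "12.0 MB of repeated paths"
```

&lt;p&gt;The schema itself stores each path once, so under these assumptions nearly all of those bytes are redundant.&lt;/p&gt;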

&lt;p&gt;Release energy built on two fronts at once. Gang Wu opened a &lt;a href="https://lists.apache.org/thread/htmof8dodkcmsxbsxrhpx4nq6b3yk600" rel="noopener noreferrer"&gt;discussion on releasing parquet-format 2.13.0&lt;/a&gt;, noting that about nine months have passed since 2.12.0 shipped on August 28, 2025, and that meaningful updates have accumulated since. Five participants weighed in across six replies. The momentum carried over to the Java side, where Fokko Driesprong proposed &lt;a href="https://lists.apache.org/thread/y0vjr64ofs4mftl23gy3b2twngjr9rr6" rel="noopener noreferrer"&gt;Apache Parquet 1.18.0&lt;/a&gt;, pointing out the project is well past its quarterly release rhythm with a lot of accumulated work. And Ismaël Mejía proposed &lt;a href="https://lists.apache.org/thread/jzjx3wcgo800166myz0k1993w8gwvd0b" rel="noopener noreferrer"&gt;bumping the minimum Java version to 17 for Parquet Java&lt;/a&gt;, since Java 17 has been the baseline LTS since September 2021, nearly five years ago. Iceberg made the same move in its 1.11 cycle, so the lakehouse Java stack is converging on 17 as its floor.&lt;/p&gt;

&lt;p&gt;Two long-running technical threads advanced. Rahul Sharma &lt;a href="https://lists.apache.org/thread/zo8r2q2l02rkfyk1k8ytocvp50tbnmrl" rel="noopener noreferrer"&gt;revived the INT96 stats discussion&lt;/a&gt; with a concrete plan to land Option 1 from Micah Kornfield's earlier summary, keeping INT96 ordering undefined in the format while letting readers opt in through an allow-list, and he has an open parquet-java PR to do it. INT96 timestamps are deprecated but far from dead in stored data, so pragmatism beats purity here. And Russell Spitzer nudged the &lt;a href="https://lists.apache.org/thread/c17snxsfpgstxkfm8yd6psss44z6ywpd" rel="noopener noreferrer"&gt;discussion on a new File logical type&lt;/a&gt;, asking whether the proposal mentioned at the last sync exists yet, in the thread Burak Yavuz started in April. A File type gives Parquet a first-class way to reference external content, which matters more as multimodal and AI workloads push files-about-files into analytics tables.&lt;/p&gt;

&lt;p&gt;Community mechanics stayed healthy. Julien Le Dem ran the &lt;a href="https://lists.apache.org/thread/qz9zh1znv1r3p2xzn0c949bv9psr5gyw" rel="noopener noreferrer"&gt;June 3 sync&lt;/a&gt; and then asked for a &lt;a href="https://lists.apache.org/thread/d474q6yq5onl41m9y8rsyojh3dn6o3wc" rel="noopener noreferrer"&gt;volunteer facilitator for the June 17 sync&lt;/a&gt; while he is on vacation, with three replies already sorting out coverage. Andrew Bell asked the evergreen newcomer question, &lt;a href="https://lists.apache.org/thread/fb1rb9twobb1rtkrf8kyzjy9cndnwhso" rel="noopener noreferrer"&gt;where to find test files to validate a reader&lt;/a&gt;, and got pointed at the project's testing resources. Micah Kornfield led the month's activity with 16 messages, with Gang Wu close behind at 14.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache DataFusion
&lt;/h2&gt;

&lt;p&gt;New to this newsletter, Apache DataFusion is the Rust-native query engine that increasingly powers the execution layer of the lakehouse, and its dev list runs at a pace that fits its roughly monthly major release cadence.&lt;/p&gt;

&lt;p&gt;A quick orientation for readers meeting the project here for the first time. DataFusion is an embeddable query engine written in Rust and built on Apache Arrow's columnar memory model. Where Iceberg defines tables, Polaris catalogs them, and Parquet stores them, DataFusion reads, plans, and executes queries over all of it, and a long list of commercial and open source systems embed it as their execution core. The project started life inside Arrow before graduating to its own top-level Apache project, and its subprojects extend the engine in different directions. Comet accelerates Spark by translating Spark physical plans into DataFusion execution. Ballista distributes execution across nodes. A growing set of language bindings carries the engine beyond Rust, and this week showed exactly why that structure earns DataFusion a permanent slot in this newsletter.&lt;/p&gt;

&lt;p&gt;The week proved the point on cadence too. Andrew Lamb cut the release-54 branch on May 21 and, by June 4, had the &lt;a href="https://lists.apache.org/thread/bvv9tmw3tj1x6kwzxvgfv2xjnt0z4w8b" rel="noopener noreferrer"&gt;vote open on DataFusion 54.0.0 RC1&lt;/a&gt;, based on commit 45d943df, with six participants already verifying the candidate. From branch cut to release candidate in two weeks is normal operating speed for this project, and it is worth pausing on how unusual that is for a foundation project with this many downstream consumers.&lt;/p&gt;

&lt;p&gt;The bigger strategic story is the JVM. Andy Grove, who drove the month with 18 messages, announced that the &lt;a href="https://lists.apache.org/thread/fy8boypz07mfxhd2wj98lk1oyf77sfck" rel="noopener noreferrer"&gt;vote on Apache DataFusion Java 0.1.0 RC1 passed&lt;/a&gt; with five +1 votes, three of them binding, making it the first release of the new DataFusion Java subproject. The bindings, which Andy seeded in mid-May as a minimal JNI bridge that registers Parquet tables and executes queries from the JVM, plan and run everything in native Rust and hand results back to Java. He owned a timing mistake in the vote process with characteristic transparency, and the release stands. The significance is hard to overstate for this audience. The data ecosystem's center of mass still runs on the JVM, and DataFusion Java gives every Java shop a path to Rust-speed query execution without leaving their stack. The 0.1.0 scope is deliberately small, enough to register Parquet tables and run SQL and DataFrame queries end to end, which is the right way to start a binding. Ship the thin slice, prove the JNI boundary holds, and grow the API with real users instead of guessing at one in advance. Anyone who watched PyIceberg or the Iceberg Go client grow from similar seeds knows how quickly a minimal binding becomes load-bearing infrastructure.&lt;/p&gt;

&lt;p&gt;Ballista, the distributed execution subproject, had a full week of its own. Marko Milenković announced that the &lt;a href="https://lists.apache.org/thread/7lgrydnyyglosnqj1gcm9nynbn7xz0n0" rel="noopener noreferrer"&gt;Ballista 53.0.0 release vote passed&lt;/a&gt; with five votes, four binding. Andy Grove published a test version of Ballista to test.pypi.org and &lt;a href="https://lists.apache.org/thread/o6d9lng6pmfqvc3fg853ckmmqqvjogoh" rel="noopener noreferrer"&gt;asked for help verifying the first Ballista PyPI release&lt;/a&gt;, which brings distributed DataFusion within pip-install reach. And Martin Grigorov relayed the team's proposal to &lt;a href="https://lists.apache.org/thread/k6f9sxg7f6xywstb4whqpdd5o9k7cn6q" rel="noopener noreferrer"&gt;drop the Windows CI workflows for Ballista&lt;/a&gt;, citing two reasons: the foundation-wide discussion about runner consumption and the lack of demonstrated interest in better Windows support for Ballista.&lt;/p&gt;

&lt;p&gt;That first reason connects to the same letter Iceberg received. Robert Thomson wrote to the DataFusion PMC about the project's &lt;a href="https://lists.apache.org/thread/07rr90dpwdc0mqk2mxd4drkco50d334x" rel="noopener noreferrer"&gt;consumption of ASF shared GitHub-hosted runners&lt;/a&gt;, with the shared pool at or very close to its limit under the foundation's 2024 GitHub Actions policy. Five replies in, DataFusion is already acting, and the Ballista Windows decision shows what the response looks like in practice. Projects are starting to treat CI minutes as a budget line and cut the platforms and matrices that do not earn their cost.&lt;/p&gt;

&lt;p&gt;Design work continued underneath the release activity. Gene Bordegaray proposed &lt;a href="https://lists.apache.org/thread/14d9fthyoyq76xd3yb89swxclvw91jfp" rel="noopener noreferrer"&gt;introducing a Range partitioning variant&lt;/a&gt; to the engine. DataFusion currently models partitioning as Hash, RoundRobinBatch, or UnknownPartitioning, and that vocabulary cannot accurately represent some real partitioning schemes, which limits what the optimizer can prove about data layout. A Range variant lets the engine reason about ordered, range-partitioned data, which is exactly the shape of most lakehouse tables. Andy Grove also surfaced a &lt;a href="https://lists.apache.org/thread/rgzflbw0yq3l6r2216hmfh58fvlorc22" rel="noopener noreferrer"&gt;discussion about adding geospatial support in Comet&lt;/a&gt;, capturing a conversation that started in a now-closed PR so the wider community can weigh in. Comet, the Spark accelerator that translates Spark physical plans into DataFusion execution, is heading toward a 1.0.0 release targeted for July or August, with a proposed versioning policy and a plan to drop Spark 3.4 support under discussion since mid-May. Geospatial functions in Comet promise accelerated spatial analytics inside existing Spark deployments, no migration required.&lt;/p&gt;
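
&lt;p&gt;A small Python sketch shows why a Range variant helps an optimizer. It is not DataFusion's API. It assumes a hypothetical model in which sorted split points define contiguous key ranges, so a range predicate maps to a contiguous run of partitions. Under hash partitioning, by contrast, the same predicate generally has to touch every partition.&lt;/p&gt;

```python
import bisect

# Hypothetical model, not DataFusion code: sorted split points define
# contiguous key ranges. bounds [100, 200, 300] gives four partitions:
#   0: keys below 100
#   1: keys 100 up to 199
#   2: keys 200 up to 299
#   3: keys 300 and above


def partition_for(bounds, key):
    """Index of the partition whose range contains the key."""
    return bisect.bisect_right(bounds, key)


def partitions_to_scan(bounds, low, high):
    """Partitions that can hold keys in the inclusive range [low, high]."""
    return list(range(partition_for(bounds, low), partition_for(bounds, high) + 1))
```

&lt;p&gt;With bounds of 100, 200, and 300, a filter on keys between 150 and 250 only needs partitions 1 and 2. That is the kind of layout fact the current Hash, RoundRobinBatch, and UnknownPartitioning vocabulary cannot express.&lt;/p&gt;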

&lt;p&gt;For readers new to the project, the week is a fair sample of why DataFusion now belongs in this newsletter. One engine shipped a release candidate, a new language binding cut its first release, a distributed runtime shipped and reached for PyPI, a Spark accelerator marched toward 1.0, and the core team debated partitioning semantics, all in seven days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;The clearest cross-project signal this week came from outside the projects entirely. Robert Thomson delivered the same message to both the Iceberg and DataFusion PMCs in nearly identical letters: the ASF's shared pool of GitHub-hosted runners has been at or near its limit, and the 2024 GitHub Actions policy is now a constraint these communities have to plan around. The responses are already visible. Ballista is proposing to drop Windows CI, and Iceberg's fourteen-reply thread reads like the start of a real CI budget process. Put this next to the engine-version threads, with Iceberg debating Spark and Flink support windows, Polaris untangling Spark 3 and 4 code organization, Parquet proposing to raise its Java floor to 17, and Comet planning to drop Spark 3.4, and a single picture emerges. The lakehouse stack's support matrix has grown faster than the infrastructure that tests it, and 2026 is the year the bill arrived. Expect narrower version windows and leaner CI matrices across all five projects by year end.&lt;/p&gt;

&lt;p&gt;The second theme is the release train. Counting this week alone, the five projects had a release in active vote (DataFusion 54.0.0), a first-ever release completed (DataFusion Java 0.1.0), a release passed (Ballista 53.0.0), a release scheduled (Polaris 1.6.0 for June 26), two release discussions opened (parquet-format 2.13.0 and Parquet Java 1.18.0), a release in preparation (Arrow Java 20.0.0), a release plan forming (Iceberg C++ 0.3.0), and a release tracking issue opened (iceberg-rust 0.10.0). The post-1.11 lull some expected never happened. The ecosystem ships continuously now, and the projects that built lightweight release machinery, DataFusion above all, set the pace the others are converging toward.&lt;/p&gt;

&lt;p&gt;The third theme is quieter but more consequential. AI is moving inside the projects' own workflows. Wes McKinney mentioned, almost in passing, that his agents will do the development work on Conbench. Jean-Baptiste Onofré is restructuring how Polaris writes proposals so AI tools can participate in authoring and review, and the Directories proposal is the first test. Arrow spent late May debating automated Copilot review of pull requests. None of these communities is debating whether to use AI anymore. They are debating where it fits in governance, and that is a different and more mature conversation. The interesting question for the rest of 2026 is whether the foundation develops shared norms here, the way it did for release votes and IP clearance, or whether each project keeps writing its own rules. The volume of AI-assisted contributions is only going up, and the projects that decide their policies now, calmly and in public, will handle that volume better than the ones that wait for an incident to force the question.&lt;/p&gt;

&lt;p&gt;A final, smaller observation: watch the people who span projects. Tornike Gurgenidze drove a Flight SQL protocol proposal in Arrow and a credential-vending SPI refactor in Polaris in the same week, and Huaxin Gao moved a REST spec vote in Iceberg while converging the idempotency design in Polaris. The lakehouse is one stack, and its most effective contributors increasingly work it that way.&lt;/p&gt;

&lt;p&gt;For practitioners, the week's takeaway is about timing. The next ninety days bring DataFusion 54, Polaris 1.6.0, likely Parquet releases on both the format and Java sides, Arrow Java 20.0.0, and the first wave of post-1.11 Iceberg subproject releases. Teams planning platform upgrades get a rare window where the whole stack refreshes together, and the version-support threads above are advance notice of which older engine combinations are about to fall off the supported list. Reading the dev lists this week was cheaper than reading the release notes next quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;The DataFusion 54.0.0 vote should resolve within days, and the Parquet path_in_schema vote on GH-563 is the format change to watch. The Arrow is_update vote runs an extra week per JB's suggestion, and the Erlang donation moves to the incubator general list for IP clearance. Polaris has a packed June, with the Console repository decision, the Directories proposal review on PR #4613, and the 1.6.0 release targeted for June 26. The Parquet sync on June 17 still needs a facilitator, and ADBC Office Hours land on June 11. On the Iceberg side, the column update threads are the design work that will define the next spec cycle, and the ASF runner discussions on both the Iceberg and DataFusion lists deserve attention from anyone whose CI depends on foundation infrastructure. Next week also brings the Arrow community's next checkpoints on the Conbench revival and the arrow.range extension type discussion, plus whatever follow-up the Polaris Directories proposal draws once reviewers digest PR #4613. Keep an eye on the Parquet versioning thread above all, because its outcome touches every project in this newsletter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-06-04&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; and build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-06-04&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; and learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;, the O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;, the O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/ref=sr_1_5?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-5" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/ref=sr_1_7?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-7" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/ref=sr_1_9?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-9" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/ref=sr_1_16?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-16" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Weekly: New PC Chips, Credit Pricing, Stateless MCP</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:31:15 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/ai-weekly-new-pc-chips-credit-pricing-stateless-mcp-1eb9</link>
      <guid>https://dev.to/alexmercedcoder/ai-weekly-new-pc-chips-credit-pricing-stateless-mcp-1eb9</guid>
      <description>&lt;p&gt;&lt;em&gt;Week of May 28 to June 4, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This week the AI stack moved on three fronts at once. Coding tools reset their pricing and shipped new models, NVIDIA pushed its silicon into the Windows PC, and the Model Context Protocol started its run toward a stateless core. Here is what changed and why it matters for the people who build with these tools every day.&lt;/p&gt;

&lt;p&gt;The common thread is maturity. None of this week's news was a flashy demo. It was the work of making agents cheap to run, fast to serve, and safe to scale. Pricing models, silicon, and protocols are the plumbing of the agent era, and the plumbing got most of the attention this week. That is a sign the technology is moving from the lab into the budget line. The teams that win the next year will not be the ones with the flashiest model. They will be the ones who set sane cost caps, pick durable standards, and keep a human in the loop while everyone else chases the leaderboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: A Pricing and Model Reset
&lt;/h2&gt;

&lt;p&gt;The coding tool market spent the week sorting out two questions at the same time. Who has the best model, and who has the pricing developers will accept. Both answers shifted.&lt;/p&gt;

&lt;p&gt;Anthropic released Claude Opus 4.8 on May 28, 2026, and paired it with a feature called Dynamic Workflows for Claude Code, &lt;a href="https://lushbinary.com/blog/ai-coding-agents-comparison-cursor-windsurf-claude-copilot-kiro-2026/" rel="noopener noreferrer"&gt;according to a June 3 roundup of coding agents&lt;/a&gt;. Opus 4.8 is the newest model in the Claude line. Dynamic Workflows lets Claude Code plan and run multi-step tasks with less hand-holding from the developer. The terminal agent reads the repo, drafts a plan, and works through it step by step. That design suits long tasks like a refactor across many files or a migration that touches a dozen modules.&lt;/p&gt;

&lt;p&gt;The model release matters because of where Claude Code now sits in developer preference. A JetBrains survey of engineers with more than ten years of experience found that 46% picked Claude Code as their daily tool, against 9% for GitHub Copilot, &lt;a href="https://pasqualepillitteri.it/en/news/3392/github-copilot-cursor-claude-code-ai-coding-showdown-2026" rel="noopener noreferrer"&gt;as reported in a June 2 market analysis&lt;/a&gt;. That same analysis traced Cursor's revenue from 100 million dollars in annual recurring revenue in January 2025 to 1 billion by the second half of 2025. The tools that started agentic, rather than bolting an agent onto an editor, are growing fastest.&lt;/p&gt;

&lt;p&gt;The pricing story was louder. GitHub Copilot moved to usage-based billing on June 1, 2026, &lt;a href="https://scrimba.com/articles/best-ai-coding-assistants-2026/" rel="noopener noreferrer"&gt;a change tracked across several 2026 buyer guides&lt;/a&gt;. The old flat plans gave way to AI Credits. Copilot Pro at 10 dollars a month now buys a 1,500-credit monthly allowance instead of unlimited agent use, and heavy agent sessions burn through that pool fast. GitHub also added a 100 dollar Max plan for the developers who run agents all day. The shift drew a public backlash from users who liked the old flat rate.&lt;/p&gt;

&lt;p&gt;The backlash points at a real problem with agentic coding. Cost is hard to predict. A single long agent run can spend a large share of a monthly budget in one sitting. A Q1 2026 survey cited in one buyer guide found that 42% of developers ranked cost volatility as their top pain point, ahead of model reliability. Teams that adopt agents without a usage cap can get a surprise bill. The fix is simple to state and easy to skip. Set a team-level cap before you turn agents loose.&lt;/p&gt;
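
&lt;p&gt;What a cap means in practice can be sketched in a few lines of Python. This is a hypothetical guard a team might put in front of its agent runs, not any vendor's billing API. The &lt;code&gt;CreditBudget&lt;/code&gt; class and its method names are made up for illustration.&lt;/p&gt;

```python
# Hypothetical team-level budget guard; not a real vendor billing API.
class CreditBudget:
    def __init__(self, monthly_cap):
        self.monthly_cap = monthly_cap
        self.spent = 0

    def try_spend(self, credits):
        """Record the spend and return True only if it fits under the cap."""
        overrun = max(0, self.spent + credits - self.monthly_cap)
        if overrun:
            return False  # refuse the agent run instead of overspending
        self.spent += credits
        return True
```

&lt;p&gt;With a cap of 1,500 credits, a 1,200-credit run goes through and a following 400-credit run is refused, which turns a surprise bill into a visible decision.&lt;/p&gt;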

&lt;p&gt;Cursor answered the model question with its own engine. The company shipped Composer 2.5 in May 2026, an in-house long-horizon model that the lushbinary roundup says matches Opus 4.7 and GPT-5.5 on benchmarks at 0.50 dollars per million input tokens and 2.50 dollars per million output tokens. Cursor also added Build in Parallel, which runs several agent tasks at once, and a built-in pull request review step. Owning the model gives Cursor control over both price and latency, and it lets the company tune the model for the editing tasks its users run most.&lt;/p&gt;

&lt;p&gt;The rest of the field kept moving too. Windsurf rebranded as Devin Desktop, folding the editor into Cognition's agent brand. Google shipped Antigravity 2.0 with Gemini 3.5 Flash as a fast default model. Kiro switched to a credit model, with Kiro Pro at 20 dollars a month for 1,000 credits, which lines up with Cursor Pro on price. The pattern across all of them is the same. Flat unlimited pricing is fading, and credits or usage tiers are taking its place.&lt;/p&gt;

&lt;p&gt;Security tooling caught up to this shift. Salt Security launched Salt Code on June 2, 2026, &lt;a href="https://www.pr.com/press-release/970002" rel="noopener noreferrer"&gt;per the company's announcement&lt;/a&gt;. Salt Code enforces security policy inside the coding assistant itself. It works across Claude, Cursor, GitHub Copilot, Windsurf, Codex, and Gemini CLI, and it aims to make those tools generate policy-compliant code by default, from the first prompt to production. The pitch lands because of a number that keeps showing up in 2026 surveys. Roughly 48% of AI-generated code carries a security flaw, and about 75% of senior developers still review every snippet before merging. Agents shift where engineers spend time. They do not remove the need to spend it.&lt;/p&gt;

&lt;p&gt;There is a practical lesson in this week's coding news. Pick the tool that fits the work, not the leaderboard. A terminal agent like Claude Code suits a senior engineer running a big refactor. An inline tool like Copilot suits a junior developer who wants visual diffs and quick completions. Match the format to the task, set a cost cap, and keep a human in the review loop. Those three habits hold up no matter which logo wins the quarter.&lt;/p&gt;

&lt;p&gt;The conference calendar reinforced the direction. Anthropic ran its Code with Claude event in San Francisco on May 6, 2026, with stops in London on May 19 and Tokyo on June 10, &lt;a href="https://pasqualepillitteri.it/en/news/3392/github-copilot-cursor-claude-code-ai-coding-showdown-2026" rel="noopener noreferrer"&gt;as noted in the June 2 market analysis&lt;/a&gt;. The company also pushed Cowork, described as Claude Code for general computing. Cowork extends the agent model past code into spreadsheets, file work, and report drafting for people who do not write software. The business is shifting from a license for an assistant to the sale of work that gets done.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Competitive Picture
&lt;/h3&gt;

&lt;p&gt;The model race and the pricing race are reshaping who leads. Cursor's revenue tells the clearest story. The company reached 100 million dollars in annual recurring revenue in January 2025, 500 million by mid-year, and 1 billion by the second half of 2025, with revenue doubling about every two months through early 2026. That growth rate is rare even for fast business software, and it came from a tool that started agentic rather than adding an agent later.&lt;/p&gt;

&lt;p&gt;Claude Code grew on a different axis. The JetBrains survey that put Claude Code at 46% among senior engineers, against 9% for Copilot, reflects a network effect more than a marketing push. Anthropic's enterprise guide for scaling Claude Code became one of the most-read documents in the field, and the skills ecosystem around the tool gave teams a shared library of workflows. Each new skill makes the tool more useful, which draws more users, which produces more skills.&lt;/p&gt;

&lt;p&gt;Microsoft has not stood still. Through 2025 it shipped Copilot's agent mode, added Bring Your Own Key to plug in third-party models, and opened VS Code Insiders to Anthropic's protocols. The work reads as extension rather than redesign. Copilot stays the safe enterprise choice, with deep VS Code integration and reliable suggestions, but it rarely surprises a senior engineer with an idea they had not considered. That reliability suits large teams that value predictability over reach.&lt;/p&gt;

&lt;p&gt;The June 1 billing change put that position under more pressure. Moving longtime Pro users to credits, then adding a 100 dollar Max plan, asked the most loyal users to pay more or ration their agent runs. Some accepted it. Others looked at Cursor and Claude Code, where the agent model came first. Microsoft Build 2026 in San Francisco arrives within weeks and is expected to reset the board again, so this snapshot will age fast.&lt;/p&gt;

&lt;p&gt;The lesson for buyers is not to chase the leader. The leaderboard turns over every quarter. The durable move is to build on the standards all these tools share, MCP and A2A, so a switch between assistants costs a configuration change rather than a rewrite. Tools come and go. The integration layer is what you keep.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Coding Tools Now Differ
&lt;/h3&gt;

&lt;p&gt;The category split into four shapes in 2026, and the shape matters more than the brand. The first shape is the IDE plugin. GitHub Copilot lives inside VS Code and JetBrains as an extension, offers inline completions and an agent mode, and fits teams that want AI without changing their editor. The second shape is the VS Code fork. Cursor replaces the editor and makes AI a first-class part of the layout, and its Composer feature proposes edits across many files in one pass. The third shape is the terminal agent. Claude Code runs in the shell and treats the codebase as a conversation, not a stream of completions. The fourth shape is the bring-your-own-key open tool, like Cline and Cody, which lets a team plug in its own model and keys.&lt;/p&gt;

&lt;p&gt;These shapes handle different work. Inline completion saves keystrokes on small, local edits. A multi-file composer handles a feature that touches several files at once. A terminal agent handles a long task with many steps, like a framework upgrade or a test-suite rewrite. A team that standardizes on one shape for every task pays for it in mismatched fit.&lt;/p&gt;

&lt;p&gt;The credit math is the part teams keep getting wrong. Copilot Pro at 10 dollars now gives 1,500 credits a month. Inline completions stay light on credits. Agent runs do not. A single long agent session that reads a large repo, drafts a plan, and edits a dozen files can spend hundreds of credits in one sitting. Run three of those in a day, and the monthly pool runs dry by the second week. The 100 dollar Max plan exists for exactly this reason, and so does the new pressure to cap usage per developer and per team.&lt;/p&gt;
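
&lt;p&gt;A rough way to sanity-check the credit math is to divide the monthly pool by the daily burn. The sketch below is only illustrative. The 1,500-credit pool comes from the plan described above, while the per-session cost of 100 credits and the function name are assumptions, since real sessions vary widely.&lt;/p&gt;

```python
import math

def working_days_until_empty(monthly_credits, credits_per_session, sessions_per_day):
    """Return how many working days a credit pool lasts at a steady burn rate."""
    daily_burn = credits_per_session * sessions_per_day
    if daily_burn == 0:
        return math.inf  # no agent runs, so the pool never drains
    return math.ceil(monthly_credits / daily_burn)

# 1,500 credits is the pool quoted above; 100 credits per session is an assumption.
days = working_days_until_empty(1500, 100, 3)
print(days)  # 5
```

&lt;p&gt;Plugging in a team's own observed session cost gives a first estimate of where a per-developer cap should sit.&lt;/p&gt;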

&lt;p&gt;Cursor's move to its own Composer 2.5 model is a hedge against that volatility. When a tool rents its model from a frontier lab, its margin and its latency follow that lab's pricing. When it owns the model, it sets both. Composer 2.5 runs at 0.50 dollars per million input tokens and 2.50 dollars per million output tokens, which undercuts frontier pricing for the editing tasks Cursor users run most. Build in Parallel then spends that cheaper inference on several agent tasks at once, which turns a price advantage into a speed advantage.&lt;/p&gt;
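
&lt;p&gt;Per-token pricing is easier to reason about with a quick calculation. The sketch below uses the two Composer 2.5 rates quoted above. The session size is a hypothetical example, and the function name is made up for illustration.&lt;/p&gt;

```python
# Rates quoted above for Composer 2.5, in dollars per million tokens.
INPUT_RATE = 0.50
OUTPUT_RATE = 2.50

def session_cost(input_tokens, output_tokens,
                 input_rate=INPUT_RATE, output_rate=OUTPUT_RATE):
    """Dollar cost of one session under per-million-token pricing."""
    input_cost = (input_tokens / 1_000_000) * input_rate
    output_cost = (output_tokens / 1_000_000) * output_rate
    return input_cost + output_cost

# Hypothetical session: 2M tokens read, 400k tokens written.
cost = session_cost(2_000_000, 400_000)
print(f"{cost:.2f}")  # 2.00
```

&lt;p&gt;Output tokens cost five times as much as input tokens at these rates, so a session that writes a lot of code costs more than one that mostly reads.&lt;/p&gt;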

&lt;p&gt;The review burden is the quiet cost in all of this. One 2026 measurement found senior engineers now spend about 11.4 hours a week reviewing code against 9.8 hours writing it. AI shifted the bottleneck from writing to reading. That is why the security layer grew this week. Salt Code sits in the assistant and blocks policy violations before they reach a pull request, which moves some of the review load earlier, where it is cheaper to fix.&lt;/p&gt;

&lt;p&gt;Cowork points at where this goes next. Anthropic built it as Claude Code for general computing, aimed at people who do not write software, and it runs agent work across spreadsheets, files, and reports. The same agent pattern that refactors code can reconcile a budget or draft a status report. For data teams, that matters because the analyst down the hall now has an agent that can touch the same files and systems the engineers use. Governance has to cover both.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: NVIDIA Pushes Into the PC
&lt;/h2&gt;

&lt;p&gt;The biggest hardware story of the week did not come from a data center. It came from a laptop chip. NVIDIA CEO Jensen Huang presented the RTX Spark Superchip at Computex 2026 in Taipei on June 1, &lt;a href="https://www.cnbc.com/2026/06/02/nvidias-new-pc-chips-are-ceos-bid-to-own-every-part-of-ai-stack.html" rel="noopener noreferrer"&gt;as covered by CNBC&lt;/a&gt;. The chip is a system-on-chip for Windows PCs, and Huang said NVIDIA and Microsoft plan to reinvent the PC.&lt;/p&gt;

&lt;p&gt;The market read the move as a threat. Shares of AMD, Intel, and Qualcomm fell after the announcement. NVIDIA has owned the data center for years. Now it is going after the PC, and Wall Street took notice. The RTX Spark targets the edge, where phones and laptops run AI models on the device instead of calling the cloud for every request.&lt;/p&gt;

&lt;p&gt;The chip also marks a bet on Arm. Most PC CPUs have run on the x86 instruction set that Intel pioneered in the 1970s and AMD extended later. Arm's lower-power design went mainstream when Apple put it in the first iPhone in 2007, and Amazon brought it to the data center with Graviton in 2018. NVIDIA building an Arm-based SoC for Windows machines extends that arc into the consumer PC. Low power draw, not raw clock speed, is the selling point for on-device AI.&lt;/p&gt;

&lt;p&gt;On-device inference solves real problems. It cuts latency because the model runs next to the user. It improves privacy because data does not leave the device. It lowers cost because each query does not hit a cloud GPU. For developers, an Arm laptop with a strong NPU changes what a local model can do. A coding agent that runs partly on the device responds faster and keeps source code off third-party servers.&lt;/p&gt;

&lt;p&gt;The PC push sits inside a larger contest over inference. AI moved past the training phase, where NVIDIA has dominated, into deployment, where the field is wider. AMD submitted its MI355X system to the MLPerf Inference v6.0 benchmark on April 1, 2026, showing the reach of its ROCm software stack, &lt;a href="https://rocm.blogs.amd.com/blog/author/wei-ting-liao.html" rel="noopener noreferrer"&gt;per AMD's ROCm blog&lt;/a&gt;. Google's TPUs, AWS Inferentia, and Intel Gaudi all compete for inference workloads. Each wins on a different axis, whether that is latency, throughput, memory, or cost.&lt;/p&gt;

&lt;p&gt;NVIDIA's roadmap stretches past the PC. At its GTC conference earlier in 2026, the company set plans to ship its next-generation Vera Rubin system in the second half of the year, with a successor named Feynman after the physicist. NVIDIA also bought assets from inference startup Groq in a 17 billion dollar deal and signaled an inference-focused chip strategy. The RTX Spark is the consumer face of a plan that spans the whole stack, from the phone to the rack.&lt;/p&gt;

&lt;p&gt;For teams choosing inference hardware, the guidance has not changed. Define what speed means for your workload first. Time-to-first-token matters for chat. Tokens per second matters for batch serving. A chip that wins on throughput can still feel slow if its tail latency spikes under load. Benchmark two candidates with your real prompt lengths and output sizes before you commit. Hardware comparisons that ignore prompt length and tail latency mislead more than they help.&lt;/p&gt;

&lt;p&gt;The edge trend has a second-order effect worth watching. As more inference moves to the device, the cloud bill shifts from per-query GPU time toward model distribution and sync. Smaller models that fit on a laptop NPU get more attention. Distillation, quantization, and mixture-of-experts routing all help a model run on less silicon. The RTX Spark gives those techniques a large new install base to target.&lt;/p&gt;

&lt;h3&gt;
  
  
  Picking Inference Hardware in 2026
&lt;/h3&gt;

&lt;p&gt;The RTX Spark sharpened a choice that data and ML teams face all year: which inference hardware fits the workload. The field now has five families in regular production use. NVIDIA GPUs like the H200 and the Blackwell line lead on raw performance and on software maturity through CUDA. AMD's Instinct line, including the MI300X and the newer MI355X, wins on memory capacity, which cuts the need to shard a large model across many chips. Google Cloud TPUs, including the Trillium generation, fit teams already inside Google Cloud. AWS Inferentia fits teams on AWS. Intel Gaudi 3 and the Xeon 6 line make a case for CPU-plus-accelerator inference at the edge.&lt;/p&gt;

&lt;p&gt;Speed has more than one meaning, and that trips up buyers. Time-to-first-token measures how fast a chat reply starts. Tokens per second measures how much a system serves at scale. Tail latency measures the worst-case response under load. A chip that posts a great tokens-per-second number can still feel slow in a product if its first token lags or its tail latency spikes when traffic climbs. The MLPerf Inference suite tracks these across vendors, and the v6.0 round on April 1, 2026, drew submissions from NVIDIA, AMD, Google, and several startups.&lt;/p&gt;
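
&lt;p&gt;The three meanings of speed can be computed from the same benchmark log. The sketch below assumes each request was recorded as a tuple of time-to-first-token, total latency, and output tokens, and that the wall-clock duration of the run is known. The sample numbers and function names are hypothetical, and the percentile uses the simple nearest-rank method.&lt;/p&gt;

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(requests, wall_clock_seconds):
    """requests: list of (first_token_seconds, total_seconds, output_tokens) tuples."""
    first_token = [r[0] for r in requests]
    totals = [r[1] for r in requests]
    tokens = sum(r[2] for r in requests)
    return {
        "median_ttft_s": percentile(first_token, 50),   # how fast a reply starts
        "p99_latency_s": percentile(totals, 99),        # worst-case tail
        "tokens_per_s": tokens / wall_clock_seconds,    # throughput at scale
    }

# Hypothetical run: five requests served over 10 seconds of wall-clock time.
samples = [(0.20, 2.0, 100), (0.25, 2.2, 110), (0.22, 2.1, 105),
           (0.30, 2.4, 120), (0.90, 6.0, 115)]
print(summarize(samples, wall_clock_seconds=10))
```

&lt;p&gt;In this made-up run the throughput looks healthy, yet one slow request sets the tail at 6 seconds, which is the pattern the paragraph above warns about.&lt;/p&gt;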

&lt;p&gt;The memory angle decides many real cases. A model that fits in one chip's memory runs faster than the same model split across four, because sharding adds communication overhead. AMD's large-memory parts win here. A 70-billion-parameter model that needs sharding on a smaller GPU can run on a single large-memory accelerator, which removes the cross-chip traffic. For teams that serve big models, memory capacity often beats peak throughput on the spec sheet.&lt;/p&gt;
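
&lt;p&gt;A back-of-the-envelope estimate shows why capacity decides the sharding question. The sketch below counts memory for the weights only, at 16 bits per parameter, and ignores the KV cache and activations, so real deployments need headroom beyond these figures. The 80 GB and 192 GB chip sizes are assumed for illustration.&lt;/p&gt;

```python
import math

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def chips_needed(num_params, bytes_per_param, chip_memory_gb):
    """Minimum chips whose combined memory holds the weights (no KV cache or activations)."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / chip_memory_gb)

PARAMS = 70_000_000_000  # the 70-billion-parameter model from the text
FP16_BYTES = 2           # 16-bit weights

print(weight_memory_gb(PARAMS, FP16_BYTES))   # 140.0
print(chips_needed(PARAMS, FP16_BYTES, 80))   # 2 (assumed 80 GB accelerator)
print(chips_needed(PARAMS, FP16_BYTES, 192))  # 1 (assumed 192 GB accelerator)
```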

&lt;p&gt;The RTX Spark changes the math at the small end. On-device inference moves the cheapest, most private workloads off the cloud entirely. A laptop with a strong NPU runs a small model for code completion, document search, or local chat without a network round trip. That pushes interest toward models built to run small. Quantization shrinks a model's weights to fewer bits. Distillation trains a small model to mimic a large one. Mixture-of-experts routing turns on only part of a model per token. All three let a capable model fit on consumer silicon, and the RTX Spark gives those techniques a large install base.&lt;/p&gt;

&lt;p&gt;The market reaction shows the stakes. NVIDIA moving into the PC sent AMD, Intel, and Qualcomm shares down on June 1, because each of those companies counts on the PC chip market that NVIDIA just entered. The Arm angle adds to the pressure. NVIDIA, Apple, Amazon, and Qualcomm all build on Arm's power-saving design, while Intel and AMD built their businesses on x86. A strong Arm PC chip from NVIDIA pulls more of the market toward Arm, which is a slow but real shift in who sets the platform.&lt;/p&gt;

&lt;p&gt;The buying advice stays steady through all of it. Estimate the memory your model needs first. Benchmark two candidate platforms with your real prompt and output lengths, not a vendor's demo numbers. Watch tail latency under the load you expect, not the average. Then pick the platform you can scale and staff, because a chip you cannot get capacity for is not fast at any spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Standards and Protocols: MCP Goes Stateless
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol reached its biggest turning point since launch. The MCP maintainers published the 2026-07-28 specification release candidate on May 21, 2026, and the ten-week validation window for that candidate remained open this week, &lt;a href="https://blog.modelcontextprotocol.io/posts/2026-07-28-release-candidate/" rel="noopener noreferrer"&gt;per the Model Context Protocol blog&lt;/a&gt;. The final specification ships on July 28, 2026. This is the largest revision of the protocol since it appeared in late 2024.&lt;/p&gt;

&lt;p&gt;The headline change is that MCP is now stateless at the protocol layer. Six Specification Enhancement Proposals work together to get there. The practical effect is large. A remote MCP server that used to need sticky sessions, a shared session store, and deep packet inspection at the gateway can now run behind a plain round-robin load balancer. Clients route traffic on an Mcp-Method header and cache the list of tools for as long as the server allows. That makes MCP servers far easier to scale on ordinary HTTP infrastructure.&lt;/p&gt;

&lt;p&gt;The release candidate carries more than the stateless core. It adds an Extensions framework, a Tasks extension for long-running work, and MCP Apps for server-rendered user interfaces. It hardens authorization to line up with OAuth and OpenID Connect deployments. It also adds a formal deprecation policy so the protocol can change without breaking the integrations teams already shipped. The ten-week window gives SDK maintainers and client builders time to test the changes against real systems before the spec locks.&lt;/p&gt;

&lt;p&gt;A short bit of history explains why stateless matters. The June 2025 MCP update classified MCP servers as OAuth Resource Servers and required clients to implement RFC 8707 Resource Indicators to block token misuse. Those security gains came with a cost. Stateful sessions tied each client to a specific server instance, which made horizontal scaling hard. The new spec standardizes session creation, resumption, and migration, so a server restart or a scale-out event stays invisible to connected clients. That is the missing piece for running MCP at enterprise scale.&lt;/p&gt;

&lt;p&gt;MCP's reach is wide enough to make this revision a big deal. As of March 2026, MCP passed 97 million monthly SDK downloads and 81,000 GitHub stars, and every major AI vendor supports it, including Anthropic, OpenAI, Google, Microsoft, and AWS. Official SDKs exist for TypeScript, Python, C#, Java, and Swift. The community has published hundreds of public MCP servers. People call it the USB-C of AI for a reason. It gives any model one way to talk to any tool.&lt;/p&gt;

&lt;p&gt;Governance moved in step with the spec. In December 2025, Anthropic donated MCP to the Agentic AI Foundation, a directed fund inside the Linux Foundation co-founded by Anthropic, Block, and OpenAI. The maintainer team also grew. Clare Liguori joined the core maintainer group, and Den Delimarsky joined as a lead maintainer. Vendor-neutral governance gives enterprise buyers the confidence to build on a standard that no single company controls.&lt;/p&gt;

&lt;p&gt;MCP does not stand alone. The Agent2Agent protocol covers the other half of the agent stack. A2A marked its one-year milestone with more than 150 supporting organizations and deep integration across Google, Microsoft, and AWS platforms, &lt;a href="https://www.linuxfoundation.org/press/a2a-protocol-surpasses-150-organizations-lands-in-major-cloud-platforms-and-sees-enterprise-production-use-in-first-year" rel="noopener noreferrer"&gt;per the Linux Foundation&lt;/a&gt;. The two standards split the work cleanly. MCP defines how an agent connects to tools and data. A2A defines how agents talk to each other across vendor and org boundaries. Both live under the Linux Foundation, which keeps them complementary rather than rival.&lt;/p&gt;

&lt;p&gt;For builders, the stateless shift changes deployment math today. A team that wanted to run an MCP server for its data platform used to plan around session affinity. Now it plans around plain HTTP scaling. That lowers the operational cost of exposing a tool to agents. Expect more SaaS platforms and data systems to ship MCP servers once the final spec lands, because the cost of running one just dropped.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Stateless MCP Changes at the Wire
&lt;/h3&gt;

&lt;p&gt;The stateless core sounds abstract until you run a server. Before, an MCP client opened a session and stayed bound to one server instance for the life of that session. The server held state in memory, so a load balancer had to send every request from that client back to the same box. That meant sticky sessions, a shared session store for failover, and a gateway smart enough to inspect traffic. All of that adds cost and risk.&lt;/p&gt;

&lt;p&gt;The new spec removes the binding. A client routes requests on an Mcp-Method header, so a plain round-robin load balancer can send any request to any server instance. The server tells the client how long it can cache the list of tools through a time-to-live value, so the client does not re-fetch the tool list on every call. A server restart no longer drops the client, because there is no session to lose. This is the change that lets a company run an MCP server the same way it runs any other stateless web service.&lt;/p&gt;
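
&lt;p&gt;A toy simulation makes the two ideas concrete: any instance can answer any request, and the client re-fetches the tool list only after its time-to-live expires. This is not the MCP SDK or the wire format. The class names, tool names, and the &lt;code&gt;ttl_seconds&lt;/code&gt; field are assumptions for illustration, and time is passed in explicitly so the behavior is easy to follow.&lt;/p&gt;

```python
import itertools

TOOLS = ["query_table", "list_tables"]  # hypothetical tool names

def handle(instance_name, method):
    """A stateless handler: the reply depends only on the request, never on prior calls."""
    if method == "tools/list":
        return {"served_by": instance_name, "tools": TOOLS, "ttl_seconds": 300}
    return {"served_by": instance_name, "error": "unknown method"}

class RoundRobinBalancer:
    """Sends each request to the next instance in turn, with no session affinity."""
    def __init__(self, instance_names):
        self._next = itertools.cycle(instance_names)

    def send(self, method):
        return handle(next(self._next), method)

class ToolListCache:
    """Client-side cache that re-fetches the tool list only after the TTL expires."""
    def __init__(self, balancer):
        self.balancer = balancer
        self.tools = None
        self.expires_at = 0.0
        self.fetches = 0

    def get(self, now):
        if self.tools is None or now >= self.expires_at:
            reply = self.balancer.send("tools/list")
            self.tools = reply["tools"]
            self.expires_at = now + reply["ttl_seconds"]
            self.fetches += 1
        return self.tools

cache = ToolListCache(RoundRobinBalancer(["a", "b", "c"]))
cache.get(now=0)     # first call fetches from instance "a"
cache.get(now=100)   # still inside the TTL, served from the cache
cache.get(now=300)   # TTL expired, fetched again from instance "b"
print(cache.fetches)  # 2
```

&lt;p&gt;Because the handler keeps nothing between calls, restarting or adding an instance changes only which name shows up in &lt;code&gt;served_by&lt;/code&gt;.&lt;/p&gt;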

&lt;p&gt;The Extensions framework is the second big piece. It lets the protocol grow without bloating the core. The Tasks extension handles long-running work, so an agent starts a job, walks away, and checks back, instead of holding a connection open for minutes. MCP Apps lets a server render a user interface, so a tool shows a form or a chart inside the client rather than returning raw text. These extensions ship as optional add-ons, which keeps the base protocol small while giving advanced servers room to do more.&lt;/p&gt;

&lt;p&gt;The deprecation policy is the piece enterprises asked for. A formal policy means the protocol retires an old behavior on a known schedule, with warning, instead of breaking integrations without notice. That predictability is what lets a large company commit to building on MCP. It turns the protocol from a fast-moving open source project into something a platform team plans around for years.&lt;/p&gt;

&lt;p&gt;The split between MCP and A2A is worth keeping straight. MCP connects one agent to tools and data. A2A connects agents to each other across vendors and orgs. A concrete workflow shows the split. A planning agent uses A2A to hand a subtask to a specialist agent at another company. That specialist agent uses MCP to query a database and run a tool. The first protocol carries the handoff. The second carries the data access. Both live under the Linux Foundation, which keeps them aligned rather than competing.&lt;/p&gt;

&lt;p&gt;For data platforms, the timing is good. A team that ships an MCP server for its catalog or query engine after July 28 gets the stateless core from day one. That server scales on ordinary infrastructure, vends governed access, and shows up to every MCP-aware agent without custom glue. The cost of making a data system agent-ready dropped this week, and the standard to target is now clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic AI Moves From Demo to Default
&lt;/h2&gt;

&lt;p&gt;A theme ran under all three sections this week. Agents are no longer a demo. They are the default unit of work. The pricing changes, the on-device chip, and the stateless protocol all serve the same goal, which is running more agent work, more reliably, at lower cost.&lt;/p&gt;

&lt;p&gt;The numbers back the shift. The AI coding tools market sits near 12.8 billion dollars in 2026, up from 5.1 billion in 2024, and about 90% of professional developers use a coding tool daily, &lt;a href="https://www.nipralo.com/blogs/best-ai-coding-tools-2026" rel="noopener noreferrer"&gt;per a 2026 market review&lt;/a&gt;. The 2025 Stack Overflow Developer Survey found that 84% of developers use or plan to use AI tools, and 51% of professionals use them daily. Adoption is not the question anymore. Control is.&lt;/p&gt;

&lt;p&gt;Open source maintainers feel the change in a new way. A DataFusion blog post on May 28, 2026, by Tim Saucer made the point that a growing share of a library's users are not typing code at all, &lt;a href="https://datafusion.apache.org/blog/" rel="noopener noreferrer"&gt;per the DataFusion blog&lt;/a&gt;. They ask an agent to write it. The agent leans on whatever style it picked up during training, which rarely matches what a project wants. That pushes maintainers to publish machine-readable guidance, like skills files and idiomatic examples, so agents generate code the project will accept. The audience for documentation now includes the model, not just the human.&lt;/p&gt;

&lt;p&gt;Security stays the gap between demo and default. The 48% flaw rate in AI-generated code is the reason tools like Salt Code exist. The right pattern pairs an agent with a policy gate and a human reviewer. The agent drafts fast. The gate blocks the obvious mistakes. The human signs off on the merge. Teams that skip the last two steps trade speed today for incidents later.&lt;/p&gt;

&lt;p&gt;Enterprise buyers should treat 2026 as the year to set standards, not just pick tools. Decide which protocols you will support, which is now mostly MCP and A2A. Decide how you will cap agent cost. Decide who reviews agent output before it ships. Those three decisions matter more than the brand of any single assistant, because the assistants will keep changing every quarter while the standards settle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documentation Now Has Two Audiences
&lt;/h3&gt;

&lt;p&gt;The open source world is adjusting to a reader it did not design for. When a developer asks an agent to use a library, the agent writes the code, not the human. The agent draws on patterns it learned during training, and those patterns lag behind a project's current style. A library that shipped a cleaner API last quarter still gets old-style code from agents that learned the old way.&lt;/p&gt;

&lt;p&gt;Maintainers are responding with machine-readable guidance. Skills files, idiomatic examples, and clear API docs now serve the model as much as the person. A project that publishes a good skill teaches every agent that reads it how to write code the project will accept. That lowers the review burden on maintainers and cuts the number of pull requests that miss the house style.&lt;/p&gt;

&lt;p&gt;This changes how teams should write internal docs too. An engineering org that wants its agents to follow internal conventions has to write those conventions down in a form an agent reads. A style guide that lives in someone's head does not reach the agent. A style guide checked into the repo, with examples, does. The cost of undocumented conventions just went up, because now both the new hire and the agent pay it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Data Teams
&lt;/h2&gt;

&lt;p&gt;Data teams sit at the center of the agent shift, because agents are only as good as the data they can reach. An agent that writes SQL is useful. An agent that queries live, governed data and acts on the result is far more useful. That is the work that turns an agent from a chat box into a system of record.&lt;/p&gt;

&lt;p&gt;Three of this week's threads land directly on data work. MCP going stateless lowers the cost of exposing a data platform to agents through a standard interface. The on-device chip trend means analysts will run small models next to their notebooks for fast, private exploration. The pricing reset means data teams need a cost model for agent queries the same way they have one for warehouse compute.&lt;/p&gt;

&lt;p&gt;The hard part is governance. An agent that can read and act on data needs the same controls a human user gets. That means catalog-level permissions, audit logs, and a clear record of which agent ran which query. The lakehouse pattern fits this well, because an open catalog can vend credentials and enforce access in one place. As agents become the default query writer, the catalog becomes the control point for the whole system.&lt;/p&gt;

&lt;p&gt;A concrete example shows the pattern. Suppose an analyst asks an agent to find which regions missed their sales target last quarter and draft a summary. The agent connects to the data platform through an MCP server. The catalog checks the agent's permissions and vends a short-lived credential scoped to the tables the analyst can see. The agent runs the query, reads the result, and writes the summary. Every step lands in an audit log, tagged with the agent's identity and the analyst who triggered it. If the analyst lacks access to a table, the agent gets the same denial a person gets.&lt;/p&gt;
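
&lt;p&gt;The flow above can be sketched in a few lines. This is a simplified model, not any catalog's real API. The grant table, function name, credential fields, and the 900-second lifetime are all hypothetical. What it shows is the ordering: check the analyst's grants, log the decision, then vend a scoped credential or deny.&lt;/p&gt;

```python
GRANTS = {"ana": {"sales_by_region", "sales_targets"}}  # hypothetical analyst grants
AUDIT_LOG = []

def vend_credential(agent_id, analyst, tables):
    """Check the analyst's grants, log the decision, and return a scoped credential."""
    allowed = GRANTS.get(analyst, set())
    denied = sorted(set(tables) - allowed)
    AUDIT_LOG.append({
        "agent": agent_id,
        "on_behalf_of": analyst,
        "tables": sorted(tables),
        "decision": "deny" if denied else "allow",
    })
    if denied:
        # The agent gets the same denial the analyst would get.
        raise PermissionError(f"{analyst} cannot read: {', '.join(denied)}")
    # Short-lived credential scoped to exactly the requested tables.
    return {"scope": sorted(tables), "ttl_seconds": 900, "issued_to": agent_id}

cred = vend_credential("summary-agent", "ana", ["sales_by_region", "sales_targets"])
print(cred["scope"])  # ['sales_by_region', 'sales_targets']
```

&lt;p&gt;Logging before the deny path returns means refused requests land in the audit trail too, which is often the record a security team wants most.&lt;/p&gt;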

&lt;p&gt;That flow turns an open question into a governed action. The catalog is the control point, because it sits between the agent and the data and enforces access in one place. An open catalog that vends credentials and logs every request gives the security team a single seam to watch. As agents become the default query writer, that seam carries more traffic, which makes the catalog one of the busiest and most important parts of the stack.&lt;/p&gt;

&lt;p&gt;The takeaway for data engineers is direct. Treat agents as a new class of user. Give them a standard way in through MCP. Give them governed access through a catalog. Give them a cost budget and an audit trail. Do that, and the agent shift turns into a productivity gain instead of a governance headache.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Playbook for Adopting Agents This Quarter
&lt;/h2&gt;

&lt;p&gt;The week's news points at a short list of decisions every team should make now. None of them depend on which assistant wins. They hold up across tools, because they govern how you run agents, not which agent you run.&lt;/p&gt;

&lt;p&gt;Start with cost. Pick a per-developer credit cap and a per-team cap before you roll out an agent. Usage-based pricing means one long session can spend a big share of a budget in an afternoon. A cap turns a surprise bill into a known ceiling. Review the caps monthly, because the work shifts and the right number shifts with it.&lt;/p&gt;
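
&lt;p&gt;A cap is simple to model. The sketch below is a hypothetical in-house tracker, not a feature of any vendor's billing system. It refuses a spend when either the per-developer cap or the shared team cap would be exceeded, and the cap values in the example are made up.&lt;/p&gt;

```python
class CreditBudget:
    """Tracks spend against a per-developer cap and a shared per-team cap."""
    def __init__(self, per_developer_cap, team_cap):
        self.per_developer_cap = per_developer_cap
        self.team_cap = team_cap
        self.spent_by_developer = {}
        self.team_spent = 0

    def try_spend(self, developer, credits):
        """Record the spend and return True, or return False if a cap would be exceeded."""
        dev_total = self.spent_by_developer.get(developer, 0) + credits
        team_total = self.team_spent + credits
        if dev_total > self.per_developer_cap or team_total > self.team_cap:
            return False
        self.spent_by_developer[developer] = dev_total
        self.team_spent = team_total
        return True

budget = CreditBudget(per_developer_cap=500, team_cap=800)
print(budget.try_spend("dev_a", 400))  # True
print(budget.try_spend("dev_a", 200))  # False, would pass the developer cap
print(budget.try_spend("dev_b", 400))  # True, team total is now exactly 800
```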

&lt;p&gt;Set a protocol standard. For tool and data access, that means MCP. For agent-to-agent work, that means A2A. Both sit under the Linux Foundation, both have wide vendor support, and both will keep their shape as the spec settles. Building on these two now avoids the custom glue that becomes a maintenance burden later.&lt;/p&gt;

&lt;p&gt;Define the review path. Decide who signs off on agent output before it ships, and where the policy gate sits. The 48% flaw rate in AI-generated code is the reason this step is not optional. A policy gate inside the assistant catches the obvious problems early. A human reviewer catches the rest. Both stay in the loop until the data says otherwise.&lt;/p&gt;

&lt;p&gt;Treat agents as a class of user. An agent that reads and writes data needs the same controls a person gets. That means scoped permissions, an audit log, and a record of which agent ran which action. Give agents a standard way in, give them governed access, and give them a budget. Skip any of the three, and the productivity gain turns into a cleanup project.&lt;/p&gt;

&lt;p&gt;Plan for the edge. The RTX Spark and the broader on-device trend mean some inference moves to the laptop this year. Decide which workloads belong on the device, where latency and privacy matter most, and which belong in the cloud, where scale and large models matter most. A small local model for code completion and a large cloud model for hard reasoning is a sensible default split.&lt;/p&gt;

&lt;p&gt;Write your conventions down. Agents follow the guidance you publish, not the rules in your head. A style guide and example set checked into the repo reach both your new hires and your agents. This is the cheapest item on the list and the one teams skip most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Week in Brief
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.8 shipped on May 28 with Dynamic Workflows for Claude Code. GitHub Copilot moved to usage-based AI Credits on June 1 and added a 100 dollar Max plan, which drew a backlash over cost predictability. Cursor shipped its in-house Composer 2.5 model with parallel builds and pull request review. Windsurf became Devin Desktop, Google shipped Antigravity 2.0 with Gemini 3.5 Flash, and Kiro switched to credits. Salt Security launched Salt Code on June 2 to enforce policy inside coding assistants.&lt;/p&gt;

&lt;p&gt;NVIDIA presented the RTX Spark Superchip at Computex on June 1, an Arm-based system-on-chip for Windows PCs, and AMD, Intel, and Qualcomm shares fell on the news. The Model Context Protocol published its 2026-07-28 release candidate, moving the protocol to a stateless core with Tasks, MCP Apps, and an Extensions framework, with the final spec set for July 28. The A2A protocol passed 150 supporting organizations at its one-year mark.&lt;/p&gt;

&lt;p&gt;The through-line is control. The models keep getting better and cheaper to run. The open question for every team is how to govern the agents those models power, how to cap their cost, and how to keep a human in the loop. Those are the decisions that separate teams that ship from teams that clean up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free.&lt;/strong&gt; Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-04-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data.&lt;/strong&gt; Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-04-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community.&lt;/strong&gt; Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=06-04-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development.&lt;/strong&gt; Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis.&lt;/strong&gt; A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>mcp</category>
      <category>news</category>
    </item>
  </channel>
</rss>
