<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Merced</title>
    <description>The latest articles on DEV Community by Alex Merced (@alexmercedcoder).</description>
    <link>https://dev.to/alexmercedcoder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F288069%2Fb20116a9-b178-4ab1-bcb0-8aa28ed732b0.png</url>
      <title>DEV Community: Alex Merced</title>
      <link>https://dev.to/alexmercedcoder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexmercedcoder"/>
    <language>en</language>
    <item>
      <title>Apache Data Lakehouse Weekly: May 21-27, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 27 May 2026 14:50:46 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-may-21-27-2026-k3b</link>
      <guid>https://dev.to/alexmercedcoder/apache-data-lakehouse-weekly-may-21-27-2026-k3b</guid>
      <description>&lt;p&gt;The week after a major release tends to look quiet on a project's dev list. This one did not. With Iceberg 1.11.0 and 1.10.2 both out the door the week before, and Polaris 1.5.0 shipping right at the start of this window, you might have expected the lakehouse projects to take a breath. Instead the conversation shifted from "what are we shipping" to "what do we build on top of what we just shipped," and that turned out to be a busier and more interesting set of threads. Encryption moved from Iceberg core into the catalog layer. The REST spec picked up two new client-facing extensions. Arrow took a donation across the finish line and started arguing about whether a bot should review its pull requests. Parquet finally voted on a statistics change that had been circling the list for years. Taken together, the week was about the connective tissue of the lakehouse: the catalog, the protocol, the client contract, and the unglamorous governance work that keeps four independent projects interoperable.&lt;/p&gt;

&lt;h2&gt;Apache Iceberg&lt;/h2&gt;

&lt;p&gt;The single most consequential thread of the Iceberg week was procedural rather than technical. Ryan Blue posted the &lt;a href="https://lists.apache.org/thread/30voqgmho92o0bnv79qcz1rzlkwh5z39" rel="noopener noreferrer"&gt;RESULT for the vote to add an unregister endpoint to the REST spec&lt;/a&gt;, which passed with 16 +1 votes, 9 of them binding, and no dissent. That is a strong mandate for what is a deceptively important capability. Until now the REST catalog has had no standard way to drop a table's registration without also deleting its data and metadata, which matters enormously for the migration and multi-catalog scenarios that Iceberg increasingly has to support. As tables move between catalogs, or as a catalog needs to hand off ownership of a table to another system, unregister is the clean primitive that makes that safe. The breadth of binding support tells you this was not controversial in substance, only in getting the wording precise enough to standardize.&lt;/p&gt;

&lt;p&gt;Around that vote, a cluster of REST spec discussions showed the catalog protocol entering a more mature phase where the arguments are about extensibility and forward compatibility rather than core mechanics. Prashant Singh opened a &lt;a href="https://lists.apache.org/thread/xlqx6k7g625p38bxxy141wt02d00w2h4" rel="noopener noreferrer"&gt;discussion on adding an X-Iceberg-Client-Capabilities header to the REST spec&lt;/a&gt;, which came out of the Read Restrictions community sync on May 12. The idea is to give clients a standard way to advertise what they support so that servers can adapt their responses, which is exactly the kind of negotiation mechanism a protocol needs once it has many independent implementations that ship on different schedules. In a related vein, Alexandre Dutra summarized the outcomes from the catalog sync in a thread on &lt;a href="https://lists.apache.org/thread/wv64wgq9n9ydk0pblwphcjjz528vjx72" rel="noopener noreferrer"&gt;passing arbitrary information to request signers&lt;/a&gt;, refining the language around how clients hand context to the components that sign storage requests. Both threads point at the same underlying reality: the REST catalog is now the integration surface that the whole ecosystem leans on, and the community is carefully building the seams that let it evolve without breaking clients.&lt;/p&gt;

&lt;p&gt;Release work did not stop with 1.11.0. The non-Java implementations are all moving in parallel. Matt Topol opened the &lt;a href="https://lists.apache.org/thread/sgtobp6b3w9b3t4xdzo2xfgrsz960yxv" rel="noopener noreferrer"&gt;vote on Apache Iceberg Go v0.6.0 RC2&lt;/a&gt;, which drew nine participants and active testing, while Junwang Zhao started a &lt;a href="https://lists.apache.org/thread/3vdtx3m4xbcw5htj246yhjt6wrc5rjo8" rel="noopener noreferrer"&gt;discussion on releasing Iceberg C++ 0.3.0&lt;/a&gt; and Alex Stephen kicked off planning for &lt;a href="https://lists.apache.org/thread/81hjtpqgz8512grogr2cdrrwlo74szos" rel="noopener noreferrer"&gt;PyIceberg 0.12.0&lt;/a&gt; by pointing contributors at the milestone to flag anything still missing. Three client libraries in three languages all cycling toward releases in the same week is a good illustration of how the project has decoupled its implementations so they can each move at their own pace rather than waiting on the Java reference.&lt;/p&gt;

&lt;p&gt;The post-1.11.0 design conversation also got going. Steven Wu opened a &lt;a href="https://lists.apache.org/thread/kd183vz2v2y69v4kwbz5wbjfxvx3gf1f" rel="noopener noreferrer"&gt;discussion on Flink version support after Iceberg 1.11.0&lt;/a&gt;, working through how the project should manage the matrix of supported Flink versions going forward, a recurring maintenance question that every engine integration eventually has to confront. Stepan Stepanishchev proposed &lt;a href="https://lists.apache.org/thread/qvx6bj330vqr7t5q1x12tc0jsb2v0c3n" rel="noopener noreferrer"&gt;adding Flink SQL procedure support to Iceberg&lt;/a&gt;, which would let users invoke Iceberg maintenance and management operations through Flink's CALL statement the way they already can in Spark. Noritaka Sekiyama proposed &lt;a href="https://lists.apache.org/thread/vn4gglocg2g40p69mfrrh86qzkn1rr4b" rel="noopener noreferrer"&gt;adding an OpenTelemetry-based MetricsReporter to iceberg-core&lt;/a&gt; that would export ScanReport and CommitReport data to any OTLP-compatible backend, a genuinely useful piece of observability plumbing that drew seven participants and nine replies. Iceberg already ships metrics reporting interfaces, but standardizing on OpenTelemetry would let operators wire table-level telemetry into the same monitoring stack they use for everything else.&lt;/p&gt;

&lt;p&gt;Two threads dealt with the practical grind of running a large open source project. Robert Thomson, writing on behalf of ASF infrastructure, raised &lt;a href="https://lists.apache.org/thread/9gorr3b1c18f8yk2fys16knjmnrbkjff" rel="noopener noreferrer"&gt;Iceberg's consumption of the shared GitHub-hosted Actions runners&lt;/a&gt;, part of a foundation-wide effort to keep CI usage within the shared pool's limits. It drew eight participants quickly, because every active committer feels the pain of CI queueing. Max Konstantinov opened a &lt;a href="https://lists.apache.org/thread/qvsvdn0nsj4wv3ox004h0948xp0c83bk" rel="noopener noreferrer"&gt;discussion on sunsetting MkDocs for the project's versioned documentation&lt;/a&gt;, noting that the MkDocs project itself appears effectively abandoned with no new contributions in roughly 18 months, which makes it a risky foundation for the docs that the whole community depends on. These are not glamorous threads, but they are the kind of maintenance the project has to stay on top of to keep growing.&lt;/p&gt;

&lt;h2&gt;Apache Polaris&lt;/h2&gt;

&lt;p&gt;Polaris had the cleanest headline of the week. Jean-Baptiste Onofré &lt;a href="https://lists.apache.org/thread/wcokv8nt5jcktd5203d1t2bx2wmxsrdp" rel="noopener noreferrer"&gt;announced the release of Apache Polaris 1.5.0&lt;/a&gt;, following the &lt;a href="https://lists.apache.org/thread/pwk28kdy65zm6r86t5z7vph5grod0ghd" rel="noopener noreferrer"&gt;vote that passed&lt;/a&gt; with binding +1s from Robert Stupp, François Papon, Dmitri Bourlatchkov, Yong Zheng, and Onofré himself. The release headlines Apache Ranger support as an external authorizer, alongside CLI improvements and Helm chart work. Ranger integration is a meaningful step for enterprise adoption, because it lets organizations that already run Ranger for their broader data platform extend those same authorization policies to their Iceberg catalog rather than maintaining a separate access model. That this is the first release since the project's graduation discussions, and that it shipped on a clean RC0, says good things about where Polaris is in its maturity curve.&lt;/p&gt;

&lt;p&gt;The more forward-looking work was about authorization and delegation, which are clearly the project's center of gravity right now. Yufei Gu posted &lt;a href="https://lists.apache.org/thread/r3qy9sm3nmzrjh12t6hyrl04xcq3hklq" rel="noopener noreferrer"&gt;updates on the Delegation Service design document&lt;/a&gt;, noting that he and Onofré will co-author it and that the pull request reflects the latest direction on the pull-versus-push modes question. Sung Yun followed up on the &lt;a href="https://lists.apache.org/thread/fdl8141k0qsrhzskj4tnyc6jjkjtrtgo" rel="noopener noreferrer"&gt;dedicated sync on Polaris authorization&lt;/a&gt;, proposing to fold earlier authorization discussion into the new authorization SPI work. The throughline is that Polaris is building a pluggable authorization architecture rather than hardcoding a single model, which is the right call for a catalog that has to serve organizations with wildly different security requirements. Ranger landing in 1.5.0 is the first external authorizer; the SPI work is what makes the second and third ones tractable.&lt;/p&gt;

&lt;p&gt;Two threads showed Polaris reaching across project boundaries. Alexandre Dutra opened an &lt;a href="https://lists.apache.org/thread/8fk5yzo73o8dzsoyhhqhzp0mbst6tf4f" rel="noopener noreferrer"&gt;Iceberg 1.11 feature branch retrospective&lt;/a&gt;, evaluating the experience of maintaining a feature/iceberg-1.11 branch to stay ahead of upcoming Iceberg enhancements and deciding what to do with it now that 1.11 has shipped. Adam Szita started a &lt;a href="https://lists.apache.org/thread/fdcwd7bl7fopfxxsk0mx964sbcjwnmhn" rel="noopener noreferrer"&gt;discussion on Iceberg table encryption support in Polaris&lt;/a&gt;, picking up directly from the base encryption implementation that landed in Iceberg 1.11. This is the most important cross-project signal of the week and worth dwelling on. Iceberg shipped KMS-based key wrapping and encrypted data, delete, manifest, and manifest-list files in 1.11, but encryption is only useful end to end if the catalog knows how to manage and hand out keys. The fact that Polaris opened this thread within days of the Iceberg release shows how tightly the catalog and the table format now move together. The encryption story does not work unless both halves cooperate, and the community is treating it that way.&lt;/p&gt;

&lt;p&gt;There was also a healthy run of operational and integration discussion. Adnan Hemani followed up on an &lt;a href="https://lists.apache.org/thread/wonydo5hfpxsoym9m4ws1llz9rlshdtt" rel="noopener noreferrer"&gt;OpenLineage proposal&lt;/a&gt; for lineage tracking, Bill Bejeck floated a &lt;a href="https://lists.apache.org/thread/nk9sf1xbc6ljrclpb09w0gqjd5pm9sjj" rel="noopener noreferrer"&gt;diagnostics shell prototype&lt;/a&gt; to answer simple operational questions like how many tables a bootstrapped Polaris instance is managing, and Dmitri Bourlatchkov pushed on the practical question of how to land &lt;a href="https://lists.apache.org/thread/kzt7rn3h75nqw1mkr632s1g5f59w4vxn" rel="noopener noreferrer"&gt;generic table delegation in the Polaris SparkCatalog&lt;/a&gt;. Alexandre Dutra also proposed &lt;a href="https://lists.apache.org/thread/229mo9o87zyfyl290kfycf0q7kcsk1pb" rel="noopener noreferrer"&gt;forbidding special characters in entity names&lt;/a&gt; that most cloud providers reject or discourage, the kind of guardrail that prevents a class of cryptic failures down the line. And in a thread that captures where the industry's head is at, Dennis Huo proposed an &lt;a href="https://lists.apache.org/thread/518o8q58jnyd70gcok6j5mw9t4nco687" rel="noopener noreferrer"&gt;agentic eval meta-skill for extensibility and maintainability&lt;/a&gt;, exploring how the project should think about agentic development as a first-class tool rather than something contributors do off to the side.&lt;/p&gt;

&lt;h2&gt;Apache Arrow&lt;/h2&gt;

&lt;p&gt;Arrow's week was anchored by a donation reaching its conclusion. Sutou Kouhei posted the &lt;a href="https://lists.apache.org/thread/875r822dqjxlrc90q7lgczmv3jbk9btd" rel="noopener noreferrer"&gt;RESULT for the vote to donate the Apache Arrow Erlang implementation&lt;/a&gt;, which carried with four binding +1s from Sutou, Curt Hagenlocher, Matt Topol, and David Li. The next step is the IP clearance process and a vote on the incubator general list. Benjamin Philip, who has been shepherding the contribution, had earlier worked through the &lt;a href="https://lists.apache.org/thread/8rvtgmhhjqfcv0mp53qngbnlh8rpsmoo" rel="noopener noreferrer"&gt;grant documents for the Erlang implementation&lt;/a&gt;, filling out the IP clearance template, the contributor license agreement, and the software grant. The Erlang library is built on bindings to the Rust implementation, which is itself a nice illustration of how Arrow's investment in a strong Rust core is now letting new language bindings come together faster than a from-scratch implementation ever could. Every new language Arrow speaks is another place its columnar format becomes the default interchange layer, and getting there by wrapping Rust rather than reimplementing C++ is the efficient path.&lt;/p&gt;

&lt;p&gt;The release engineering that has become Arrow's signature continued without drama. Andrew Lamb posted RESULT threads for three Rust releases in close succession: &lt;a href="https://lists.apache.org/thread/ztfxxb7n0476ct1jzms2po923wdlnxr4" rel="noopener noreferrer"&gt;arrow-rs 56.2.1&lt;/a&gt;, &lt;a href="https://lists.apache.org/thread/4kjf39rdj1ydqqmbzx1nwrz8ppmg1q97" rel="noopener noreferrer"&gt;57.3.1&lt;/a&gt;, and &lt;a href="https://lists.apache.org/thread/hwf7jl88n9t9qs7rpzqc0zmf83gvy8mz" rel="noopener noreferrer"&gt;58.3.0&lt;/a&gt;, each approved with five +1 votes. Shipping two patch releases and a minor release across three release lines in a single stretch is the kind of cadence that signals a project with its release automation thoroughly sorted out. Rok Mihevc also closed the loop on the &lt;a href="https://lists.apache.org/thread/hd9f4rx41go3vvzdrblvmj5pzp3mhcz6" rel="noopener noreferrer"&gt;pyarrow-stubs donation&lt;/a&gt;, confirming that the software grant has been formally filed and published, which concludes that process and brings type stubs for PyArrow into the project proper.&lt;/p&gt;

&lt;p&gt;The design discussions leaned toward type system and protocol extensions. Florian Hölzlwimmer proposed an &lt;a href="https://lists.apache.org/thread/tcv35l8o8d33n176kb3qv4y45obcgjbn" rel="noopener noreferrer"&gt;arrow.range canonical extension type for bounded ranges&lt;/a&gt;, filling a gap in Arrow's type vocabulary. Tornike Gurgenidze opened two threads on the Flight and ADBC side: a &lt;a href="https://lists.apache.org/thread/82vzy0xpdnl8dwjw8979jkwbckxypct7" rel="noopener noreferrer"&gt;partitioned bulk ingest API for ADBC&lt;/a&gt; that would mirror the existing ExecutePartitions and ReadPartition read-side primitives on the write side, and a proposal to &lt;a href="https://lists.apache.org/thread/31b23z92vmd5vpp9p9z17941v5lg90zd" rel="noopener noreferrer"&gt;add dialect-related SqlInfo codes to Flight SQL&lt;/a&gt; so clients have a standard way to learn what SQL features a backend supports. There was also continued work on a Flight SQL field to &lt;a href="https://lists.apache.org/thread/sg1d3hwt1hlgzgh16wzbkrb0pzgqsf3n" rel="noopener noreferrer"&gt;signal whether a prepared statement is an update&lt;/a&gt;, with Jean-Baptiste Onofré suggesting the vote be extended to give more people time to weigh in.&lt;/p&gt;

&lt;p&gt;The thread that will resonate beyond Arrow was Sutou Kouhei's &lt;a href="https://lists.apache.org/thread/thq0dz19shxbrjypb81q5ltx8h0w54ob" rel="noopener noreferrer"&gt;discussion on enabling automatic GitHub Copilot review&lt;/a&gt; on apache/arrow pull requests. His framing was pragmatic: the project does not have enough human review bandwidth, and a Copilot pass could catch trivial problems before a human reviewer ever looks. This is the same underlying tension that Iceberg has been working through with its AI contribution guidelines, just approached from the reviewer side rather than the contributor side. It is a question every large open source project is going to have to answer, and seeing Arrow debate it openly on the dev list, weighing the value against the noise, is exactly how these norms should get set.&lt;/p&gt;

&lt;p&gt;There was also a low-level performance thread worth a mention for the systems-minded: Dan Mattheiss opened a &lt;a href="https://lists.apache.org/thread/omof0fq47tndfd80g5hwp2bvjmzvpb40" rel="noopener noreferrer"&gt;discussion on AVX2 SBBF probe for parquet/bloom_filter.cc&lt;/a&gt;, noting that arrow-go already shipped SIMD bloom filter probes and proposing the C++ side catch up. It is a reminder that the cross-language consistency Arrow promises also means keeping performance optimizations roughly in step across implementations.&lt;/p&gt;

&lt;h2&gt;Apache Parquet&lt;/h2&gt;

&lt;p&gt;Parquet's headline was a long-running saga finally reaching a vote. Gang Wu opened the &lt;a href="https://lists.apache.org/thread/h0k0hqo0sojqphnbnrkp8b0gmwdzq9on" rel="noopener noreferrer"&gt;vote to adopt the format change for PARQUET-2249, covering IEEE 754 total order and NaN-counts&lt;/a&gt;, drawing eight participants. This proposal has been circulating in various forms for a long time, and the &lt;a href="https://lists.apache.org/thread/pxzhf5hb3kjmsofhbnoojr0mzlw1xnms" rel="noopener noreferrer"&gt;discussion thread on adding nan_count to handle NaNs in statistics&lt;/a&gt; had a "bumping this one last time" quality to it before the vote finally opened. The substance matters more than the procedural relief. Floating-point NaN values break the assumptions that min/max statistics rely on, which means engines either produce wrong results when pruning row groups that contain NaNs or disable statistics-based pruning entirely on float columns to be safe. Standardizing a total order and a NaN count fixes that at the format level, so every engine can prune float columns correctly. It is the kind of fix that sounds narrow and is actually load-bearing for query performance on a very common data type.&lt;/p&gt;

&lt;p&gt;The other major structural conversation was the footer. Jiayi Wang posted the &lt;a href="https://lists.apache.org/thread/drcbmj12lgy17tdy751ym25k9n8kh9rk" rel="noopener noreferrer"&gt;kickoff for the Parquet Footer Working Group&lt;/a&gt;, setting up a dedicated forum to move the footer redesign forward more efficiently after it was discussed at the Parquet sync. The footer is where Parquet stores its metadata, and how it is structured determines how quickly a reader can open a file and figure out what is in it, which is increasingly a bottleneck as files get larger and workloads get more selective. Pierre Lacave contributed to the related &lt;a href="https://lists.apache.org/thread/rysx51probkqrzlc7tlr08dnx803hb0y" rel="noopener noreferrer"&gt;discussion on an alternative to the FlatBuffer footer, a lightweight byte-offset index&lt;/a&gt;, sharing that a similar pattern is in use in a custom file format his team is migrating toward Parquet. Standing up a working group is a signal that the community sees footer evolution as a multi-release effort that deserves focused attention rather than ad hoc threads.&lt;/p&gt;

&lt;p&gt;Release planning got going too. Fokko Driesprong opened a &lt;a href="https://lists.apache.org/thread/y0vjr64ofs4mftl23gy3b2twngjr9rr6" rel="noopener noreferrer"&gt;discussion on an Apache Parquet 1.18.0 release&lt;/a&gt;, noting that a lot of work has accumulated since the last major release and that it is overdue. Ismaël Mejía proposed &lt;a href="https://lists.apache.org/thread/jzjx3wcgo800166myz0k1993w8gwvd0b" rel="noopener noreferrer"&gt;bumping the minimum Java version for Parquet Java to 17&lt;/a&gt;, pointing out that Java 17 has been the baseline LTS since September 2021 and that holding the floor at 11 is increasingly costly. Mejía was also active on the performance front, &lt;a href="https://lists.apache.org/thread/r8wymql5k8550mxjqv92479fpcq3kfv6" rel="noopener noreferrer"&gt;sharing encoding and decoding hot-path optimizations and asking for code reviews&lt;/a&gt; on work he presented at the Parquet community sync. On the safety side, Steve Loughran circulated a &lt;a href="https://lists.apache.org/thread/0ow88ht69gdwypnn8gb7gjrr13lxf898" rel="noopener noreferrer"&gt;pull request hardening the variant readers&lt;/a&gt;, noting that while a malformed 1KB file triggering a multi-gigabyte allocation is not strictly a security issue, it is close enough to be worth fixing. And the community recognized that contribution work with an &lt;a href="https://lists.apache.org/thread/flxj8w6pc9pgqpv59qyy89jrzk9bwwtw" rel="noopener noreferrer"&gt;announcement that Ed Seidl has accepted an invitation to become a committer&lt;/a&gt;. There was also continued interest in the geospatial story, with Dewey Dunnington noting in a thread on &lt;a href="https://lists.apache.org/thread/rvh9c9m70dzwb913dt3ynfyx5qsjf7x8" rel="noopener noreferrer"&gt;geography test files with statistics&lt;/a&gt; that he had added geography statistics writing to SedonaDB via arrow-rs, closing a gap that had been flagged when the geospatial types blog post came out.&lt;/p&gt;

&lt;h2&gt;Cross-Project Themes&lt;/h2&gt;

&lt;p&gt;Two patterns connect these four lists this week, and both say something about where the lakehouse stack is heading.&lt;/p&gt;

&lt;p&gt;The first is that encryption has become a cross-project program rather than a single project's feature. Iceberg shipped the base table encryption implementation in 1.11, and within the same week Polaris opened a thread on how the catalog should support it. Encryption that protects data files but leaks key management to every client is not real protection, so the catalog has to be the trusted party that wraps, unwraps, and hands out keys under policy. You cannot reason about Iceberg encryption by reading the Iceberg list alone; the design only closes when you read the Polaris thread next to it. That is the lakehouse working as a coordinated platform, where a capability is split across the format and the catalog by design and the two communities build their halves in step.&lt;/p&gt;

&lt;p&gt;The second is that these projects are now wrestling, openly and on the record, with how AI fits into their development process. Arrow debated whether to let Copilot review pull requests. Polaris explored an agentic eval meta-skill as first-class project tooling. Iceberg is working through its AI contribution guidelines, and the word "agentic" is showing up in the Polaris topic cloud. These are not the same question, but they rhyme. The community is deciding, in the open, what role AI tools should play in producing and reviewing the code that underpins the open data stack, and it is doing so transparently rather than letting individual contributors quietly make those choices alone. The decisions made over the next few months will set norms that stick for years, and it matters that they are being made on public dev lists where the whole community can see the reasoning.&lt;/p&gt;

&lt;p&gt;There is a quieter third theme worth naming: protocol and format extensibility. Iceberg's client capabilities header, Arrow's Flight SQL dialect codes, and Parquet's footer working group are all the same instinct expressed in three places. Each project has reached the point where it has many independent implementations on different release schedules, and the central task is no longer adding features but building the negotiation and versioning seams that let those implementations evolve without breaking each other. That is what maturity looks like for an interoperability standard.&lt;/p&gt;

&lt;h2&gt;Looking Ahead&lt;/h2&gt;

&lt;p&gt;The Iceberg Go v0.6.0 release vote should close in the coming days, PyIceberg 0.12.0 planning should continue to firm up, and the C++ 0.3.0 discussion will likely turn into a release plan. Watch the X-Iceberg-Client-Capabilities header thread, because if it gains traction it becomes the mechanism through which a lot of future REST evolution gets negotiated. On the Polaris side, the encryption support thread is the one to follow, since it is the catalog half of the story Iceberg started in 1.11, and the Delegation Service design doc should continue to take shape. Arrow's Erlang donation moves to the incubator general list for IP clearance, and the Copilot review discussion is worth watching as a bellwether for how Apache data projects handle AI in their workflows. For Parquet, the PARQUET-2249 vote should close and move into implementation across engines, the Footer Working Group will likely publish a charter and cadence, and the 1.18.0 release planning plus the Java 17 baseline proposal will shape what the next Parquet Java looks like. The throughline for the weeks ahead is the same one that defined this week: the interesting work is increasingly in the layers that connect the projects to each other.&lt;/p&gt;

&lt;h2&gt;Resources &amp;amp; Further Learning&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-05-27&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; lets you build your lakehouse on Iceberg with a free trial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-05-27&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; walks through how Dremio brings the open lakehouse stack together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;, the O'Reilly book, is available as a free download.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;, the O'Reilly book, is available as a free download.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Weekly: Cheaper Coding Models, Custom Chips, and a Stateless MCP</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 27 May 2026 14:19:26 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/ai-weekly-cheaper-coding-models-custom-chips-and-a-stateless-mcp-963</link>
      <guid>https://dev.to/alexmercedcoder/ai-weekly-cheaper-coding-models-custom-chips-and-a-stateless-mcp-963</guid>
      <description>&lt;p&gt;The past week pushed three quiet shifts into the open. A coding model matched the frontier at a tenth of the cost. Custom chips started outgrowing Nvidia. And the protocol behind most AI agents got its biggest rewrite yet.&lt;/p&gt;

&lt;h2&gt;AI Coding Tools: Cursor Ships Composer 2.5 and the Price of Frontier Coding Drops&lt;/h2&gt;

&lt;p&gt;Cursor released Composer 2.5 on May 18. The headline is not the benchmark score. It is the price next to that score.&lt;/p&gt;

&lt;p&gt;Composer 2.5 scores 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1. Those numbers sit right next to Claude Opus 4.7 and GPT-5.5 on the same tests. The standard tier costs $0.50 per million input tokens and $2.50 per million output tokens. That works out to roughly one-tenth the cost per task of the frontier models it matches.&lt;/p&gt;

&lt;p&gt;The model runs on Moonshot AI's open-source Kimi K2.5 checkpoint. Cursor spent about 85% of its compute budget on its own post-training, including reinforcement learning on 25 times more synthetic coding tasks than Composer 2 used. The base model came from a Beijing lab. The behavior that developers actually feel came from Cursor's training pipeline. That split tells you something about where value lives in 2026. The base weights are increasingly a commodity. The post-training is the product.&lt;/p&gt;

&lt;p&gt;Composer 2.5 is built for long, tool-heavy sessions. It reads files, runs terminal commands, edits across many files, runs tests, and iterates on its own. Cursor tuned it for sustained work and better instruction following, not just raw puzzle-solving. The jump from Composer 2 was real: SWE-Bench Multilingual went from 73.7% to 79.8%, and Terminal-Bench 2.0 went from 61.7% to 69.3% in two months.&lt;/p&gt;

&lt;p&gt;The model is not the whole story for Cursor this month. The editor is turning into something closer to a control panel for a whole team's agents. Cursor 3.3 landed on May 7 with Build in Parallel, a feature that dispatches async subagents across independent steps of a plan at the same time. Cursor 3.5 followed on May 20 with multi-repo automations and shared canvases for team artifact access. The pattern is clear. Cursor wants to be the place where you manage many agents, not just the place where you autocomplete one line.&lt;/p&gt;

&lt;p&gt;Google answered fast. On May 19, one day after Composer 2.5 shipped, Google launched Antigravity 2.0 with Gemini 3.5 Flash at I/O 2026. Antigravity 2.0 targets the same agentic IDE seat as Cursor. It pairs multi-agent orchestration with a built-in Chromium browser, dynamic subagents, and scheduled background tasks. Two of the biggest names in the space shipped competing agentic IDEs within 24 hours of each other. That cadence is the real signal.&lt;/p&gt;

&lt;p&gt;Here is the part worth sitting with. Composer 2.5 does not beat Opus 4.7 or GPT-5.5 outright. On CursorBench v3.1, a test built around real Cursor workflows, it edges ahead. On Terminal-Bench 2.0 it ties Opus 4.7 at 69.3% but trails GPT-5.5 at 82.7%. Opus keeps an edge on deep architectural reasoning and long single-shot generation. So the frontier still leads on the hardest work. What changed is that "good enough for most tasks" now costs a fraction of what it did, and it runs inside the editor where the work already happens.&lt;/p&gt;

&lt;p&gt;For teams, this resets the math. A year ago the question was which single coding tool to standardize on. That question is gone. JetBrains research from April found that 90% of developers used at least one AI tool at work as of January 2026, and 74% used specialized development tools beyond plain chat. GitHub Copilot stayed the most adopted at 29% workplace usage, with Cursor and Claude Code tied at 18%. Most teams now run two or three tools in different roles. A common setup pairs Claude Code in the terminal for agentic work, Cursor or Copilot in the IDE for inline edits, and a chat window for thinking through design.&lt;/p&gt;

&lt;p&gt;The terminal-native agents keep gaining ground at the high end. The JetBrains 2026 survey recorded Claude Code jumping from 3% adoption in April 2025 to 18% in January 2026, a sixfold rise in nine months. The starker number is senior developer preference. When JetBrains asked developers with more than ten years of experience which tool they would pick for daily work, 46% chose Claude Code and 9% chose Copilot. The anthropics/claude-code repository now counts more than 126,000 GitHub stars. Codex passed 3 million weekly active users in March, up from 2 million a month earlier. None of these tools is winning outright. Each owns a slice.&lt;/p&gt;

&lt;p&gt;The interesting question for teams is which tool acts as the controller and which ones do the subtasks. A common arrangement puts a terminal-native agent like Claude Code in the controller role and hands specific jobs to Codex or Cursor. That arrangement is not stable. It has shifted twice already this year and will likely shift again before December. Picking a permanent stack right now is a bet against your own future workflow.&lt;/p&gt;

&lt;p&gt;One more change is reshaping the field, and it is about money, not models. GitHub paused new sign-ups for Copilot Pro and Pro+ in April. Copilot moves to AI Credits-based flex billing on June 1, keeping the $10 and $39 prices but swapping in credit pools. A new Copilot Max tier targets heavy individual users. Windsurf raised Pro from $15 to $20 a month and added a $200 Max plan bundling Devin. Cursor included double usage for the first week after Composer 2.5 to pull developers in for evaluation. The tools are competing on cost structure now, not only capability.&lt;/p&gt;

&lt;p&gt;The plain advice for a working developer in 2026 holds up well. A solo developer or hobbyist gets the best entry value from Copilot at $10 a month. For a full-time developer, Cursor Pro at $20 a month tends to pay for itself within the first week. A senior developer or technical founder who lives in the terminal gets the most from Claude Code on a higher tier, where the agentic depth justifies the price. Many people use more than one of these at once, and that is the sane default rather than a sign of indecision.&lt;/p&gt;

&lt;p&gt;Microsoft Build runs June 2 and 3 in San Francisco, so its announcements land just after this issue. The smaller two-day format and the agenda point at agents as the throughline. The seven session tracks include Agents and Apps, Azure AI Foundry, and a track on working with models. Microsoft framed 2026 as the year agentic tooling moves from announced to production-ready. Expect multi-agent orchestration, new APIs for deploying autonomous agents, and updates to Microsoft's MCP integrations. We will cover the actual announcements next week.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: Custom Chips Start Outgrowing Nvidia
&lt;/h2&gt;

&lt;p&gt;A shipment forecast this week marked a turning point in AI hardware. For the first time, custom AI chips are projected to outgrow Nvidia's GPUs.&lt;/p&gt;

&lt;p&gt;TrendForce projects 44.6% growth in ASIC shipments for 2026 against 16.1% growth for merchant GPUs. Alchip Chairman Johnny Shen confirmed the shift in comments reported by Digitimes on May 26. ASIC stands for application-specific integrated circuit, a chip designed for one job rather than general flexibility. The growth gap says buyers are moving toward purpose-built silicon faster than they are buying more general-purpose GPUs.&lt;/p&gt;

&lt;p&gt;The reason is straightforward. Nvidia's GPUs are general-purpose processors. They are powerful and flexible, built to run nearly any AI workload, but not tuned for any single one. AI inference, the ongoing job of running trained models against live queries, has overtaken training as the dominant compute load. For inference at scale, that flexibility carries a cost you no longer need to pay. A chip built for one model architecture can run it cheaper and cooler than a GPU that can run anything.&lt;/p&gt;

&lt;p&gt;This does not mean Nvidia is in trouble. Nvidia still holds roughly 70% to 80% of the AI accelerator market by revenue. Total Nvidia AI accelerator revenue could pass $150 billion in 2026. Losing share in percentage terms is not the same as losing money when the whole market is growing this fast. The real pressure on Nvidia is margin, not market exit. As hyperscalers diversify suppliers, they gain leverage to push pricing on next-generation parts.&lt;/p&gt;

&lt;p&gt;The custom-chip wave has clear backers. Broadcom co-designs Google's Tensor Processing Units and chips for Meta and others. It reported $8.4 billion in AI semiconductor revenue in a recent quarter. Alchip is forecasting a return to growth in 2026 as new 3-nanometer accelerator programs hit volume in the second quarter, with about 80% of its revenue landing in the second half of the year. Every major hyperscaler now ships in-house silicon: Google with TPU, AWS with Trainium, Microsoft with Maia, and Meta with its own designs.&lt;/p&gt;

&lt;p&gt;Nvidia is not standing still. Its Vera Rubin architecture, the successor to Blackwell, is in full production, with partner products arriving in the second half of 2026. Rubin is built on TSMC's 3-nanometer process with HBM4 memory and 336 billion transistors. Nvidia reports that, compared to Blackwell, it cuts inference token costs by a factor of 10 and cuts the number of GPUs needed to train mixture-of-experts models by a factor of 4. AWS, Google Cloud, Microsoft, and Oracle are among the first cloud providers set to deploy Vera Rubin instances. The architecture is tuned for mixture-of-experts models, the same design trend showing up across the field.&lt;/p&gt;

&lt;p&gt;AMD is running its own play. The Instinct MI400 launches in the second half of 2026 with 432GB of HBM4 memory and 40 petaflops of FP4 compute. S&amp;amp;P Global projects the MI400 will generate $7.2 billion in its first year, and AMD's data center GPU revenue is forecast to grow 114% year over year to $15 billion. AMD also locked in a multi-generational deal with Meta covering a 6-gigawatt deployment, the first tranche using MI450-based custom GPUs.&lt;/p&gt;

&lt;p&gt;What does this mean if you build with AI rather than sell chips? Inference is getting cheaper, and the savings will reach your bills. The same shift that lets Cursor sell near-frontier coding at a tenth of the cost is happening one layer down in silicon. Purpose-built inference chips, mixture-of-experts models that activate only the parameters they need, and architectures tuned for serving instead of training all point the same direction. Running models in production is on a steady path to costing less, which changes what is worth building.&lt;/p&gt;

&lt;p&gt;There is a catch that keeps Nvidia ahead even as custom chips grow faster. Its CUDA software ecosystem has more than a decade of tools, libraries, and developer habits built on top of it. Moving a workload off Nvidia means rewriting or re-tuning the code that runs on it, and that switching cost is real. Custom ASICs win where the workload is fixed and the volume is huge enough to justify the engineering, which describes a hyperscaler running one model at massive scale. It does not yet describe most teams, who still benefit from the flexibility of a GPU that runs whatever they throw at it.&lt;/p&gt;

&lt;p&gt;The other trend to watch is the move to the edge. NPUs in laptop-class chips from Intel, AMD, and Apple now deliver 40 to 50 TOPS of on-device inference. That is enough to run capable local models without a round trip to the cloud. The hybrid pattern, cloud for the hard reasoning and the device for latency-sensitive work, is becoming the default shape for AI apps rather than a niche choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards and Protocols: MCP Gets Its Biggest Rewrite, and the NSA Weighs In
&lt;/h2&gt;

&lt;p&gt;Two things happened to the Model Context Protocol this week, and they pull in opposite directions. The protocol got its largest revision since launch, and the NSA published its first formal security guidance for it. Both matter if you build agents.&lt;/p&gt;

&lt;p&gt;MCP is the open standard, created by Anthropic in late 2024, that lets AI models connect to external tools, databases, and services through one common interface. Instead of writing custom glue code for every new integration, you connect through MCP once. It has become a core building block of the agentic AI stack.&lt;/p&gt;

&lt;p&gt;On May 21, the maintainers locked the release candidate for MCP 2026-07-28. The final spec publishes on July 28. The ten-week gap gives SDK maintainers and client builders time to validate the changes against real workloads. This is the biggest revision since the protocol launched.&lt;/p&gt;

&lt;p&gt;The centerpiece is a stateless protocol core. The new version removes the session ID, the initialize handshake, and resumable streams. In plain terms, the protocol now runs on ordinary HTTP infrastructure without holding a connection open or tracking session state on the server. That is a major change for anyone running MCP in production. Stateless services scale across commodity servers far more easily than stateful ones, since any server can handle any request. Load balancing gets simpler, and recovering from a crashed node gets easier because no session state is lost with it.&lt;/p&gt;
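&lt;p&gt;The "any server can handle any request" property can be sketched in a few lines. This is only an illustration of statelessness in general: the request shape, the &lt;code&gt;Server&lt;/code&gt; class, and the &lt;code&gt;add&lt;/code&gt; tool are hypothetical and are not the MCP wire format.&lt;/p&gt;

```python
import json
from itertools import cycle


def handle_request(raw_request: str) -> str:
    """Handle one self-contained request with no server-side session state.

    Everything needed to answer travels in the request itself, so the
    function reads nothing that an earlier call left behind.
    """
    request = json.loads(raw_request)
    if request["tool"] == "add":
        result = sum(request["arguments"]["numbers"])
    else:
        result = None
    return json.dumps({"id": request["id"], "result": result})


class Server:
    """A hypothetical server node; it has a name but keeps no per-client state."""

    def __init__(self, name: str) -> None:
        self.name = name

    def serve(self, raw_request: str) -> str:
        return handle_request(raw_request)


# A round-robin "load balancer" over three interchangeable nodes.
nodes = cycle([Server("node-a"), Server("node-b"), Server("node-c")])

request = json.dumps({"id": 1, "tool": "add", "arguments": {"numbers": [2, 3, 5]}})
responses = [next(nodes).serve(request) for _ in range(3)]

# Every node returns the same answer because none depends on earlier calls.
print(responses[0])
print(len(set(responses)) == 1)
```

&lt;p&gt;If a node disappears between two calls, the next call simply lands on another node. Nothing has to be replayed or restored first.&lt;/p&gt;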

&lt;p&gt;The release brings more than the stateless core. An Extensions framework lets capabilities ship on their own timeline instead of waiting for a full spec release. Two extensions arrive with it. MCP Apps allow server-rendered user interfaces, so a tool can return a real UI rather than plain text. The Tasks extension handles long-running work, the kind of job that takes minutes or hours instead of returning instantly. Authorization now aligns more closely with OAuth and OpenID Connect, which matters for enterprise deployments. A formal deprecation policy means the protocol can change without breaking what teams already built.&lt;/p&gt;

&lt;p&gt;The timing of the NSA guidance is no accident. On May 20, the NSA's Artificial Intelligence Security Center published a Cybersecurity Information Sheet titled "Model Context Protocol: Security Design Considerations for AI-Driven Automation." The document runs 17 pages, carries identifier U/OO/6030316-26, and is the most careful security treatment of MCP to date.&lt;/p&gt;

&lt;p&gt;The core warning is structural. MCP flips the usual security model. In a normal API, clients query servers for data. With MCP, servers query data and execute actions on behalf of clients. That inversion means the mental model most engineers use to reason about API security points the wrong way. Access control is optional at the protocol level, which is exactly the gap the NSA flags.&lt;/p&gt;

&lt;p&gt;The guidance names specific risks. Serialization flaws can let bad input trigger unstable behavior. Trust boundary failures let one component over-reach into another. Unverified task propagation lets tasks pass between MCP servers without checking their origin, scope, or intent, which can leak sensitive context or fire unrelated tools. Session weaknesses can allow message replay or session hijacking. The NSA's central point is that these problems cannot be patched at isolated endpoints. They have to be addressed across the whole MCP environment.&lt;/p&gt;

&lt;p&gt;The practical advice is concrete. Validate every tool invocation against defined schemas, expected ranges, and the intended execution context. Log all tool and model invocations with their exact parameters and the identities involved. Use a filtering outgoing proxy or enterprise data-loss prevention for external MCP connections, with resource URLs pinned tightly enough to limit leakage. Prefer a local MCP server instance when processing private data. Align tools and models with data classification zones, so public tools handle public data and sensitive tools stay segregated and explicitly controlled.&lt;/p&gt;
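&lt;p&gt;The first two recommendations, validating every invocation and logging it with its parameters and caller, might look something like the sketch below. It uses only the Python standard library. The schema registry, tool name, table names, and identity string are all hypothetical, and a real deployment would more likely lean on a full schema validator than on this hand-rolled check.&lt;/p&gt;

```python
import json
import logging
from typing import Any

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("tool_audit")

# Hypothetical schema registry: expected type and allowed values per argument.
TOOL_SCHEMAS: dict[str, dict[str, dict[str, Any]]] = {
    "read_rows": {
        "table": {"type": str, "allowed": {"public_events", "public_users"}},
        "limit": {"type": int, "min": 1, "max": 1000},
    }
}


def validate_invocation(tool: str, arguments: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the call passes."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = [f"unexpected argument: {name}" for name in arguments if name not in schema]
    for name, rule in schema.items():
        if name not in arguments:
            errors.append(f"missing argument: {name}")
            continue
        value = arguments[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, rule["type"]) or isinstance(value, bool):
            errors.append(f"wrong type for {name}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"value not allowed for {name}")
        if "min" in rule and value not in range(rule["min"], rule["max"] + 1):
            errors.append(f"value out of range for {name}")
    return errors


def invoke(tool: str, arguments: dict[str, Any], identity: str) -> bool:
    """Validate the call, then log its exact parameters and the caller."""
    errors = validate_invocation(tool, arguments)
    record = {"identity": identity, "tool": tool, "arguments": arguments, "errors": errors}
    audit_log.info(json.dumps(record, sort_keys=True))
    return not errors  # the caller runs the tool only when this is True


print(invoke("read_rows", {"table": "public_events", "limit": 50}, "agent-7"))
print(invoke("read_rows", {"table": "payroll", "limit": 5000}, "agent-7"))
```

&lt;p&gt;Rejected calls are logged as well as accepted ones. An attempt to read a table outside the allowed set is exactly the kind of event an investigation will want to find later.&lt;/p&gt;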

&lt;p&gt;Read the two documents together and a picture forms. MCP adoption has outrun its governance. The protocol is now embedded in production workflows across finance, legal, and software, which means the NSA is describing live exposure in regulated industries, not a hypothetical. MCP stacks built in 2024 and early 2025 likely lack the authentication and privilege isolation now considered baseline. The 2026-07-28 spec hardens authorization and brings the stateless core, and the NSA guidance gives teams a checklist while they wait for it. If you ship anything touching MCP, both belong on your reading list this week.&lt;/p&gt;

&lt;p&gt;If you run agents in production, this week gives you a short to-do list. Audit your MCP servers for the access controls the NSA flags, since the protocol will not enforce them for you. Log every tool call with its parameters and the identity behind it, because you cannot investigate what you did not record. Plan for the stateless migration ahead of the July 28 final spec, especially if your current setup leans on session IDs or resumable streams. And test at least one of the cheaper coding models against your real workload before your next billing cycle, since the cost gap is now large enough to matter at team scale.&lt;/p&gt;

&lt;p&gt;The thread tying all three categories together is maturation. Coding models are competing on cost because capability has spread. Chips are specializing because workloads have settled into clear shapes. And the protocol layer is being rewritten for scale and locked down for security because it is running real production systems now. The experimental phase is closing. The infrastructure phase is here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt;: Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-27-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt;: Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-27-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt;: Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-27-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt;: Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt;: A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>mcp</category>
      <category>news</category>
    </item>
    <item>
      <title>Single-Node Data Engineering: DuckDB, DataFusion, Polars, and LakeSail</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Sun, 24 May 2026 00:50:59 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/single-node-data-engineering-duckdb-datafusion-polars-and-lakesail-mai</link>
      <guid>https://dev.to/alexmercedcoder/single-node-data-engineering-duckdb-datafusion-polars-and-lakesail-mai</guid>
      <description>&lt;p&gt;For the past decade, data engineering was synonymous with distributed clusters. If your dataset exceeded a few gigabytes, standard practice dictated spinning up an Apache Spark cluster on AWS EMR or Databricks. This distributed paradigm introduced massive operational complexity: managing JVM configurations, allocating executors, tuning shuffle partitions, and paying a substantial "serialization tax" to move data across network sockets and language runtimes. &lt;/p&gt;

&lt;p&gt;Recently, the data engineering landscape has experienced a single-node renaissance. Rather than scaling out to distributed clusters, teams are scaling up on single machines. Modern laptops ship with 12 or more CPU cores, fast NVMe SSDs capable of multi-gigabyte-per-second read throughput, and up to 128 GB of RAM. Cloud providers offer single virtual machines with hundreds of cores and terabytes of memory for a fraction of the cost of a Kubernetes or Spark cluster.&lt;/p&gt;

&lt;p&gt;This physical hardware evolution is only half the story. The true catalyst is a new generation of data technologies built on Apache Arrow, vectorized execution, and out-of-core memory management. Tools like DuckDB, Apache DataFusion, Polars, and LakeSail enable a single laptop or VM to process hundreds of gigabytes, and even terabytes, of data. You can now execute complex analytical pipelines locally or on a single node without the overhead of a distributed JVM runtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaybwsir7egzinto7160.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaybwsir7egzinto7160.png" alt="Architecture diagram showing the single-node data engineering ecosystem from local laptops to single-node engines querying S3" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Foundations: Columnar Memory and Apache Arrow
&lt;/h2&gt;

&lt;p&gt;To understand how single-node data engineering can process datasets that previously required hundreds of cluster nodes, you must look at how data is structured in memory.&lt;/p&gt;

&lt;p&gt;Traditional databases and processing runtimes designed for transactional workloads (OLTP) use row-oriented layouts. They store all fields of a single record contiguously in memory: &lt;code&gt;[User_ID, Age, Name]&lt;/code&gt;, followed by the next record. When executing analytical queries (OLAP) that only target a subset of columns (such as calculating the average age of users), a row-oriented engine must scan the entire record structure from memory. This process loads irrelevant data (like names and IDs) into the CPU's L1/L2 caches, leading to cache pollution and wasted memory bandwidth.&lt;/p&gt;

&lt;p&gt;Columnar query engines solve this inefficiency by storing data contiguously by column: &lt;code&gt;[Age, Age, Age]&lt;/code&gt; in one buffer, and &lt;code&gt;[Name, Name, Name]&lt;/code&gt; in another. The CPU only reads the specific columns required by the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row-Oriented Layout (OLTP):
┌──────────────────────────────┬──────────────────────────────┐
│ ID 1 │ Age 1 │ Name 1        │ ID 2 │ Age 2 │ Name 2        │
└──────────────────────────────┴──────────────────────────────┘

Columnar Layout (Arrow/OLAP):
┌──────────┬──────────┐ ┌──────────┬──────────┐ ┌──────────┬──────────┐
│ ID 1     │ ID 2     │ │ Age 1    │ Age 2    │ │ Name 1   │ Name 2   │
└──────────┴──────────┘ └──────────┴──────────┘ └──────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
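&lt;p&gt;The same contrast can be written as a small Python sketch using only the standard library. The three sample records are hypothetical, and &lt;code&gt;array&lt;/code&gt; stands in for a real columnar buffer. The point is which data each query has to walk through, not raw speed.&lt;/p&gt;

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    (1, 34, "Ada"),
    (2, 41, "Grace"),
    (3, 29, "Linus"),
]

# Columnar: each field lives in its own contiguous buffer.
ids = array("q", [1, 2, 3])       # 64-bit signed integers
ages = array("q", [34, 41, 29])   # 64-bit signed integers
names = ["Ada", "Grace", "Linus"]


def average_age_from_rows(records):
    """Walk every record, stepping over IDs and names just to reach the ages."""
    total = 0
    for _user_id, age, _name in records:
        total += age
    return total / len(records)


def average_age_from_column(age_column):
    """Read only the age buffer; the other columns are never visited."""
    return sum(age_column) / len(age_column)


print(average_age_from_rows(rows) == average_age_from_column(ages))
print(ages.itemsize * len(ages), "bytes scanned in the columnar case")
```

&lt;p&gt;Both functions return the same average. The columnar version gets there by scanning 24 bytes of ages and nothing else.&lt;/p&gt;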



&lt;p&gt;Apache Arrow standardizes this columnar memory layout. It defines an open-source, language-independent specification for in-memory columnar data. By establishing a shared memory format, Arrow eliminates the serialization tax that historically slowed down data pipelines. &lt;/p&gt;

&lt;p&gt;In traditional architectures, passing data between a Python script and a Java or C++ engine required serializing the data into a byte stream (like JSON or Protobuf) and deserializing it on the other side. In some pipelines this serialization tax could consume as much as 80% of the total query execution time.&lt;/p&gt;

&lt;p&gt;Arrow enables zero-copy Inter-Process Communication (IPC). Because Arrow represents data in memory exactly the same way across Python, Rust, and C++, different processes can memory-map (mmap) the same physical memory buffers. An engine can pass a dataset to Python for machine learning or visualization by exchanging memory pointers. No bytes are copied, and no serialization occurs.&lt;/p&gt;
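&lt;p&gt;A rough standard-library analogy shows the difference between the two paths. This is not Arrow itself. &lt;code&gt;memoryview&lt;/code&gt; shares a buffer inside a single Python process, whereas Arrow defines a format that lets different languages and processes do the same thing. The variable names are assumptions of this sketch.&lt;/p&gt;

```python
import json
from array import array

ages = array("q", range(1_000))

# Serialization path: encode to a byte stream, then decode on the other side.
# The consumer ends up with a brand-new copy of every value.
wire_bytes = json.dumps(ages.tolist()).encode("utf-8")
copied = array("q", json.loads(wire_bytes))

# Zero-copy path: hand the consumer a view over the producer's buffer.
# memoryview exposes the same memory; nothing is encoded or duplicated.
shared = memoryview(ages)

# A write through the producer is visible through the view, but not in the copy.
ages[0] = 42
print(shared[0])      # 42, read straight from the producer's buffer
print(copied[0])      # 0, the independent copy is unchanged
print(shared.nbytes)  # 8000 bytes exposed without copying any of them
```

&lt;p&gt;The view costs the same to create whether the buffer holds a thousand values or a billion, which is why pointer exchange scales where serialization does not.&lt;/p&gt;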

&lt;p&gt;Furthermore, Arrow stores each column in a contiguous, aligned buffer, which suits modern CPU cache lines and makes it straightforward to use Single Instruction, Multiple Data (SIMD) instruction sets (such as AVX-512 on Intel/AMD or Neon on ARM). SIMD allows the CPU to apply a single instruction (such as a filter comparison or an arithmetic addition) to a whole vector of data points at once instead of one value at a time. This hardware-level parallelism raises the amount of work done per instruction, so scans, filters, and arithmetic spend fewer cycles per value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kw9z4jwxnxnwv5quoxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kw9z4jwxnxnwv5quoxx.png" alt="Comparison diagram showing the row-based layout versus Apache Arrow's columnar in-memory format and zero-serialization pointer exchange" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  In-Process SQL Powerhouse: DuckDB Architecture &amp;amp; Features
&lt;/h2&gt;

&lt;p&gt;DuckDB has become the standard database engine for single-node SQL analytics. Designed as an in-process analytical database, DuckDB runs directly inside the host process (such as a Python interpreter or a CLI binary) rather than as a separate server daemon. This eliminates the network socket latency and IPC overhead of client-server databases like PostgreSQL or Snowflake.&lt;/p&gt;

&lt;p&gt;DuckDB's execution engine utilizes a vectorized query execution model. Rather than processing data one row at a time (the Volcano iterator model) or processing entire columns at once (which overflows L1/L2 caches for large tables), DuckDB processes data in small, cache-friendly vectors. These vectors typically contain 2048 elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Volcano Model:       [Row 1] ──► [Operator] ──► [Row 2] ──► [Operator]
Column-at-a-time:    [Entire Column (10M rows)] ──► [Operator] (Overflows Cache)
Vectorized Model:    [Vector of 2048 rows] ──► [L1/L2 CPU Cache] ──► [Operator]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By keeping these vectors small enough to fit inside the CPU's L1/L2 cache, DuckDB minimizes memory bandwidth bottlenecks. The CPU executes operations on the vectors using SIMD instructions, keeping the execution pipelines saturated with data.&lt;/p&gt;
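&lt;p&gt;The control flow of a vectorized pipeline can be sketched in plain Python. This is a teaching model, not DuckDB's implementation: the operator names are hypothetical, and Python slices copy data where a real engine would not. What it shows is that each operator is called once per vector of 2048 values, not once per row and not once per whole column.&lt;/p&gt;

```python
from array import array
from typing import Iterator

VECTOR_SIZE = 2048  # elements per vector, matching the figure in the text


def scan(column: array, vector_size: int = VECTOR_SIZE) -> Iterator[array]:
    """Yield the column as a sequence of fixed-size vectors."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def filter_greater_than(vectors: Iterator[array], threshold: int) -> Iterator[array]:
    """Apply the predicate to one whole vector per call."""
    for vector in vectors:
        yield array(vector.typecode, (v for v in vector if v > threshold))


def sum_all(vectors: Iterator[array]) -> int:
    """Fold each vector into a running total."""
    total = 0
    for vector in vectors:
        total += sum(vector)
    return total


column = array("q", range(10_000))
vector_count = sum(1 for _ in scan(column))
total = sum_all(filter_greater_than(scan(column), 9_989))

print(vector_count)  # 5 vectors: four full ones and a final partial one
print(total)         # 99945, the sum of 9990 through 9999
```

&lt;p&gt;Because the operators are chained generators, only one vector per operator is in flight at any moment, which is the property that keeps the working set cache-sized.&lt;/p&gt;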

&lt;p&gt;To handle datasets that exceed physical RAM, DuckDB implements out-of-core execution. When memory consumption reaches a user-defined limit, DuckDB's buffer manager automatically spills intermediate query states (such as hash join tables, sorting buffers, or aggregation states) to temporary disk files. This spilling mechanism uses a block-based buffer pool that evicts blocks to disk and reads them back on demand, allowing you to run queries on datasets that are multiple times larger than your system's RAM.&lt;/p&gt;
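&lt;p&gt;The general idea of spilling can be illustrated with a classic external sort. This is a simplified analogy, not DuckDB's buffer manager: the function names and the &lt;code&gt;memory_limit&lt;/code&gt; parameter (counted in values, not bytes) are assumptions of the sketch. Sorted runs that do not fit the budget go to temporary files, then get merged back as streams.&lt;/p&gt;

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator


def spill_sorted_run(values: list[int], directory: str) -> str:
    """Sort one in-memory chunk and write it to a temporary file."""
    values.sort()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{value}\n" for value in values)
    return path


def read_run(path: str) -> Iterator[int]:
    """Stream one spilled run back from disk, one value at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values: Iterable[int], memory_limit: int) -> list[int]:
    """Sort by spilling runs of at most `memory_limit` values to disk."""
    with tempfile.TemporaryDirectory() as directory:
        runs, chunk = [], []
        for value in values:
            chunk.append(value)
            if len(chunk) == memory_limit:
                runs.append(spill_sorted_run(chunk, directory))
                chunk = []
        if chunk:
            runs.append(spill_sorted_run(chunk, directory))
        # The merge itself keeps only one value per run resident at a time.
        # The result is collected into a list here purely for the demo.
        return list(heapq.merge(*(read_run(path) for path in runs)))


data = [(i * 7919) % 10_007 for i in range(10_000)]
result = external_sort(data, memory_limit=1_000)
print(result == sorted(data))
```

&lt;p&gt;A real engine would stream the merged output onward instead of collecting it, and would size its budget in bytes. The trade is the same, though: extra disk I/O in exchange for a bounded memory footprint.&lt;/p&gt;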

&lt;p&gt;The latest release, v1.5.3 (May 2026), introduced several updates that expand DuckDB's single-node utility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Quack Remote Protocol:&lt;/strong&gt; DuckDB now ships with a core extension implementing the Quack protocol. This protocol allows users to run DuckDB in a client-server configuration when needed, facilitating remote attachments and remote query orchestration without losing the simplicity of the engine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ecosystem and Format Updates:&lt;/strong&gt; The Iceberg extension has been upgraded to support &lt;code&gt;MERGE INTO&lt;/code&gt; operations, making it possible to execute complex delta updates on Iceberg tables directly from a local DuckDB session.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS Security and IRSA:&lt;/strong&gt; Native support for IAM Roles for Service Accounts (IRSA) has been added, simplifying secure S3 access when running DuckDB inside containerized single-node pipelines.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Static Linking:&lt;/strong&gt; The distribution now statically links &lt;code&gt;jemalloc&lt;/code&gt; on Linux platforms, improving memory allocation speed and reducing fragmentation during heavy out-of-core spilling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following Python script illustrates how to configure DuckDB's memory limits, register an S3 credential using the new AWS extension features, and run a query that spills to disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize DuckDB connection
&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local_cache.db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set memory limit to force out-of-core spilling on smaller datasets
&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET max_memory=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;8GB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET temp_directory=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./duckdb_temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load S3 and AWS extensions (built-in in v1.5.3)
&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSTALL aws;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOAD aws;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Autodetect AWS credentials from environment (supports IRSA)
&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CALL load_aws_credentials();&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query a large Parquet dataset directly on S3 with predicate pushdown
# DuckDB only downloads the columns and row groups that match the filter
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT 
        user_id, 
        COUNT(event_id) as event_count,
        AVG(session_duration) as avg_duration
    FROM read_parquet(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://my-lakehouse/bronze/events/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
    WHERE event_date &amp;gt;= &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    GROUP BY user_id
    HAVING event_count &amp;gt; 1000
    ORDER BY avg_duration DESC
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Execute and stream results
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchdf&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DuckDB's combination of SQL support, vectorized performance, and out-of-core stability makes it a core tool for local analytical workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2z3vgenqcaqgkewcjtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2z3vgenqcaqgkewcjtn.png" alt="DuckDB vectorized execution architecture showing chunked vector pipelines inside CPU cache and out-of-core spilling to SSD temp files" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extensible Rust Processing: Apache Arrow DataFusion
&lt;/h2&gt;

&lt;p&gt;While DuckDB is packaged as an analytical database, Apache DataFusion (formerly Apache Arrow DataFusion) is designed as an extensible query engine framework. Written in Rust and using Apache Arrow as its native in-memory format, DataFusion is widely used to build other databases, query engines, and custom data platforms (including Bauplan, Spice.ai, and LakeSail).&lt;/p&gt;

&lt;p&gt;DataFusion's design is modular. It decouples the query planning, optimization, and execution stages. If you are building a custom data tool, you can register custom catalogs, write user-defined logical optimization rules (like custom predicate pushdowns), or plug in custom physical execution nodes.&lt;/p&gt;

&lt;p&gt;For thread-level parallelism, DataFusion relies on Rust's asynchronous Tokio runtime. Rather than dedicating an operating-system thread to each partition, DataFusion represents each partition of the physical plan as an asynchronous stream of Arrow &lt;code&gt;RecordBatch&lt;/code&gt; objects and lets Tokio's work-stealing scheduler multiplex those streams across its worker thread pool. This keeps the available cores busy and prevents a task that is waiting on I/O from blocking a thread.&lt;/p&gt;
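&lt;p&gt;To make the stream-per-partition model concrete, here is a rough Python analogy built on &lt;code&gt;asyncio&lt;/code&gt; async generators. It is only a sketch, not DataFusion code: plain lists stand in for &lt;code&gt;RecordBatch&lt;/code&gt; objects, the names &lt;code&gt;scan&lt;/code&gt;, &lt;code&gt;keep_even&lt;/code&gt;, and &lt;code&gt;run_partition&lt;/code&gt; are hypothetical, and &lt;code&gt;asyncio&lt;/code&gt; interleaves tasks on a single thread whereas Tokio spreads them across many.&lt;/p&gt;

```python
import asyncio


async def scan(partition_id, num_batches):
    """Yield batches for one partition, pausing as a real scan would on I/O."""
    for i in range(num_batches):
        await asyncio.sleep(0)  # yield control, like awaiting an object-store read
        base = partition_id * 100 + i * 3
        yield [base, base + 1, base + 2]  # a list stands in for a RecordBatch


async def keep_even(batches):
    """A filter operator: consumes a stream of batches, yields filtered batches."""
    async for batch in batches:
        yield [v for v in batch if v % 2 == 0]


async def run_partition(partition_id):
    """Drive one partition's operator pipeline to completion and sum its values."""
    total = 0
    async for batch in keep_even(scan(partition_id, num_batches=2)):
        total += sum(batch)
    return total


async def main():
    # One task per partition; the runtime interleaves them instead of
    # dedicating a thread to each one.
    return await asyncio.gather(*(run_partition(p) for p in range(3)))


partition_totals = asyncio.run(main())
print(partition_totals)
```

&lt;p&gt;Each partition is an independent task, so a partition that is waiting on I/O yields control and lets another one make progress.&lt;/p&gt;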

&lt;p&gt;In the recent v53.x and v54.x releases (early-to-mid 2026), the DataFusion community has introduced several optimizations:&lt;/p&gt;

&lt;ul&gt;
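&lt;li&gt;  &lt;strong&gt;Datetime Predicate Preimages:&lt;/strong&gt; DataFusion now optimizes queries containing datetime functions (like &lt;code&gt;date_trunc&lt;/code&gt; and &lt;code&gt;date_part&lt;/code&gt;) by computing their mathematical "preimages," meaning the set of input values that produce a given output. Instead of executing the datetime function on every row, the optimizer rewrites the filter as an equivalent range predicate on the raw column. For example, &lt;code&gt;date_trunc('month', ts) = '2026-03-01'&lt;/code&gt; can become &lt;code&gt;ts &amp;gt;= '2026-03-01' AND ts &amp;lt; '2026-04-01'&lt;/code&gt;, which can be checked against partition bounds to enable partition pruning.&lt;/li&gt;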
&lt;li&gt;  &lt;strong&gt;Sort Pushdown Phase 2:&lt;/strong&gt; The engine now sorts file groups by physical statistics before executing sort operators. If a set of Parquet files contains non-overlapping sorted ranges, DataFusion skips the global sort merge step, reducing planning and CPU execution times.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Null-Aware Anti-Joins:&lt;/strong&gt; Execution of null-aware anti-joins, which frequently arise from SQL queries containing &lt;code&gt;NOT IN&lt;/code&gt; clauses, has been optimized.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Variant Type Integration:&lt;/strong&gt; The planner has introduced initial support for the binary &lt;code&gt;VARIANT&lt;/code&gt; format, laying the groundwork for format-agnostic semi-structured data querying.&lt;/li&gt;
&lt;/ul&gt;
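&lt;p&gt;The preimage idea can be illustrated in a few lines of plain Python. This is a simplified sketch rather than DataFusion's implementation: it assumes daily partitions keyed by date, handles only month truncation, and the helper names &lt;code&gt;month_preimage&lt;/code&gt; and &lt;code&gt;prune_partitions&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from datetime import date


def month_preimage(target):
    """Return the half-open [lo, hi) range of dates d where
    date_trunc('month', d) == target. Assumes target is a month start."""
    if target.day != 1:
        raise ValueError("target must be the first day of a month")
    # First day of the following month, rolling the year over after December.
    hi = date(target.year + target.month // 12, target.month % 12 + 1, 1)
    return target, hi


def date_trunc_month(d):
    """Reference implementation of date_trunc('month', d)."""
    return d.replace(day=1)


def prune_partitions(partition_days, target):
    """Keep only the daily partitions that fall inside the preimage range."""
    lo, hi = month_preimage(target)
    window = range(lo.toordinal(), hi.toordinal())  # half-open, O(1) membership
    return [d for d in partition_days if d.toordinal() in window]


# Hypothetical daily partitions, e.g. directories named event_date=YYYY-MM-DD.
partitions = [date(2026, 2, 28), date(2026, 3, 1), date(2026, 3, 31), date(2026, 4, 1)]
target = date(2026, 3, 1)

pruned = prune_partitions(partitions, target)
row_by_row = [d for d in partitions if date_trunc_month(d) == target]
print(pruned == row_by_row, pruned)
```

&lt;p&gt;Both approaches select the same partitions, but the range check only needs each partition's bounds, so it can run before any data is read.&lt;/p&gt;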

&lt;p&gt;The following Rust code snippet demonstrates how to initialize a DataFusion context, register an in-memory Arrow table, and execute a query programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;datafusion&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;prelude&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;datafusion&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;arrow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;record_batch&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;RecordBatch&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;datafusion&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;arrow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;DataType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;datafusion&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;arrow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;array&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Int32Array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringArray&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;datafusion&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create a local execution context&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SessionContext&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Define a simple schema&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nn"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;DataType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nn"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;DataType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Utf8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]));&lt;/span&gt;

    &lt;span class="c1"&gt;// Create Arrow arrays&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;id_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Int32Array&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;name_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;StringArray&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Charlie"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"David"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Eve"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Build the record batch&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;RecordBatch&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;try_new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id_array&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name_array&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Register the record batch as an in-memory table&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="nf"&gt;.register_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute SQL query&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="nf"&gt;.sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT name FROM users WHERE id &amp;gt; 2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute the query and print the result rows&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="nf"&gt;.show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This library-first model makes DataFusion the preferred choice for teams building specialized, high-performance data systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltefqt6kfo71i6ejyg31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltefqt6kfo71i6ejyg31.png" alt="DataFusion extensible Rust architecture showing SQL/DataFrame inputs compiled into physical plans running on Arrow memory, with pluggable catalogs and custom execution nodes" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vectorized DataFrames: Polars Eager &amp;amp; Lazy Pipelines
&lt;/h2&gt;

&lt;p&gt;For developers working in Python, Rust, or JavaScript, DataFrames are the preferred API for data manipulation. While Pandas has been the standard in Python for a decade, it is single-threaded, has a high memory footprint (often requiring 5–10x the dataset size in RAM), and does not support query optimization.&lt;/p&gt;

&lt;p&gt;Polars is a Rust-native, Arrow-backed DataFrame library designed to replace Pandas. It is optimized for multi-core execution, utilizing a custom work-stealing CPU scheduler that distributes execution chunks across available cores.&lt;/p&gt;

&lt;p&gt;Polars offers two execution modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Eager API:&lt;/strong&gt; Executes operations immediately, step-by-step, mimicking Pandas' behavior. This mode is useful for interactive debugging in Jupyter Notebooks.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Lazy API:&lt;/strong&gt; Builds a logical Directed Acyclic Graph (DAG) representing the pipeline. When you call &lt;code&gt;.collect()&lt;/code&gt;, Polars passes the DAG through a query optimizer. The optimizer applies several rules:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Projection Pushdown:&lt;/strong&gt; Only reads the columns explicitly referenced in the query.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Predicate Pushdown:&lt;/strong&gt; Moves filter operations as close to the storage layer as possible (pushing them down into the Parquet reader).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Common Subexpression Elimination:&lt;/strong&gt; Identifies duplicate calculations and executes them once.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Eager: Load File (All Columns) ──► Filter Rows ──► Select Columns
Lazy:  Query Planner ──► Push Filter &amp;amp; Select Into File Reader ──► Load File (Filtered &amp;amp; Pruned)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
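&lt;p&gt;The difference between the two paths can be sketched without Polars at all. The toy &lt;code&gt;LazyCsv&lt;/code&gt; class below is hypothetical and far simpler than a real optimizer: it records a filter and a projection, then applies both inside the reader loop at &lt;code&gt;collect()&lt;/code&gt;, and it counts how many cells each approach materializes.&lt;/p&gt;

```python
import csv
import io

# Hypothetical CSV input; a real pipeline would scan files on disk.
RAW = "host,metric,value\na,cpu,0.5\nb,mem,0.7\na,cpu,0.9\n"


class LazyCsv:
    """A toy lazy frame: records operations, then plans them at collect()."""

    def __init__(self, text):
        self.text = text
        self.predicate = None  # (column, expected value)
        self.columns = None    # projected column names

    def filter_eq(self, column, value):
        self.predicate = (column, value)
        return self

    def select(self, *columns):
        self.columns = list(columns)
        return self

    def collect(self):
        """Apply the recorded filter and projection inside the reader loop."""
        out = []
        cells_materialized = 0
        for row in csv.DictReader(io.StringIO(self.text)):
            if self.predicate is not None:
                column, value = self.predicate
                if row[column] != value:
                    continue  # predicate pushdown: drop the row at the source
            wanted = self.columns or list(row)
            out.append({name: row[name] for name in wanted})  # projection pushdown
            cells_materialized += len(wanted)
        return out, cells_materialized


def eager(text):
    """Eager order of operations: load everything, then filter, then select."""
    loaded = list(csv.DictReader(io.StringIO(text)))
    cells_materialized = sum(len(row) for row in loaded)
    filtered = [row for row in loaded if row["metric"] == "cpu"]
    return [{"value": row["value"]} for row in filtered], cells_materialized


lazy_rows, lazy_cells = LazyCsv(RAW).filter_eq("metric", "cpu").select("value").collect()
eager_rows, eager_cells = eager(RAW)
print(lazy_rows == eager_rows, lazy_cells, eager_cells)
```

&lt;p&gt;Both paths return the same rows, but the lazy path keeps far fewer cells in memory. With a columnar format such as Parquet, the reader can go further and skip the unneeded bytes entirely, which a row-oriented CSV parser cannot do.&lt;/p&gt;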



&lt;p&gt;In 2026, the Polars team officially stabilized its streaming execution engine. This engine allows out-of-core DataFrame execution on datasets that exceed physical memory limits. The streaming engine now supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Streaming Merge and AsOf Joins:&lt;/strong&gt; Useful for temporal alignments (such as joining financial tick data or IoT sensor metrics).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Streaming Aggregations:&lt;/strong&gt; Complex statistical calculations (including skew, kurtosis, and entropy) can now run in streaming mode.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Direct Cloud Sinks:&lt;/strong&gt; Polars can stream data directly back to storage formats like Delta Lake (&lt;code&gt;sink_delta&lt;/code&gt;) and Apache Iceberg (&lt;code&gt;sink_iceberg&lt;/code&gt;) without materializing the intermediate tables.&lt;/li&gt;
&lt;/ul&gt;
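&lt;p&gt;The core idea behind a streaming aggregation is that only a small running state is kept per group while batches flow through. The sketch below shows that pattern in plain Python for a grouped mean. It is an illustration of the technique, not Polars internals, and the names &lt;code&gt;batches&lt;/code&gt; and &lt;code&gt;streaming_mean&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def batches(rows, batch_size):
    """Yield fixed-size chunks so the aggregation sees one chunk at a time."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def streaming_mean(batch_iter):
    """Fold each batch into per-key [sum, count] state, then finalize."""
    state = defaultdict(lambda: [0.0, 0])
    for batch in batch_iter:
        for key, value in batch:
            acc = state[key]
            acc[0] += value
            acc[1] += 1
    return {key: total / count for key, (total, count) in state.items()}


# Hypothetical (host_id, cpu_percentage) rows. They live in a list here for
# brevity; a real source would produce batches from a file scan.
rows = [("a", 10.0), ("b", 30.0), ("a", 20.0), ("b", 50.0), ("a", 30.0)]
means = streaming_mean(batches(rows, batch_size=2))
print(means)
```

&lt;p&gt;Memory use grows with the number of distinct groups rather than the number of rows, which is what lets this style of aggregation handle inputs larger than RAM.&lt;/p&gt;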

&lt;p&gt;To enable the streaming engine, developers configure Polars to use the streaming execution path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="c1"&gt;# Enable streaming engine affinity globally
&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_engine_affinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define a lazy pipeline that scans a folder of CSV files
&lt;/span&gt;&lt;span class="n"&gt;lazy_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/raw_metrics/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_utilization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean_cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;skew&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skew_cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Uses new streaming aggregations
&lt;/span&gt;    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean_cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Execute the query out-of-core using the streaming engine
# This will process files in batches, avoiding Out-Of-Memory (OOM) crashes
&lt;/span&gt;&lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lazy_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polars' combination of an expressive DataFrame API, lazy query optimization, and stabilized streaming makes it a powerful engine for Python and Rust developers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6cqv6fdb7xbozlm3neg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6cqv6fdb7xbozlm3neg.png" alt="Polars query planning diagram showing Eager sequential execution vs Lazy DAG optimization pathways with projection and predicate pushdowns" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparative Analysis: Evaluating Single-Node Engines
&lt;/h2&gt;

&lt;p&gt;Choosing the right tool requires evaluating their architectural differences and primary API surfaces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvo9viskw5bsyseavh30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvo9viskw5bsyseavh30.png" alt="Comparative Analysis: Evaluating Single-Node Engines" width="632" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Tradeoffs to Consider
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;API Choice:&lt;/strong&gt; If your team writes standard SQL, DuckDB is the logical starting point. If your team prefers building pipelines programmatically, Polars' composable expression API is often a better fit than SQL strings.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Extensibility vs. Out-of-the-Box Utility:&lt;/strong&gt; DuckDB and Polars are complete, user-facing tools. DataFusion is an engine framework. You use DataFusion if you are building a custom database or need to modify how the physical query execution layer functions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Memory Footprint:&lt;/strong&gt; DataFusion and Polars generally maintain a lower memory footprint than DuckDB for in-memory operations due to Rust's memory management model and direct mapping to Arrow structures. However, DuckDB's buffer manager is more mature for highly complex queries that require massive disk spilling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Zero-JVM Spark: High-Performance Pipelines with LakeSail
&lt;/h2&gt;

&lt;p&gt;For teams with existing data codebases, the primary blocker to adopting single-node tools is the legacy API footprint. Many organizations have thousands of lines of Apache Spark code written in PySpark. Rewriting these pipelines to DuckDB SQL or Polars DataFrames is expensive and introduces validation risks.&lt;/p&gt;

&lt;p&gt;Historically, running PySpark locally required spinning up a local Spark cluster. This cluster runs on the Java Virtual Machine (JVM), which introduces significant configuration complexity and memory overhead. A default local Spark session can easily consume 4 GB of RAM just to start, even when processing a 10 MB CSV file. &lt;/p&gt;

&lt;p&gt;Furthermore, PySpark drives the JVM through a Py4J gateway, and Python User-Defined Functions (UDFs) run in separate Python worker processes. When your PySpark code calls a Python UDF, each batch of rows must be serialized, sent from the JVM executor to a Python worker process, processed, serialized again, and sent back to the JVM. This cross-process serialization tax makes Python UDF execution in Spark slow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional PySpark UDF Path:
[JVM Executor] ──(Serialize)──► [Python Worker Process] ──► [Run UDF] ──(Serialize)──► [JVM Executor]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LakeSail&lt;/strong&gt; (specifically the open-source &lt;strong&gt;Sail&lt;/strong&gt; engine) solves this constraint. Sail is a Rust-native, JVM-free compute engine designed as a drop-in replacement for Apache Spark. It implements the Spark Connect protocol, allowing existing PySpark and Spark SQL applications to run unmodified by connecting to a Sail server over gRPC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LakeSail PySpark Connect Path:
[PySpark Session] ──(Spark Connect gRPC Logical Plan)──► [LakeSail Rust Server] ──► [DataFusion Physical Execution]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, Sail replaces Spark's JVM-based Catalyst optimizer and Tungsten execution engine with Apache DataFusion and Apache Arrow. This architecture provides several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Zero JVM Overhead:&lt;/strong&gt; Sail starts in milliseconds and has a negligible idle memory footprint. You can run Spark code on small single-core VMs or local laptops.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Zero-Copy Python UDF Execution:&lt;/strong&gt; Sail embeds a Python interpreter directly into its Rust binary using PyO3. When executing a Python UDF, Sail passes pointers to the Arrow memory buffers directly to the Python interpreter. The UDF executes in-process without serialization, eliminating the cross-process serialization bottleneck.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Native Open Formats:&lt;/strong&gt; Sail includes native Rust-based support for Delta Lake, Apache Iceberg, and Parquet, integrating directly with AWS Glue, Unity Catalog, and Polaris REST catalogs.&lt;/li&gt;
&lt;/ul&gt;
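&lt;p&gt;The contrast between a serialized hand-off and a shared buffer can be demonstrated with the Python standard library alone. This sketch is an analogy, not Sail's actual mechanism: an &lt;code&gt;array&lt;/code&gt; stands in for an Arrow buffer, &lt;code&gt;pickle&lt;/code&gt; stands in for cross-process serialization, and a &lt;code&gt;memoryview&lt;/code&gt; stands in for passing a pointer in-process.&lt;/p&gt;

```python
import pickle
from array import array

# Hypothetical column of order amounts stored in one contiguous buffer.
amounts = array("d", [100.0, 250.0, 80.0])

# Cross-process style: the column is serialized, shipped, and deserialized,
# so the receiver works on a separate copy of the data.
payload = pickle.dumps(amounts)
copied = pickle.loads(payload)
copied[0] = -1.0  # changes the copy only

# In-process style: a memoryview exposes the same bytes without copying them.
view = memoryview(amounts)
view[1] = 300.0  # writes through to the original buffer

print(amounts.tolist(), copied.tolist(), len(payload))
```

&lt;p&gt;The write through the view shows up in the original array, while the write to the unpickled copy does not. The serialized path also pays for encoding and decoding every byte on each hand-off.&lt;/p&gt;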

&lt;p&gt;To run your PySpark pipelines against a local Sail session, you install the packages and point the session builder to the local Sail gRPC port:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install pysail and PySpark client supporting Spark Connect&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pysail "pyspark[connect]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the Sail server from your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start local Sail gRPC server on port 50051&lt;/span&gt;
sail spark server &lt;span class="nt"&gt;--port&lt;/span&gt; 50051
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your Python code, connect the &lt;code&gt;SparkSession&lt;/code&gt; to the local Sail server using the standard remote connection string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to the local Sail Rust-native server over Spark Connect protocol
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://localhost:50051&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Load a local Parquet dataset using standard Spark DataFrame API
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/raw_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define a standard Python UDF
&lt;/span&gt;&lt;span class="nd"&gt;@udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returnType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# This runs in-process via Sail's PyO3 integration
&lt;/span&gt;    &lt;span class="c1"&gt;# Zero serialization tax is paid between Rust and Python
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Execute transformations and show results
&lt;/span&gt;&lt;span class="n"&gt;processed_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;calculate_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="n"&gt;processed_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By keeping the Spark API surface while replacing the execution engine, LakeSail allows teams to modernize their legacy PySpark pipelines and run them on single nodes without the overhead of a JVM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl6fx06iozk30l9votyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl6fx06iozk30l9votyv.png" alt="LakeSail Spark Connect architecture showing PySpark client communicating over gRPC to a Rust-native Spark Connect server with DataFusion and PyO3 embedded UDF zero-copy memory buffers" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Threshold of Scale: When Does Single-Node Break?
&lt;/h2&gt;

&lt;p&gt;While single-node data engineering has expanded the scale of data that can be processed on a single machine, it is not a silver bullet. At a certain point, physical resource constraints make single-node architectures impractical.&lt;/p&gt;

&lt;p&gt;The first bottleneck is I/O. During out-of-core execution, spilling data to disk shifts the bottleneck from memory capacity to disk read/write bandwidth. Even on fast NVMe SSDs, writing and reading hundreds of gigabytes of intermediate join tables or sort buffers adds latency. If a query spends more time reading and writing temporary blocks to disk than it spends on computation, it is I/O-bound.&lt;/p&gt;

&lt;p&gt;The second bottleneck is query planning and CPU execution scaling. If your query must scan multiple terabytes of data, even a vectorized engine running on 64 cores will take minutes to complete the scan. If your business SLAs require sub-second or low-second query latencies, you need to distribute the scanning and processing work across multiple machines in parallel.&lt;/p&gt;

&lt;p&gt;The third bottleneck is organizational concurrency. If a single VM hosts your analytical database, and hundreds of analysts or BI dashboards query it simultaneously, the CPU cores will experience thread starvation, and lock contention will slow execution times for all users.&lt;/p&gt;
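&lt;p&gt;The scan-time argument above comes down to simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a benchmark: the 4 TB data size, the 10 GB/s per-node throughput, and the assumption of perfectly linear scale-out are all illustrative figures, and &lt;code&gt;scan_seconds&lt;/code&gt; and &lt;code&gt;nodes_for_sla&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
import math


def scan_seconds(data_bytes: float, bytes_per_second: float) -> float:
    """Time to read data_bytes once at a sustained aggregate throughput."""
    return data_bytes / bytes_per_second


def nodes_for_sla(single_node_seconds: float, sla_seconds: float) -> int:
    """Nodes needed to meet the SLA, assuming perfectly linear scale-out."""
    return max(1, math.ceil(single_node_seconds / sla_seconds))


# Assumed figures for illustration only: a 4 TB scan at 10 GB/s per node.
single_node = scan_seconds(4e12, 10e9)
print(single_node)                      # 400.0 seconds, i.e. several minutes
print(nodes_for_sla(single_node, 2.0))  # 200 nodes for a 2-second target
```

&lt;p&gt;Real systems rarely scale linearly and usually prune most of the data before scanning it, so treat the result as a rough upper bound on how much parallelism a latency target might demand.&lt;/p&gt;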

&lt;p&gt;To guide your architectural transitions, use the following operational decision framework:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flspsyii6c0iewnemv6ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flspsyii6c0iewnemv6ay.png" alt="To guide your architectural transitions, use the following operational decision framework" width="631" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexojeginv47p3of0hu9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexojeginv47p3of0hu9v.png" alt="Performance-cost threshold graph showing single-node vs MPP execution efficiency zones based on data scale" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The MPP Landscape: Scaling to Spark, Dremio, Bauplan, SpiceAI, and MotherDuck
&lt;/h2&gt;

&lt;p&gt;When your data scale, latency requirements, or concurrency needs exceed single-node limits, you must transition to a distributed MPP (Massively Parallel Processing) architecture. The modern MPP landscape offers several pathways, depending on your workflow patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  MotherDuck (Dual Execution)
&lt;/h3&gt;

&lt;p&gt;For teams who want to scale their DuckDB workloads to the cloud without managing infrastructure, MotherDuck provides a serverless platform built on DuckDB. &lt;/p&gt;

&lt;p&gt;MotherDuck's core architectural pattern is &lt;strong&gt;Dual Execution&lt;/strong&gt; (formerly hybrid execution). When you submit a query, MotherDuck's query planner evaluates the locations of the datasets. It splits the query plan: executing parts of the query locally on your laptop CPU using local cached data, and executing other parts on MotherDuck's cloud compute nodes (for cloud-hosted Parquet or Iceberg tables). The engine joins these streams dynamically using specialized "bridge" operators.&lt;/p&gt;

&lt;p&gt;In early 2026, MotherDuck added a native &lt;strong&gt;PostgreSQL wire protocol endpoint&lt;/strong&gt;. This allows BI tools and legacy applications to connect directly to MotherDuck using standard PostgreSQL drivers, eliminating the need to install the DuckDB runtime on the client machine. Additionally, MotherDuck features &lt;strong&gt;Pulse (serverless)&lt;/strong&gt; billing with one-second increments and &lt;strong&gt;DuckLake&lt;/strong&gt; integration for scaling storage to the petabyte range.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vdczwtxl6ljqepbpwk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vdczwtxl6ljqepbpwk1.png" alt="MotherDuck Dual Execution model showing how queries are split by a Hybrid Planner between local laptop CPUs and MotherDuck Cloud Engines" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bauplan (Serverless Python Pipelines)
&lt;/h3&gt;

&lt;p&gt;For data engineers building pipeline workflows on Apache Iceberg, Bauplan provides a serverless, "zero-infrastructure" execution engine. &lt;/p&gt;

&lt;p&gt;Instead of managing Spark or Kubernetes clusters to run scheduled data transformations, you define your pipeline steps as standard Python or SQL functions. Bauplan spins up stateless, ephemeral compute on-demand to execute the code and shuts down immediately after, utilizing a pay-per-invocation model.&lt;/p&gt;

&lt;p&gt;Bauplan integrates Apache Iceberg with Project Nessie, providing a "Git-for-data" experience. Developers and AI agents can create isolated branches of the lakehouse, run experimental Python pipelines to verify changes, and merge the updates atomically back into production without risking data corruption or paying for idle staging compute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spice.ai (Federated Query Acceleration)
&lt;/h3&gt;

&lt;p&gt;Spice.ai (SpiceAI) targets the data access layer for high-performance applications and AI agents. It functions as a federated query runtime that accelerates slow data queries by materializing "hot" data sets locally.&lt;/p&gt;

&lt;p&gt;Spice.ai implements a tiered caching model. It caches query results in-memory and caches active working sets of data in high-performance local engines like DuckDB or Cayenne (a native columnar engine). &lt;/p&gt;

&lt;p&gt;In its recent v2.0 updates, Spice.ai introduced a &lt;strong&gt;prefix-aware list-files cache&lt;/strong&gt; that speeds up data lake scans, a &lt;strong&gt;statistics cache&lt;/strong&gt; for file metadata, and native Change Data Capture (CDC) syncing that streams updates from databases (like PostgreSQL WAL streams) directly into the local acceleration cache. This keeps the local cached tables updated in real-time without requiring complex Kafka or Debezium setups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dremio (Distributed MPP Lakehouse Platform)
&lt;/h3&gt;

&lt;p&gt;For enterprise-scale BI, multi-source data federation, and semantic layer management, Dremio serves as the central engine of the lakehouse.&lt;/p&gt;

&lt;p&gt;Dremio is built from the ground up on Apache Arrow, which removes the serialization tax inside the engine. Its physical execution plan operates directly on Arrow columnar buffers and streams results to clients (such as Python scripts or BI tools) over Arrow Flight, so the data stays in columnar form from scan to client.&lt;/p&gt;

&lt;p&gt;Dremio achieves sub-second performance on massive cloud data lakes through three architectural layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Columnar Cloud Cache (C3):&lt;/strong&gt; Automatically caches data blocks from object storage (like AWS S3 or Azure ADLS) onto local NVMe drives at execution nodes, turning remote cloud I/O into local disk read speeds.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reflections:&lt;/strong&gt; Dremio’s query planner automatically and transparently substitutes physically optimized, pre-computed Iceberg materializations to accelerate user queries. As of Dremio v26, Reflections store data exclusively in Iceberg format, deprecating legacy formats to streamline the storage path. Dremio's &lt;strong&gt;Autonomous Reflections&lt;/strong&gt; use AI to observe query patterns over a rolling 7-day window, automatically creating, updating, and dropping Reflections to maintain optimal dashboard performance without manual administration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Open Catalog (Powered by Apache Polaris):&lt;/strong&gt; Dremio's built-in catalog is built on Apache Polaris, which graduated to a top-level Apache project in 2026. The Open Catalog implements the Apache Iceberg REST specification, allowing other engines (like Spark or Flink) to query the same tables securely. It provides Fine-Grained Access Control (FGAC) including column-masking and row-level filtering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio’s &lt;strong&gt;AI Semantic Layer&lt;/strong&gt; allows teams to define virtual datasets (views) once and reuse them across all BI and AI applications. This layer embeds descriptions, wikis, and tags directly onto columns and datasets. The semantic layer teaches AI models the business context of your data, allowing AI agents to generate correct, governed SQL queries rather than hallucinating generic code. Dremio also embeds generative AI features to auto-generate wiki descriptions and suggest tags based on schema patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglpvzf1wd1q2vbnry16z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglpvzf1wd1q2vbnry16z.png" alt="Dremio MPP query engine architecture showing Columnar Cloud Cache on NVMe, Iceberg-based Autonomous Reflections, Open Catalog powered by Polaris, and Arrow Flight client streaming" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Selection Framework and Conclusion
&lt;/h2&gt;

&lt;p&gt;Modern data engineering is no longer about choosing between a local script and a massive cluster. It is about matching your toolchain to your data volume, latency SLAs, and organizational needs. &lt;/p&gt;

&lt;p&gt;To guide your selection, follow this decision tree:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Is your workload running locally or on a single node?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;If you prefer writing SQL for analytical queries:&lt;/em&gt; Use &lt;strong&gt;DuckDB&lt;/strong&gt;. It requires zero configuration and handles larger-than-memory data via out-of-core spilling.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;If you are writing procedural Python or Rust DataFrame pipelines:&lt;/em&gt; Use &lt;strong&gt;Polars&lt;/strong&gt;. Its lazy optimizer and stabilized streaming engine provide rapid execution.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;If you have legacy PySpark or Spark SQL code but want to avoid JVM overhead:&lt;/em&gt; Use &lt;strong&gt;LakeSail&lt;/strong&gt;. It executes Spark Connect gRPC logical plans natively in Rust.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;If you are building a custom query engine or analytical tool:&lt;/em&gt; Use &lt;strong&gt;Apache Arrow DataFusion&lt;/strong&gt; as your modular compiler framework.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Does your workload exceed single-node capabilities (multi-TB scale, high concurrency, or cross-source BI)?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;If you want a serverless, hybrid extension of your DuckDB SQL code:&lt;/em&gt; Use &lt;strong&gt;MotherDuck&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;If you need to build serverless Python pipelines directly on Iceberg with Git-like version control:&lt;/em&gt; Use &lt;strong&gt;Bauplan&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;If you need to cache and accelerate federated data for local AI/RAG applications:&lt;/em&gt; Use &lt;strong&gt;Spice.ai&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;If you need enterprise-scale BI, semantic governance, multi-source federation, and sub-second SQL queries on Iceberg:&lt;/em&gt; Use &lt;strong&gt;Dremio&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
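&lt;p&gt;The decision tree above can also be written as a small lookup, which is handy if you want to encode it in an internal tool or a design-review checklist. This is only a sketch: the &lt;code&gt;need&lt;/code&gt; labels and the &lt;code&gt;recommend_engine&lt;/code&gt; function are hypothetical names chosen for illustration, and the mapping simply mirrors the bullets above.&lt;/p&gt;

```python
# Hypothetical labels for the questions in the decision tree above.
SINGLE_NODE = {
    "sql": "DuckDB",
    "dataframe": "Polars",
    "pyspark": "LakeSail",
    "custom_engine": "Apache Arrow DataFusion",
}

DISTRIBUTED = {
    "duckdb_serverless": "MotherDuck",
    "iceberg_python_pipelines": "Bauplan",
    "federated_acceleration": "Spice.ai",
    "enterprise_bi": "Dremio",
}


def recommend_engine(exceeds_single_node: bool, need: str) -> str:
    """Return the engine suggested for a workload, following the two branches."""
    table = DISTRIBUTED if exceeds_single_node else SINGLE_NODE
    if need not in table:
        raise ValueError(f"unknown need {need!r}; expected one of {sorted(table)}")
    return table[need]


print(recommend_engine(False, "pyspark"))       # LakeSail
print(recommend_engine(True, "enterprise_bi"))  # Dremio
```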

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc24rl28d4rnifp08bs2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc24rl28d4rnifp08bs2l.png" alt="Flowchart decision tree helping engineers select the correct analytical engine based on workload and scale" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Single-node data technologies have shifted the boundary of what is possible on a single machine. By utilizing Apache Arrow for zero-copy memory layouts, compilers like DataFusion, and vectorized execution engines, you can process workloads that previously required a complex distributed cluster. &lt;/p&gt;

&lt;p&gt;As you design your next data platform, start by evaluating if your workload can run on a single node. Modern columnar engines let you build, test, and run pipelines with minimal infrastructure complexity. When your data scale or organizational concurrency requires a distributed architecture, transition incrementally using open standards like Apache Iceberg and Apache Arrow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accelerate Your Lakehouse Skills
&lt;/h3&gt;

&lt;p&gt;To deepen your understanding of modern data architectures, consider the following next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Read Lakehouse Reference Materials:&lt;/strong&gt; Explore &lt;strong&gt;"Architecting an Apache Iceberg Lakehouse"&lt;/strong&gt; and other technical publications that cover partition tuning, catalog design, and query optimization at &lt;a href="https://books.alexmerced.com" rel="noopener noreferrer"&gt;books.alexmerced.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Build Your Own Local Pipeline:&lt;/strong&gt; Start by downloading &lt;code&gt;pysail&lt;/code&gt; or &lt;code&gt;polars&lt;/code&gt; and testing them against a local Parquet dataset. Compare the query planning time and the CPU and memory footprint against your existing frameworks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Evaluate Dremio Cloud:&lt;/strong&gt; If your local query engines are hitting limits or you need to federate data across multiple sources, deploy Dremio directly on your S3 data lake. Try Dremio Cloud free for 30 days at &lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;dremio.com/get-started&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>An In-Depth Overview of the Apache Iceberg 1.11.0 Release</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Sat, 23 May 2026 16:52:51 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/an-in-depth-overview-of-the-apache-iceberg-1110-release-4l1n</link>
      <guid>https://dev.to/alexmercedcoder/an-in-depth-overview-of-the-apache-iceberg-1110-release-4l1n</guid>
      <description>&lt;p&gt;Apache Iceberg 1.11.0 was officially released on May 19, 2026, marking a major milestone in the evolution of open data lakehouse architectures. While minor point releases often focus on small bug fixes and dependency bumps, this release introduces fundamental structural changes. The community has completed major initiatives to improve security, extend file format capabilities, and optimize query planning overhead.&lt;/p&gt;

&lt;p&gt;This release represents a convergence of two development focuses. First, it introduces structural changes to the core metadata specification to support advanced security features and lay the groundwork for future format revisions. Second, it stabilizes several feature sets in the Iceberg format specification, moving them from experimental status to fully stable defaults. &lt;/p&gt;

&lt;p&gt;This post analyzes the most critical improvements in the Apache Iceberg 1.11.0 release. We will examine the specific GitHub pull requests, explain the underlying mechanics of each feature, and review what these changes mean for data engineers and platform architects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde4t7i7o8use435fbc14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde4t7i7o8use435fbc14.png" alt="Apache Iceberg 1.11.0 release overview diagram showing Security, Catalog, Storage, and Engine pillars" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Manifest List Encryption (PR #7770, #15813)
&lt;/h2&gt;

&lt;p&gt;Security in open data lakehouses has historically focused on encrypting the actual data files stored in object storage. While file-level encryption prevents unauthorized users from reading raw Parquet or ORC data, the table metadata has remained exposed. In a default setup, anyone with read access to the storage bucket could inspect the metadata JSON, manifest lists, and manifest files. &lt;/p&gt;

&lt;p&gt;These metadata files contain sensitive structural details. An attacker scanning an unencrypted manifest list can extract file paths, column names, partitions, partition bounds, and exact null value counts. In highly regulated industries such as healthcare or financial services, this structural exposure constitutes a major data leak.&lt;/p&gt;

&lt;p&gt;To resolve this vulnerability, PR #7770, introduced by @ggershinsky, adds native encryption for manifest lists. This change works alongside follow-up improvements in PR #15813. Manifest lists can now be encrypted with AES in Galois/Counter Mode (GCM), an authenticated encryption mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metadata JSON (Contains encryption state references)
       │
       ▼
Manifest List (Encrypted via GCM Stream Cipher) ◄── Decrypted in-memory during planning
       │
       ▼
Manifest Files (Point to encrypted Parquet data files)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table encryption configuration can be defined during table creation or updated via table properties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxp4gysdzg3yropb0gyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxp4gysdzg3yropb0gyg.png" alt="The table encryption configuration can be defined during table creation or updated via table properties:" width="665" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a query engine plans a scan against an encrypted table, it performs the following sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client queries the catalog to fetch the table metadata.&lt;/li&gt;
&lt;li&gt;The catalog returns the metadata location along with the required decryption keys.&lt;/li&gt;
&lt;li&gt;The query engine reads the encrypted manifest list from object storage.&lt;/li&gt;
&lt;li&gt;Using the catalog keys, the engine decrypts the manifest list in-memory.&lt;/li&gt;
&lt;li&gt;The engine processes the decrypted partitions and statistics to prune manifest files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach ensures that the manifest list is never written to disk in plain text. It implements a model of envelope encryption: each metadata file is encrypted with a unique data encryption key (DEK), and these DEKs are encrypted using the table's master key managed by the Key Management Service (KMS). Even if an attacker gains raw access to the storage bucket, they find only encrypted bytes, protecting both the table contents and its structural metadata.&lt;/p&gt;
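&lt;p&gt;The envelope pattern described above can be sketched in a few lines. To stay within the Python standard library, this example swaps AES-GCM for a toy SHA-256 keystream plus an HMAC tag, which is &lt;strong&gt;not secure&lt;/strong&gt; and is not what Iceberg uses. Every function name here is hypothetical, and the master key would live in a KMS rather than in process memory. The point is only the key hierarchy: a fresh DEK per file, wrapped by the master key.&lt;/p&gt;

```python
import hashlib
import hmac
import secrets


def _xor_keystream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy cipher standing in for AES-GCM: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], block))
    return bytes(out)


def seal(key: bytes, plaintext: bytes) -> dict:
    """Encrypt and authenticate plaintext under key."""
    nonce = secrets.token_bytes(12)
    ciphertext = _xor_keystream(key, nonce, plaintext)
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return {"nonce": nonce, "ciphertext": ciphertext, "tag": tag}


def unseal(key: bytes, sealed: dict) -> bytes:
    """Verify the tag, then decrypt; raises ValueError on a wrong key or tampering."""
    expected = hmac.new(key, sealed["nonce"] + sealed["ciphertext"], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, sealed["tag"]):
        raise ValueError("authentication failed: wrong key or tampered bytes")
    return _xor_keystream(key, sealed["nonce"], sealed["ciphertext"])


def encrypt_manifest_list(master_key: bytes, manifest_list: bytes) -> dict:
    """Encrypt the file with a fresh DEK, then wrap the DEK with the master key."""
    dek = secrets.token_bytes(32)
    return {"wrapped_dek": seal(master_key, dek), "body": seal(dek, manifest_list)}


def decrypt_manifest_list(master_key: bytes, envelope: dict) -> bytes:
    """Unwrap the DEK with the master key, then decrypt the file in memory."""
    dek = unseal(master_key, envelope["wrapped_dek"])
    return unseal(dek, envelope["body"])
```

&lt;p&gt;Because only the wrapped DEK depends on the master key, rotating the master key means re-wrapping small key blobs rather than re-encrypting every metadata file.&lt;/p&gt;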

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca1lw7zdwh6e2n4xqj5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca1lw7zdwh6e2n4xqj5w.png" alt="Manifest list encryption sequence showing key exchange and decryption query planning" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pluggable File Format API and V4 Spec Foundations (PR #15049)
&lt;/h2&gt;

&lt;p&gt;Historically, Apache Iceberg hardcoded its support for data file formats. The core library contained format-specific code paths for Parquet, ORC, and Avro. If you wanted to query or write a table, the engine executed internal code blocks tailored to those exact structures.&lt;/p&gt;

&lt;p&gt;This hardcoded design created a major bottleneck for format innovation. If a team wanted to test a next-generation format, they had to modify the core engine codebase, extending complex switch statements and format-dependent utilities.&lt;/p&gt;

&lt;p&gt;PR #15049, introduced by @anoopj, restructures this architecture. It introduces a pluggable File Format API that decouples Iceberg core metadata management from physical storage layouts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                  Iceberg Core Engine                   │
└───────────────────────────┬────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│            File Format API Interface Layer             │
└──────┬────────────┬─────────────┬─────────────┬────────┘
       │            │             │             │
       ▼            ▼             ▼             ▼
  ┌─────────┐  ┌─────────┐   ┌─────────┐   ┌─────────┐
  │ Parquet │  │   ORC   │   │ Vortex  │   │  Lance  │
  └─────────┘  └─────────┘   └─────────┘   └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The File Format API provides a clean plugin interface. A file format is defined as a plugin that implements standard reader and writer interfaces. Iceberg core negotiates table transactions, schemas, and partition specs, while delegating the physical file access to the registered plugin.&lt;/p&gt;
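&lt;p&gt;To make the plugin idea concrete, here is a minimal registry sketch. It is written in Python for brevity, whereas the actual API in PR #15049 is Java, and every name in it (&lt;code&gt;FileFormatPlugin&lt;/code&gt;, &lt;code&gt;FormatRegistry&lt;/code&gt;, &lt;code&gt;JsonLinesFormat&lt;/code&gt;) is hypothetical. A JSON Lines format stands in for a real columnar format so the example stays self-contained.&lt;/p&gt;

```python
import json
from typing import Protocol


class FileFormatPlugin(Protocol):
    """Hypothetical contract: a format knows how to write and read rows."""
    name: str

    def write(self, rows: list) -> bytes: ...

    def read(self, payload: bytes) -> list: ...


class JsonLinesFormat:
    """Stand-in format; a real plugin would wrap Parquet, ORC, Vortex, and so on."""
    name = "jsonl"

    def write(self, rows: list) -> bytes:
        return "\n".join(json.dumps(row, sort_keys=True) for row in rows).encode("utf-8")

    def read(self, payload: bytes) -> list:
        return [json.loads(line) for line in payload.decode("utf-8").splitlines() if line]


class FormatRegistry:
    """The core looks formats up by name and never touches their internals."""

    def __init__(self) -> None:
        self._plugins = {}

    def register(self, plugin: FileFormatPlugin) -> None:
        self._plugins[plugin.name] = plugin

    def get(self, name: str) -> FileFormatPlugin:
        if name not in self._plugins:
            raise ValueError(f"no plugin registered for format {name!r}")
        return self._plugins[name]


registry = FormatRegistry()
registry.register(JsonLinesFormat())
fmt = registry.get("jsonl")
print(fmt.read(fmt.write([{"id": 1, "status": "COMPLETED"}])))
```

&lt;p&gt;Adding a new format in this model means registering one more object that satisfies the contract; the code that calls &lt;code&gt;registry.get&lt;/code&gt; does not change.&lt;/p&gt;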

&lt;p&gt;This decoupling makes it practical to support next-generation formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Vortex:&lt;/strong&gt; A general-purpose, modular format designed as a successor to Parquet. It is optimized for high-performance analytics, utilizing fixed-width columns with bitmap masks for nulls. This enables Single Instruction Multiple Data (SIMD) filtering directly on memory-mapped files without CPU decompression cycles. The community is actively using the new API to build a Vortex-backed Iceberg plugin.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lance:&lt;/strong&gt; A layout built for machine learning and AI workloads. It is optimized for high-dimensional vector search and random access to nested embeddings, implementing index structures such as Inverted File with Product Quantization (IVF-PQ) directly in the file format to enable fast query planning.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Nimble:&lt;/strong&gt; A format optimized for wide tables containing thousands of feature columns. Nimble prioritizes fast decoding over high compression ratios, opting for lightweight run-length and bit-packing compression schemes. This reduces the CPU overhead of ML training loops that consume millions of rows per second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, PR #15049 introduces the foundational Java interfaces and types for the upcoming V4 manifest specification. These changes prepare Iceberg for format-agnostic manifest storage, ensuring the metadata layer can scale to tables with millions of files without hitting Java memory overhead limits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14fqfhbw0f5j10i21fag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14fqfhbw0f5j10i21fag.png" alt="Pluggable File Format API architecture decoupling Iceberg core from format plugins" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  REST Client Protocols and Extended Headers (PR #12194)
&lt;/h2&gt;

&lt;p&gt;The REST Catalog protocol has become the standard interface for managing Iceberg tables across multiple processing engines. It isolates clients from catalog implementation details and provides a unified API for schema management, snapshot commits, and credential vending.&lt;/p&gt;

&lt;p&gt;However, as deployments scale inside large enterprises, catalogs need to process custom client context. For example, a platform team might want to track which business unit submitted a query, pass custom security tokens, or inject correlation IDs for distributed tracing. In previous versions, the standard &lt;code&gt;RESTClient&lt;/code&gt; did not allow clients to send custom HTTP headers.&lt;/p&gt;

&lt;p&gt;PR #12194, written by @gaborkaszab, solves this constraint by extending header support inside the &lt;code&gt;RESTClient&lt;/code&gt; implementations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────┐
│      Iceberg REST Client       │
│  (Spark, Flink, Trino, etc.)   │
└───────────────┬────────────────┘
                │
                │  POST /v1/namespaces/db/tables/events
                │  Custom-Headers:
                │    - X-Trace-Id: trace-98421
                │    - X-Tenant-Id: finance-billing
                │
                ▼
┌────────────────────────────────┐
│      REST Catalog Server       │
│  (Parses headers for auditing)  │
└────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this update, client engines can configure and inject custom headers into every REST call. The client-server handshake follows this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client initializes the REST catalog using the properties map.&lt;/li&gt;
&lt;li&gt;The client specifies static custom headers using the prefix &lt;code&gt;header.custom.&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;   &lt;span class="py"&gt;header.custom.X-Tenant-Id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;finance-billing&lt;/span&gt;
   &lt;span class="py"&gt;header.custom.X-Trace-Id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;system-trace-99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;During request execution, the &lt;code&gt;RESTClient&lt;/code&gt; intercepts the HTTP call and injects these custom headers.&lt;/li&gt;
&lt;li&gt;The REST catalog server processes the headers to apply dynamic authorization, audit logging, or request routing.&lt;/li&gt;
&lt;/ol&gt;
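&lt;p&gt;The property-to-header step in this sequence can be sketched as a prefix filter over the catalog properties map. This is an illustration rather than the Iceberg client code: the function names are hypothetical, the catalog URI is a placeholder, and the &lt;code&gt;header.custom.&lt;/code&gt; prefix is taken from the text above.&lt;/p&gt;

```python
def extract_custom_headers(properties: dict, prefix: str = "header.custom.") -> dict:
    """Keep only properties that start with the prefix and strip it from the key."""
    return {
        key[len(prefix):]: value
        for key, value in properties.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }


def build_request_headers(properties: dict, base_headers: dict = None) -> dict:
    """Merge the configured custom headers into the headers sent on each request."""
    headers = dict(base_headers or {})
    headers.update(extract_custom_headers(properties))
    return headers


catalog_properties = {
    "uri": "https://catalog.example.com",  # placeholder endpoint
    "header.custom.X-Tenant-Id": "finance-billing",
    "header.custom.X-Trace-Id": "system-trace-99",
}

print(build_request_headers(catalog_properties, {"Accept": "application/json"}))
```

&lt;p&gt;Headers configured this way are static for the lifetime of the catalog client, so per-request values such as a fresh trace ID would need a different mechanism.&lt;/p&gt;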

&lt;p&gt;This change enables the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Auditing and Governance:&lt;/strong&gt; Engines can pass tenant identifiers or user profiles in the HTTP headers, allowing the REST catalog server to log catalog operations with full user context.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Distributed Tracing:&lt;/strong&gt; Tracing headers such as W3C Trace Context can propagate from client engines through the catalog server, providing end-to-end trace visibility for query planning operations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dynamic Authorization:&lt;/strong&gt; Clients can send custom authorization tokens that the REST catalog server evaluates dynamically to enforce fine-grained access control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The properties are configured during catalog initialization using the standard configuration map, making it simple to roll out headers across existing query platforms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gvxlmz7rvrmpnneqc2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gvxlmz7rvrmpnneqc2a.png" alt="Extended header propagation between Iceberg client and REST Catalog server" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overwrite-Aware Table Registration (PR #15525)
&lt;/h2&gt;

&lt;p&gt;In multi-tenant data platforms, multiple engines frequently access and modify the same table metadata. When registering a new table or importing an existing table state into the catalog, concurrency conflicts can occur.&lt;/p&gt;

&lt;p&gt;If two separate processes attempt to register or overwrite a table reference at the same location simultaneously, a naive catalog might register the second request, silently overwriting the first. This creates data inconsistencies where the catalog points to an outdated or incorrect &lt;code&gt;metadata.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;PR #15525, written by @sririshindra, adds overwrite-aware table registration to the catalog API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Writer 1: Commits events_v1 ────► [Catalog Table Pointer] ◄──── Writer 2: Commits events_v2
                                            │
                                            ├────────► If conflict: Catalog rejects Writer 2
                                            └────────► Prevents silent metadata overwrites
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This implementation leverages Optimistic Concurrency Control (OCC) at the catalog level. The conflict resolution sequence proceeds as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writer A and Writer B both read the current table state pointing to snapshot v1.&lt;/li&gt;
&lt;li&gt;Writer A writes new data files, generating metadata version &lt;code&gt;metadata_v2.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writer B writes new data files in parallel, generating metadata version &lt;code&gt;metadata_v3.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writer A calls the catalog's &lt;code&gt;/v1/namespaces/db/tables/events/register&lt;/code&gt; endpoint, stating that the expected base location is &lt;code&gt;metadata_v1.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The catalog verifies the base matches, registers the new pointer to &lt;code&gt;metadata_v2.json&lt;/code&gt;, and updates the table version.&lt;/li&gt;
&lt;li&gt;Writer B attempts to register its state, listing &lt;code&gt;metadata_v1.json&lt;/code&gt; as its expected base.&lt;/li&gt;
&lt;li&gt;The catalog detects that the current pointer is now &lt;code&gt;metadata_v2.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The catalog rejects Writer B's request, returning an HTTP 409 Conflict. Writer B must re-read the updated table state, resolve any overlapping partition commits, and retry the registration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This validation ensures that table registration is safe and prevents silent metadata overwrites in highly active environments.&lt;/p&gt;
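&lt;p&gt;The sequence above is a compare-and-swap on the table pointer. Here is a minimal in-memory sketch of that idea. &lt;code&gt;ToyCatalog&lt;/code&gt;, &lt;code&gt;ConflictError&lt;/code&gt;, and the status codes assigned at the end are hypothetical names for illustration and do not reflect the real catalog implementation.&lt;/p&gt;

```python
import threading


class ConflictError(Exception):
    """Raised when the expected base no longer matches the catalog pointer."""


class ToyCatalog:
    """Hypothetical in-memory catalog holding one metadata pointer per table."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def register(self, table, expected_base, new_metadata):
        # Compare-and-swap: the check and the update happen under one lock,
        # so two writers can never both pass the check for the same base.
        with self._lock:
            current = self._pointers.get(table)
            if current != expected_base:
                raise ConflictError(
                    f"{table}: expected {expected_base}, found {current}"
                )
            self._pointers[table] = new_metadata

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)


catalog = ToyCatalog()
catalog.register("db.events", None, "metadata_v1.json")

# Writer A commits first and wins.
catalog.register("db.events", "metadata_v1.json", "metadata_v2.json")

# Writer B still believes the base is v1, so the catalog rejects it.
try:
    catalog.register("db.events", "metadata_v1.json", "metadata_v3.json")
    writer_b_status = 200
except ConflictError:
    writer_b_status = 409
```

&lt;p&gt;The important design point is that the comparison and the pointer update are a single atomic step. A catalog that checks first and writes later leaves a window in which both writers pass the check.&lt;/p&gt;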

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmdth3agu4sjimwe25so.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmdth3agu4sjimwe25so.png" alt="Flowchart of table registration verifying catalog overwrite state and rejecting transaction on conflicts" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deletion Vector Pruning in Snapshot Validation (PR #15653)
&lt;/h2&gt;

&lt;p&gt;One of the major highlights of the V3 format specification is the stabilization of deletion vectors. Deletion vectors improve row-level delete performance by replacing positional delete files with Roaring bitmaps. Instead of writing a new delete file for every minor update, the engine updates a binary bitmap linked directly to the data file.&lt;/p&gt;

&lt;p&gt;These deletion bitmaps are stored in the Puffin file format. You can inspect active deletion vector locations using metadata system tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deletion_vector&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'my_catalog.schema.events'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, as tables grow to hold millions of data files, validating these deletion vectors during query planning can introduce latency. During scan planning, the query engine must ensure that the deletion vectors linked in the metadata are valid and match the corresponding data files.&lt;/p&gt;

&lt;p&gt;In earlier versions, this validation was executed across the entire table snapshot during plan initialization. If you had a 50 TB table and queried a single day, the planner still spent time validating deletion vectors for the entire table.&lt;/p&gt;

&lt;p&gt;PR #15653, introduced by @anoopj, optimizes this process. It adds manifest partition pruning to deletion vector validation inside the &lt;code&gt;MergingSnapshotProducer&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query Filter: WHERE event_date = '2026-05-23'
       │
       ▼
Partition Pruning Step
       │
       ├─► Skip Partition '2026-05-22' ──► Skip Deletion Vector Validation
       │
       └─► Read Partition '2026-05-23'  ──► Run Deletion Vector Validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this change, the query planner matches the query filter predicates against partition bounds before executing deletion vector checks. If a partition is pruned out, the engine skips validating the deletion vectors for the files in that partition. This change reduces planning CPU cycles and improves scan startup times for partitioned tables.&lt;/p&gt;
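&lt;p&gt;The ordering matters more than the check itself: prune first, validate second. The sketch below models that ordering in plain Python. The &lt;code&gt;DataFile&lt;/code&gt; record and both helper functions are hypothetical stand-ins, not Iceberg classes.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    event_date: str            # partition value
    has_deletion_vector: bool


def files_to_validate(files, wanted_date):
    """Return only the files whose partition survives the filter."""
    return [f for f in files if f.event_date == wanted_date]


def validate_deletion_vectors(files):
    """Stand-in for the real check; returns the paths it looked at."""
    return [f.path for f in files if f.has_deletion_vector]


files = [
    DataFile("a.parquet", "2026-05-22", True),
    DataFile("b.parquet", "2026-05-23", True),
    DataFile("c.parquet", "2026-05-23", False),
]

# Only the surviving partition reaches the validation step.
checked = validate_deletion_vectors(files_to_validate(files, "2026-05-23"))
```

&lt;p&gt;In this toy example the file in the &lt;code&gt;2026-05-22&lt;/code&gt; partition is never inspected, which is the saving the change targets on tables with many partitions.&lt;/p&gt;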

&lt;p&gt;For a detailed look at how hidden partitioning helps the query engine perform partition pruning and reduce metadata scan sizes, read the &lt;a href="https://alexmerced.blog/blog/2026-04-29-iceberg-masterclass-05-hidden-partitioning.html" rel="noopener noreferrer"&gt;Apache Iceberg Hidden Partitioning Post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhi1nggcxxlub98nrr0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhi1nggcxxlub98nrr0w.png" alt="Diagram showing deletion vector validation skipping pruned partitions during planning" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheduled Credential Lifecycle Refresh (PR #15678, #15732, #15696)
&lt;/h2&gt;

&lt;p&gt;To harden lakehouse security, platforms avoid long-lived storage credentials. Instead, query engines authenticate with temporary tokens vended by the REST catalog or a cloud identity provider. These credentials are short-lived, often expiring after one hour.&lt;/p&gt;

&lt;p&gt;This security model creates issues for long-running operations. If a massive query runs for 90 minutes, or a streaming Flink sink runs continuously, the temporary credentials expire mid-job. When the client attempts to write new files or fetch manifests after the expiration window, the storage client throws an authentication exception, failing the job.&lt;/p&gt;

&lt;p&gt;The 1.11.0 release resolves this lifecycle problem. PR #15678 (by @danielcweeks) and PR #15732 (by @nastra) add scheduled refresh threads to the &lt;code&gt;S3FileIO&lt;/code&gt; client. A parallel change in PR #15696 (by @nastra) implements the same capability for &lt;code&gt;GCSFileIO&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query Thread (Reads/Writes Data)
       │
       ├───────► Token Expiration Approaching (e.g. at 50 minutes)
       │
Background Refresh Thread
       │
       ├───────► Send Request to Catalog ──► Fetch New Credentials
       │
       └───────► Update S3FileIO/GCSFileIO Credentials In-Memory
       │
Query Thread (Continues without interruption)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The credential refresh system runs a background daemon thread that tracks token expiration times. The lifecycle is controlled by the following properties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sk87k8lgivk7ard87sh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sk87k8lgivk7ard87sh.png" alt="The credential refresh system runs a background daemon thread that tracks token expiration times" width="664" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before the active credential expires, the background thread automatically polls the catalog's &lt;code&gt;/v1/tokens&lt;/code&gt; endpoint for refreshed tokens and updates the file system client in-memory. The main query and write threads continue to run without interruption, eliminating query failures caused by expired credentials.&lt;/p&gt;
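&lt;p&gt;The refresh logic can be sketched independently of any storage client. In the example below, &lt;code&gt;RefreshingCredentials&lt;/code&gt;, &lt;code&gt;fake_fetch&lt;/code&gt;, and the 10-minute lead time are all hypothetical. A real client would call &lt;code&gt;refresh_if_due&lt;/code&gt; from a background daemon thread on a schedule; here it is called directly with explicit timestamps so the behavior is easy to follow.&lt;/p&gt;

```python
import threading


class RefreshingCredentials:
    """Holds the active token and swaps it in place before it expires."""

    def __init__(self, fetch, lead_seconds):
        self._fetch = fetch            # callable returning (token, expires_at)
        self._lead = lead_seconds      # refresh this long before expiry
        self._lock = threading.Lock()
        self._token, self._expires_at = fetch()

    def token(self):
        # Worker threads read whatever token is current.
        with self._lock:
            return self._token

    def refresh_if_due(self, now):
        """Refresh when inside the lead window; return True if refreshed."""
        with self._lock:
            due = now >= self._expires_at - self._lead
        if not due:
            return False
        # The (possibly slow) fetch happens outside the lock.
        token, expires_at = self._fetch()
        with self._lock:
            self._token, self._expires_at = token, expires_at
        return True


issued = []


def fake_fetch():
    # Hypothetical catalog call: each new token expires one hour later.
    n = len(issued) + 1
    issued.append(n)
    return f"token-{n}", n * 3600


creds = RefreshingCredentials(fake_fetch, lead_seconds=600)
early = creds.refresh_if_due(now=1800)   # 30 minutes in: nothing to do
late = creds.refresh_if_due(now=3000)    # 50 minutes in: inside the window
```

&lt;p&gt;Swapping the token under a lock while fetching outside it keeps reader threads from waiting on a network call, which is the property that lets long-running jobs continue without interruption.&lt;/p&gt;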

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazus5b4n1otfy8l8aoji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazus5b4n1otfy8l8aoji.png" alt="Sequence flow showing background thread updating AWS/GCS storage client credentials before expiration" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Spark Streaming Triggers and Z-Ordering (PR #13824, #15706)
&lt;/h2&gt;

&lt;p&gt;Apache Spark remains the primary engine for heavy write workloads and batch compaction in Iceberg tables. Version 1.11.0 includes several updates to improve Spark streaming and layout optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger.AvailableNow Support (PR #13824, #14026)
&lt;/h3&gt;

&lt;p&gt;PR #13824, introduced by @alexprosak, adds support for the &lt;code&gt;AvailableNow&lt;/code&gt; trigger in Spark Structured Streaming. This change was also backported to Spark 4.0, 3.5, and 3.4 in PR #14026.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Default Micro-Batch Trigger:
[Read Batch 1] -&amp;gt; [Write] -&amp;gt; [Wait] -&amp;gt; [Read Batch 2] -&amp;gt; [Write] -&amp;gt; (Runs indefinitely)

AvailableNow Trigger:
[Scan All Available Data] -&amp;gt; [Process Batch 1] -&amp;gt; [Process Batch 2] -&amp;gt; [Write All] -&amp;gt; [Graceful Shutdown]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Spark Structured Streaming, the default micro-batch trigger keeps the query running and starts a new batch as soon as the previous one finishes, so it holds cluster resources even when no new files are arriving. The alternative &lt;code&gt;Once&lt;/code&gt; trigger processes everything available in a single micro-batch and then shuts down. Because it ignores source rate limits, a large backlog lands in one oversized batch that can strain or fail the job.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AvailableNow&lt;/code&gt; trigger combines the benefits of both approaches. It scans the source for all available data, splits the workload into consecutive micro-batches, processes them all in a single run, and then shuts down the streaming context. This is configured in PySpark as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Configure Trigger.AvailableNow with Iceberg source and sink
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_catalog.db.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availableNow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/checkpoints/events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_catalog.db.events_compacted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This trigger configuration allows data platforms to run streaming ingestion pipelines as scheduled cron jobs, reducing cluster idle time.&lt;/p&gt;
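&lt;p&gt;The difference between the two one-shot triggers comes down to how the backlog is split. The sketch below is plain Python rather than Spark code: &lt;code&gt;plan_batches&lt;/code&gt; and the per-batch file limit are hypothetical stand-ins for a source rate limit.&lt;/p&gt;

```python
def plan_batches(pending_files, max_files_per_batch):
    """Split everything available right now into bounded micro-batches."""
    return [
        pending_files[start:start + max_files_per_batch]
        for start in range(0, len(pending_files), max_files_per_batch)
    ]


backlog = [f"file-{n}.parquet" for n in range(1, 8)]   # 7 files waiting

# AvailableNow style: several bounded batches, then stop.
available_now = plan_batches(backlog, max_files_per_batch=3)

# Once style: the whole backlog in one batch, regardless of size.
once = plan_batches(backlog, max_files_per_batch=len(backlog))
```

&lt;p&gt;Both plans cover the same seven files. The bounded plan simply keeps each batch small enough to process predictably, which is what makes a scheduled run safe after a long gap.&lt;/p&gt;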

&lt;h3&gt;
  
  
  Z-Order Column Collision Validation (PR #15706)
&lt;/h3&gt;

&lt;p&gt;PR #15706, introduced by @YanivZalach, addresses a failure mode during Z-order layout optimization. Spark uses the internal column name &lt;code&gt;ICEZVALUE&lt;/code&gt; during Z-order sorting. If a user table already contained a column named &lt;code&gt;ICEZVALUE&lt;/code&gt;, the compaction process failed or generated incorrect sort orders. &lt;/p&gt;

&lt;p&gt;The update adds strict schema validation that checks for column name collisions before running Z-order compactions, preventing data corruption.&lt;/p&gt;
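&lt;p&gt;A pre-flight check of this kind is small. The sketch below is a hypothetical version in Python, not the code from the PR. It assumes a case-insensitive comparison and also rejects Z-order columns that are missing from the table, both of which are choices made for this example.&lt;/p&gt;

```python
Z_COLUMN = "ICEZVALUE"  # internal column name cited above


def check_zorder_columns(table_columns, zorder_columns):
    """Fail fast if the schema collides with the internal Z-value column."""
    existing = {name.upper() for name in table_columns}
    if Z_COLUMN in existing:
        raise ValueError(
            f"Table already has a column named {Z_COLUMN}; "
            "rename it before running Z-order compaction"
        )
    missing = [c for c in zorder_columns if c.upper() not in existing]
    if missing:
        raise ValueError(f"Unknown Z-order columns: {missing}")


def try_check(table_columns, zorder_columns):
    try:
        check_zorder_columns(table_columns, zorder_columns)
        return "ok"
    except ValueError as err:
        return str(err)


clean = try_check(["user_id", "event_ts"], ["user_id", "event_ts"])
clash = try_check(["user_id", "ICEZVALUE"], ["user_id"])
```

&lt;p&gt;Running the check before any files are rewritten turns a confusing mid-compaction failure into an immediate, readable error.&lt;/p&gt;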

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2etiq1egduqm9535gae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2etiq1egduqm9535gae.png" alt="Comparison of continuous micro-batch streaming vs Spark AvailableNow trigger batches" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Flink Post-Commit Maintenance and Branch Compaction (PR #15566, #15672, #14148)
&lt;/h2&gt;

&lt;p&gt;Apache Flink is the standard engine for real-time streaming ingestion into Iceberg tables. Streaming ingestion has different write characteristics than batch ingestion, often writing many small files at high frequency. Iceberg 1.11.0 adds features to manage these files directly within Flink pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flink Post-Commit Maintenance (PR #15566, #15667)
&lt;/h3&gt;

&lt;p&gt;PR #15566, written by &lt;a class="mentioned-user" href="https://dev.to/mxm"&gt;@mxm&lt;/a&gt;, adds support for arbitrary post-commit maintenance tasks inside the Flink &lt;code&gt;IcebergSink&lt;/code&gt; builder. This is also backported to active Flink branches in PR #15667.&lt;/p&gt;

&lt;p&gt;During streaming ingestion, Flink commits data to the Iceberg table at every checkpoint. These frequent commits generate a large number of small data files and manifest files. With the new post-commit interface, you can attach background maintenance tasks directly to the sink:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Configure Flink sink with post-commit compaction&lt;/span&gt;
&lt;span class="nc"&gt;IcebergSink&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forRowData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataStream&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;icebergTable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tableLoader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tableLoader&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeParallelism&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;distributionMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DistributionMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;HASH&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;postCommitMaintenance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;PostCommitMaintenance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optimizeDataFiles&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;rewriteManifests&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a commit succeeds, Flink runs compaction and manifest cleaning tasks in the background, keeping the table structure optimized without requiring external scheduler jobs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Flink Stream Ingestion
       │
       ▼
[Commit Data File (Checkpoint)]
       │
       ├───────► Post-Commit Trigger
       │
       ▼
[Background Maintenance Action (RewriteDataFiles / Compaction)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Flink Branch Compaction Support (PR #15672, #15690)
&lt;/h3&gt;

&lt;p&gt;PR #15672, also written by &lt;a class="mentioned-user" href="https://dev.to/mxm"&gt;@mxm&lt;/a&gt;, adds branch support to the Flink &lt;code&gt;RewriteDataFiles&lt;/code&gt; action. &lt;/p&gt;

&lt;p&gt;Historically, Flink's background compaction actions could only run on the table's main branch. In modern architectures, engines often ingest experimental data or staging runs into separate table branches. Flink can now run file compaction directly on these non-main branches, keeping staging and experiment branches organized before they are merged back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flink Metadata Columns (PR #14148)
&lt;/h3&gt;

&lt;p&gt;PR #14148, introduced by @Guosmilesmile, exposes metadata columns to Flink readers. &lt;/p&gt;

&lt;p&gt;Flink applications can now read the &lt;code&gt;_row_id&lt;/code&gt; and &lt;code&gt;_last_updated_sequence_number&lt;/code&gt; system columns. This is useful for CDC (Change Data Capture) reconciliation pipelines that need to track the exact ingestion sequence of rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n7dkxlw4x3qjc6vesn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n7dkxlw4x3qjc6vesn9.png" alt="Flink data sink writing data and executing post-commit branch compaction on experimental branch" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  JSON to Variant Mapping and Spec Cleanups (PR #13137, #14045)
&lt;/h2&gt;

&lt;p&gt;The Variant type is a key part of the Iceberg V3 specification, designed to store semi-structured data using a binary representation that supports predicate pushdown. Iceberg 1.11.0 refines this integration across multiple engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Variant Type Validation (PR #13137, #14081)
&lt;/h3&gt;

&lt;p&gt;PR #13137 (by @manirajv06) and PR #14081 (by @geruh) add schema validation and filtering rules for the Variant type in Parquet metrics. &lt;/p&gt;

&lt;p&gt;These updates ensure that Parquet file readers can extract column-level statistics from nested variant structures. This allows the query engine to prune files based on nested variant fields, improving query performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trino Variant Type Mapping
&lt;/h3&gt;

&lt;p&gt;In parallel, query engine connectors are adopting these changes. Trino now maps its native &lt;code&gt;JSON&lt;/code&gt; type to Iceberg's Variant type in V3 tables. This means you can write JSON data from Trino and query it with predicate pushdown, avoiding the performance penalties of plain string JSON.&lt;/p&gt;

&lt;h3&gt;
  
  
  Positional Deletes with Row Data Deprecated (PR #14045)
&lt;/h3&gt;

&lt;p&gt;PR #14045, written by @pvary, deprecates positional delete files that embed row data. &lt;/p&gt;

&lt;p&gt;In Iceberg V2, positional delete files could store the actual deleted row data alongside the file path and row offset. While this design saved a join step during reads, it duplicated data in the delete files, increasing storage costs and metadata complexity. &lt;/p&gt;

&lt;p&gt;The community has deprecated this option in favor of Deletion Vectors, simplifying the V3 read path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Upgrade Path and Connector Compatibility
&lt;/h2&gt;

&lt;p&gt;All V3 features (manifest list encryption, deletion vectors, Variant types, geospatial types, and nanosecond timestamps) require upgrading your tables to format version 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Existing V2 Table
       │
       ├───────► Run: ALTER TABLE events SET TBLPROPERTIES ('format-version' = '3')
       │
Upgraded V3 Table
       │
       ├───────► New writes use Deletion Vectors and Variant type
       └───────► Existing data files are left untouched (no rewrite required)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The upgrade is a metadata-only operation executed using SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Upgrade an existing table to Iceberg V3 format version&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'format-version'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation updates the &lt;code&gt;format-version&lt;/code&gt; field in the table's metadata JSON. It does not rewrite your existing data files, which remain in place and continue to be readable.&lt;/p&gt;

&lt;p&gt;New writes to the table will adopt V3 features automatically. For example, subsequent update or delete statements will write deletion vectors instead of positional delete files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lifecycle Status Updates
&lt;/h3&gt;

&lt;p&gt;Before planning your migration to V3, review the engine compatibility changes in Iceberg 1.11.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Java 11 Support Dropped:&lt;/strong&gt; Iceberg 1.11.0 drops support for Java 11. Core libraries and engine connectors now require &lt;strong&gt;Java 17&lt;/strong&gt; or &lt;strong&gt;Java 21&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Spark 3.4 Support Deprecated:&lt;/strong&gt; Support for Spark 3.4 is deprecated. Teams should migrate to Spark 3.5 or Spark 4.0+.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Flink 1.19 Support Removed:&lt;/strong&gt; Flink 1.19 is no longer supported. The release adds support for &lt;strong&gt;Flink 2.1.0&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make sure all query engines and toolchains in your lakehouse deployment support Iceberg V3 and Java 17 before upgrading production tables.&lt;/p&gt;

&lt;p&gt;For more on managing query performance optimizations and table format versions inside Dremio, refer to the &lt;a href="https://www.dremio.com/blog/how-dremio-keeps-agentic-analytics-fast-without-manual-tuning/" rel="noopener noreferrer"&gt;Dremio Autonomous Performance Blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq8bcr70w3vayozhi0fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq8bcr70w3vayozhi0fe.png" alt="Table upgrade timeline showing migration SQL and deprecated connector support list" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Iceberg 1.11.0 is a significant release for the project. It moves beyond incremental enhancements to deliver major architectural updates. &lt;/p&gt;

&lt;p&gt;The unified File Format API restructures how Iceberg interacts with physical storage formats. This change makes it easier to integrate next-generation codecs designed for AI and high-performance workloads. &lt;/p&gt;

&lt;p&gt;At the same time, the stabilization of V3 features provides a production-ready path for deletion vectors, Variant data, geospatial types, and nanosecond precision. These features help organizations optimize query performance and reduce operational overhead.&lt;/p&gt;

&lt;p&gt;If you are running Iceberg V2 tables in production, evaluate your workloads to identify tables that would benefit from a V3 upgrade. Tables with frequent updates and deletes or large JSON columns are the strongest candidates, since they exercise deletion vectors and the Variant type directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build Your Data Lakehouse Expertise
&lt;/h3&gt;

&lt;p&gt;If you are designing, building, or managing modern data platforms, keeping up with table format specifications is critical. To deepen your understanding of these technologies, consider reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;"Architecting an Apache Iceberg Lakehouse"&lt;/strong&gt;: An architectural guide to designing open lakehouse platforms, managing catalog architectures, and optimizing table layouts.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Other Data Lakehouse Publications&lt;/strong&gt;: Practical books covering hidden partitioning, schema evolution, and query acceleration engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Find these books and other lakehouse learning resources at &lt;a href="https://books.alexmerced.com" rel="noopener noreferrer"&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To query your Iceberg V3 tables with automatic file layout optimization, background compaction, and zero infrastructure management, start a free trial of Dremio Cloud at &lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>security</category>
    </item>
    <item>
      <title>Semantic Layer Best Practices: 7 Mistakes to Avoid</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 18:03:06 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/semantic-layer-best-practices-7-mistakes-to-avoid-ihk</link>
      <guid>https://dev.to/alexmercedcoder/semantic-layer-best-practices-7-mistakes-to-avoid-ihk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ll1esq3m14tx2gyh5z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ll1esq3m14tx2gyh5z1.png" alt="Semantic layer best practices checklist — checks and mistakes" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Semantic layers don't fail because the technology is wrong. They fail because of design decisions made in the first two weeks — choices that seem reasonable at the time and create compounding problems for months afterward.&lt;/p&gt;

&lt;p&gt;Here are the seven mistakes that kill semantic layer projects, and how to avoid each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 1: Defining Metrics in Multiple Places
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Revenue is defined in a Tableau calculated field, a Power BI DAX measure, a dbt model, and a SQL view. Four sources of truth. None of them agree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: Teams adopt new tools without migrating metric definitions. Each tool gets its own model. Over time, the definitions drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Every metric gets exactly one canonical definition in the semantic layer. All downstream tools query that definition. No exceptions. When someone needs Revenue, they query &lt;code&gt;business.revenue&lt;/code&gt;, not their own formula.&lt;/p&gt;

&lt;p&gt;This principle extends to AI agents. If your AI generates its own metric formulas instead of referencing the semantic layer, you've just added another source of truth — the least trustworthy one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 2: Skipping the Bronze Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A data engineer creates a Silver view that joins raw source tables directly, mixing data cleanup (type casting, column renaming) with business logic (filters, calculations) in a single query. When the source schema changes — a column is renamed, a type is modified — the Silver view breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: The Bronze layer feels redundant. It's just a 1:1 mapping of the source. Why add a layer that doesn't change anything?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: The Bronze layer absorbs schema changes. When a source renames &lt;code&gt;col_7&lt;/code&gt; to &lt;code&gt;order_date_utc&lt;/code&gt;, you update one Bronze view. The Silver and Gold views above it don't change. This insulation is worth the tiny overhead of maintaining passthrough views.&lt;/p&gt;

&lt;p&gt;Bronze views also standardize data formats. Timestamps normalized to UTC. Strings cast to consistent encodings. Column names made human-readable. This cleanup happens once, at the bottom of the stack, and every view above benefits.&lt;/p&gt;
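
&lt;p&gt;The insulation effect can be sketched with SQLite standing in for the query engine. The table and view names here (&lt;code&gt;raw_orders&lt;/code&gt;, &lt;code&gt;bronze_orders&lt;/code&gt;, &lt;code&gt;silver_daily_orders&lt;/code&gt;) are hypothetical, and the upstream rename is simulated by dropping and recreating the raw table.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical raw source with a cryptic column name.
CREATE TABLE raw_orders (order_id INTEGER, col_7 TEXT);
INSERT INTO raw_orders VALUES (1, '2026-05-01'), (2, '2026-05-02');

-- Bronze: cleanup only (renaming), no business logic.
CREATE VIEW bronze_orders AS
SELECT order_id, col_7 AS order_date_utc FROM raw_orders;

-- Silver: business logic, written against Bronze names only.
CREATE VIEW silver_daily_orders AS
SELECT order_date_utc, COUNT(*) AS order_count
FROM bronze_orders
GROUP BY order_date_utc;
""")
before = conn.execute("SELECT * FROM silver_daily_orders ORDER BY 1").fetchall()

# Simulate the upstream rename of col_7 to order_date_utc, then repair
# the single Bronze view. The Silver view definition is never touched.
conn.executescript("""
DROP VIEW bronze_orders;
DROP TABLE raw_orders;
CREATE TABLE raw_orders (order_id INTEGER, order_date_utc TEXT);
INSERT INTO raw_orders VALUES (1, '2026-05-01'), (2, '2026-05-02');
CREATE VIEW bronze_orders AS
SELECT order_id, order_date_utc FROM raw_orders;
""")
after = conn.execute("SELECT * FROM silver_daily_orders ORDER BY 1").fetchall()

# True when the Silver results are identical before and after the rename.
print(before == after)
```

&lt;p&gt;Only the Bronze view was rewritten. Anything built on Silver keeps working without a code change.&lt;/p&gt;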

&lt;h2&gt;
  
  
  Mistake 3: Using SQL Reserved Words as Column Names
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferqp8jpg31vu8jwnebpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferqp8jpg31vu8jwnebpa.png" alt="Bad vs. good naming conventions — cryptic abbreviations vs. clear business names" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A Bronze view exposes a column called &lt;code&gt;Date&lt;/code&gt;. Now every downstream query must reference &lt;code&gt;"Date"&lt;/code&gt; with double quotes. Analysts forget. AI agents often omit the quotes. Queries break intermittently, and debugging is frustrating because the error messages are cryptic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: Source systems often use generic names. &lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;Timestamp&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt;, &lt;code&gt;Group&lt;/code&gt;, &lt;code&gt;Role&lt;/code&gt;: each is a reserved word or keyword in at least some SQL dialects, and the exact list varies by engine. Bronze views that don't rename them propagate the problem to every consumer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Rename early. In the Bronze layer, map &lt;code&gt;Date&lt;/code&gt; to &lt;code&gt;TransactionDate&lt;/code&gt;, &lt;code&gt;Timestamp&lt;/code&gt; to &lt;code&gt;EventTimestamp&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt; to &lt;code&gt;CustomerOrder&lt;/code&gt;. Use domain-specific prefixes that are unambiguous and never conflict with SQL keywords.&lt;/p&gt;

&lt;p&gt;This small decision saves hundreds of hours of debugging across the life of the semantic layer. It also dramatically improves AI agent accuracy, since language models generating SQL rarely add appropriate quoting for reserved words.&lt;/p&gt;
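
&lt;p&gt;Here is a small sketch of the rename-early fix, again using SQLite as a stand-in. Reserved-word lists differ by engine: SQLite happens to reserve &lt;code&gt;Order&lt;/code&gt; but accepts &lt;code&gt;Date&lt;/code&gt; unquoted, so the failing query below uses &lt;code&gt;Order&lt;/code&gt;. The &lt;code&gt;raw_sales&lt;/code&gt; table and &lt;code&gt;bronze_sales&lt;/code&gt; view are hypothetical names.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source table that uses generic, keyword-like column names.
CREATE TABLE raw_sales ("Date" TEXT, "Order" INTEGER);
INSERT INTO raw_sales VALUES ('2026-05-01', 1001);

-- Bronze view: quote the awkward names once, expose safe names to everyone else.
CREATE VIEW bronze_sales AS
SELECT "Date" AS TransactionDate, "Order" AS CustomerOrder FROM raw_sales;
""")


def runs(sql):
    """Return True if the statement executes, False on a SQL error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.OperationalError:
        return False


# The kind of query an analyst or SQL-generating agent tends to write.
unquoted_raw = runs("SELECT Order FROM raw_sales")
unquoted_bronze = runs("SELECT CustomerOrder FROM bronze_sales")
print(unquoted_raw, unquoted_bronze)
```

&lt;p&gt;The unquoted query against the raw table fails with a syntax error, while the same intent expressed against the Bronze view needs no quoting at all.&lt;/p&gt;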

&lt;h2&gt;
  
  
  Mistake 4: Building Without Stakeholder Input
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A data engineering team builds 50 Silver views based on the database schema. They expose every table, every column, every possible metric. Business users look at the result, don't recognize any of the terms, and go back to their spreadsheets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: Data engineers understand the schema. They assume the schema structure maps to business needs. It usually doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Start with a metric glossary co-created with stakeholders from Sales, Finance, Marketing, and Product. Ask them: What are your top 5 metrics? How do you calculate them? What decisions do they drive? Build the Silver layer around those answers, not around the database schema.&lt;/p&gt;

&lt;p&gt;This step feels slow. It's the fastest path to adoption. A semantic layer that uses business language and models business concepts gets adopted. A semantic layer that mirrors the database schema gets ignored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 5: Treating Documentation as Optional
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Views are created with no Wikis, no column descriptions, no Labels. The semantic layer works for the person who built it. Everyone else — analysts, AI agents, new team members — can't figure out what the views mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: Documentation takes time. Deadlines are tight. Teams plan to "add documentation later." Later never comes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Make documentation part of the view creation process, not a follow-up task. At minimum, every view gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A one-sentence description of what it represents&lt;/li&gt;
&lt;li&gt;Labels for governance (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;Column descriptions for any non-obvious field&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern platforms reduce this burden with AI-generated documentation. &lt;a href="https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Dremio's generative AI&lt;/a&gt; samples table data and auto-generates Wiki descriptions and Label suggestions. The AI provides a 70% first draft. The data team adds domain context for the other 30%.&lt;/p&gt;

&lt;p&gt;Undocumented views are invisible to AI agents. If the Wiki is empty, the AI agent has no context to generate accurate SQL. Documentation isn't just nice to have. It's an accuracy requirement.&lt;/p&gt;
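
&lt;p&gt;One way to make the minimum bar above part of the creation process is a small check that runs before a view is published. The sketch below is only an illustration: the &lt;code&gt;ViewDoc&lt;/code&gt; structure and the &lt;code&gt;documentation_gaps&lt;/code&gt; function are hypothetical and not tied to any platform's API.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class ViewDoc:
    """Hypothetical record of the documentation attached to one view."""
    name: str
    description: str = ""
    labels: list = field(default_factory=list)
    column_descriptions: dict = field(default_factory=dict)
    non_obvious_columns: list = field(default_factory=list)


def documentation_gaps(view):
    """List what is missing from the three minimum requirements."""
    gaps = []
    if not view.description.strip():
        gaps.append("missing description")
    if not view.labels:
        gaps.append("missing labels")
    for column in view.non_obvious_columns:
        if not view.column_descriptions.get(column, "").strip():
            gaps.append("missing column description: " + column)
    return gaps


view = ViewDoc(
    name="business.revenue",
    description="Recognized revenue by region.",
    labels=["Finance", "Certified"],
    non_obvious_columns=["net_amt"],
)
gaps = documentation_gaps(view)
print(gaps)
```

&lt;p&gt;A check like this could run in code review or CI, so a view with an empty description never reaches consumers.&lt;/p&gt;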

&lt;h2&gt;
  
  
  Mistake 6: Applying Security at the BI Tool Level Only
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Row-level security is configured in Tableau so regional managers only see their region. Then an analyst opens a SQL client, queries the underlying table directly, and sees all regions. The security was enforced in the dashboard, not in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: BI tools make it easy to apply filters and security rules. Data platforms require more setup. Teams take the easy path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Enforce access policies at the semantic layer, not the BI layer. Row-level security and column masking should be applied on the virtual datasets (views). Every query path — dashboard, notebook, API, AI agent — inherits the same rules.&lt;/p&gt;

&lt;p&gt;Dremio implements this through Fine-Grained Access Control (FGAC): policies defined as UDFs at the view level. A regional manager queries &lt;code&gt;business.revenue&lt;/code&gt; and automatically sees only their region, regardless of how they access the data. No security gaps between tools.&lt;/p&gt;
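
&lt;p&gt;The idea of putting the filter in the view can be sketched as follows. This is not Dremio's FGAC syntax. SQLite has no user accounts, so a hypothetical &lt;code&gt;session_context&lt;/code&gt; table stands in for the engine's notion of the current user, and &lt;code&gt;user_regions&lt;/code&gt; is an assumed entitlement table. A real deployment would also have to block direct access to the base table.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('EMEA', 100.0), ('APAC', 200.0);

-- Assumed entitlement table: which user may see which region.
CREATE TABLE user_regions (user_name TEXT, region TEXT);
INSERT INTO user_regions VALUES ('emea_manager', 'EMEA');

-- Stand-in for the engine's notion of the current user.
CREATE TABLE session_context (user_name TEXT);

-- The row filter lives in the view, so every client inherits it.
CREATE VIEW business_revenue AS
SELECT s.region, s.amount
FROM sales AS s
JOIN user_regions AS u ON u.region = s.region
JOIN session_context AS c ON c.user_name = u.user_name;
""")


def query_as(user_name, sql):
    """Run a query as the given user by setting the simulated session."""
    conn.execute("DELETE FROM session_context")
    conn.execute("INSERT INTO session_context VALUES (?)", (user_name,))
    return conn.execute(sql).fetchall()


# Two different access paths, same view, same rows.
dashboard_rows = query_as("emea_manager", "SELECT region, amount FROM business_revenue")
notebook_rows = query_as("emea_manager", "SELECT SUM(amount) FROM business_revenue")
unmapped_rows = query_as("unmapped_analyst", "SELECT region, amount FROM business_revenue")
print(dashboard_rows, notebook_rows, unmapped_rows)
```

&lt;p&gt;Because the policy travels with the view, the path used to reach the data does not change which rows come back.&lt;/p&gt;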

&lt;h2&gt;
  
  
  Mistake 7: Trying to Model Everything at Once
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq07sz9l2pa6e76xlswu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq07sz9l2pa6e76xlswu.png" alt="Incremental growth — from a small core to a comprehensive semantic layer" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: The team commits to building a complete semantic layer covering every source, every table, and every metric. The project takes six months. By the time it launches, requirements have changed, stakeholder interest has waned, and half the views are out of date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's common&lt;/strong&gt;: Ambitious leaders want a "complete" solution. Data teams want to avoid rework. Neither wants to ship an incomplete layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Start with 3-5 core metrics that the organization actively debates (usually Revenue, Active Users, Churn). Build one Bronze → Silver → Gold pipeline per metric. Validate that the same question produces the same answer across two different tools.&lt;/p&gt;

&lt;p&gt;Once those metrics are stable, expand incrementally. Add new sources, new views, new metrics — one at a time. Each addition is low-risk because the layered architecture isolates changes. A new Gold view doesn't affect existing Silver views.&lt;/p&gt;

&lt;p&gt;The semantic layers that reach broad organizational coverage fastest get there not by modeling everything up front, but by proving value quickly and expanding from that momentum.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do Next
&lt;/h2&gt;

&lt;p&gt;Pick one mistake from this list. Check whether your semantic layer (or your plan for one) is making it. Fix that one thing this week. Then come back for the next one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How a Self-Documenting Semantic Layer Reduces Data Team Toil</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 17:57:43 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/how-a-self-documenting-semantic-layer-reduces-data-team-toil-322i</link>
      <guid>https://dev.to/alexmercedcoder/how-a-self-documenting-semantic-layer-reduces-data-team-toil-322i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26ntxrsg7bswuwemstxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26ntxrsg7bswuwemstxd.png" alt="Self-documenting semantic layer — AI generating descriptions and labels automatically" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every data team knows documentation is important. And almost every data team has a backlog of undocumented tables, unlabeled columns, and outdated descriptions that nobody has time to fix. The problem isn't motivation. It's that manual documentation doesn't scale.&lt;/p&gt;

&lt;p&gt;A self-documenting semantic layer changes the equation. Instead of asking humans to describe every column in every table, the platform generates descriptions automatically, suggests governance labels from data patterns, and propagates context through the view chain. Documentation becomes a byproduct of building the semantic layer, not a separate project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Documentation Problem Nobody Solves
&lt;/h2&gt;

&lt;p&gt;Industry surveys consistently find that 70% or more of enterprise data assets are undocumented or poorly documented. The result: analysts spend 30-40% of their time searching for data and trying to understand what it means before they can start analyzing it.&lt;/p&gt;

&lt;p&gt;This isn't just a productivity problem. Undocumented data is a governance risk. A column named &lt;code&gt;status&lt;/code&gt; with values 0, 1, 2, and 3 could mean anything. An analyst guesses. An AI agent guesses worse. Nobody verifies. The wrong assumptions get baked into dashboards that drive business decisions.&lt;/p&gt;

&lt;p&gt;Data teams respond with documentation sprints. They burn a week writing Wiki pages for their top 50 tables. Two months later, half the descriptions are outdated because schemas have changed. The cycle repeats.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Self-Documenting Actually Means
&lt;/h2&gt;

&lt;p&gt;A self-documenting semantic layer generates and maintains documentation with minimal human effort. Three mechanisms work together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-generated descriptions&lt;/strong&gt;: The platform samples data in a table and generates human-readable descriptions for each column and the table itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated label suggestions&lt;/strong&gt;: The platform analyzes column names, data types, and value patterns to suggest governance labels (PII, Finance, Certified).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata propagation&lt;/strong&gt;: When a Silver view references a Bronze view, column descriptions flow downstream automatically. Documentation written once at the Bronze level appears everywhere the column is used.&lt;/p&gt;

&lt;p&gt;Human oversight is still essential. AI provides a 70% first draft. Data engineers add the domain-specific context that only they know: business rules, edge cases, known data quality issues. The point isn't to eliminate human documentation. It's to eliminate the blank page.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI-Generated Descriptions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx0efa5fwory6g83ejs4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx0efa5fwory6g83ejs4.png" alt="AI scanning data tables and generating documentation automatically" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern semantic layer platforms can sample a table's data and generate meaningful descriptions automatically.&lt;/p&gt;

&lt;p&gt;Consider a column named &lt;code&gt;cltv&lt;/code&gt; in a table called &lt;code&gt;customers&lt;/code&gt;. The AI samples values (1200.50, 3400.00, 780.25), examines the column name and table context, and generates:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;cltv&lt;/strong&gt;: Customer Lifetime Value in USD. Represents the total revenue attributed to this customer from their first purchase to the current date, excluding refunded transactions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not every generated description will be this precise. But most are useful enough to replace the current state: an empty description that tells the analyst nothing.&lt;/p&gt;

&lt;p&gt;More examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A column with values "US", "GB", "DE" → "ISO 3166 alpha-2 country code for the customer's billing address"&lt;/li&gt;
&lt;li&gt;A DATE column named &lt;code&gt;created_at&lt;/code&gt; in a &lt;code&gt;subscriptions&lt;/code&gt; table → "Date the subscription was created"&lt;/li&gt;
&lt;li&gt;A FLOAT column named &lt;code&gt;mrr&lt;/code&gt; → "Monthly Recurring Revenue in the account's base currency"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Automated Label Suggestions
&lt;/h2&gt;

&lt;p&gt;Labels categorize data for governance and discovery. Manually tagging every column in a data warehouse with hundreds of tables is impractical. AI-based label suggestion makes it manageable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Columns containing email-like patterns (text with @ symbols) → suggested label: &lt;strong&gt;PII&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Columns with phone number patterns → suggested label: &lt;strong&gt;PII&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Columns named &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt; → suggested label: &lt;strong&gt;Finance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Columns in tables marked "Certified" → suggested label propagated to downstream views&lt;/li&gt;
&lt;/ul&gt;
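
&lt;p&gt;The first three rules in the list above can be approximated with plain pattern matching. The sketch below is a simplified stand-in for what an AI-based suggester does, not how any particular product works. The regular expressions are deliberately narrow assumptions (the phone pattern only covers North American style numbers), and every suggestion would still go to a human for approval.&lt;/p&gt;

```python
import re

# Deliberately simple patterns, chosen for illustration only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE = re.compile(r"^(\+\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$")
FINANCE_NAMES = {"price", "total", "amount", "revenue"}


def suggest_labels(column_name, sample_values):
    """Suggest governance labels from a column name and sampled values."""
    values = [str(value) for value in sample_values if value is not None]
    labels = set()
    if values and all(EMAIL.match(value) for value in values):
        labels.add("PII")
    if values and all(PHONE.match(value) for value in values):
        labels.add("PII")
    if column_name.lower() in FINANCE_NAMES:
        labels.add("Finance")
    return sorted(labels)


print(suggest_labels("contact_email", ["ana@example.com", "li@example.org"]))
print(suggest_labels("amount", [1200.50, 780.25]))
print(suggest_labels("status", [0, 1, 2]))
```

&lt;p&gt;Requiring every sampled value to match keeps false positives down, at the cost of missing columns with dirty data. That trade-off is one reason the review step matters.&lt;/p&gt;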

&lt;p&gt;&lt;a href="https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Dremio's approach&lt;/a&gt; combines these suggestions with human approval. The AI proposes labels. A data engineer reviews and accepts or rejects. Over time, the catalog fills up with accurate, useful labels without dedicated labeling sprints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Propagation Through Views
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadlmenght4l27088rsyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadlmenght4l27088rsyt.png" alt="Metadata flowing through Bronze, Silver, and Gold view layers" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a well-designed semantic layer, documentation shouldn't need to be written more than once. The Bronze-Silver-Gold view architecture creates a natural propagation path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bronze layer&lt;/strong&gt;: Document the &lt;code&gt;CustomerID&lt;/code&gt; column as "Unique identifier for the customer, sourced from the CRM system."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver layer&lt;/strong&gt;: A Silver view references &lt;code&gt;CustomerID&lt;/code&gt;. The description propagates automatically. No re-documentation needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold layer&lt;/strong&gt;: An aggregated Gold view groups by &lt;code&gt;CustomerID&lt;/code&gt;. The description carries through.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This propagation is especially valuable for join columns, filter columns, and commonly used dimensions that appear in dozens of views. Write the description once at the source, and it follows the column everywhere.&lt;/p&gt;
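
&lt;p&gt;Conceptually, propagation is a lookup that walks column lineage upstream until it finds a description. The sketch below assumes a hypothetical in-memory catalog; the view names and the &lt;code&gt;lineage&lt;/code&gt; mapping are invented for the example and do not reflect any platform's internal model.&lt;/p&gt;

```python
# Hypothetical catalog: (view, column) maps to its own description, if any.
descriptions = {
    ("bronze.customers", "CustomerID"):
        "Unique identifier for the customer, sourced from the CRM system.",
}

# Hypothetical lineage: (view, column) maps to the upstream (view, column).
lineage = {
    ("silver.customer_orders", "CustomerID"): ("bronze.customers", "CustomerID"),
    ("gold.revenue_by_customer", "CustomerID"): ("silver.customer_orders", "CustomerID"),
}


def resolve_description(view, column):
    """Walk upstream until a description is found, guarding against cycles."""
    key = (view, column)
    seen = set()
    while key is not None and key not in seen:
        if key in descriptions:
            return descriptions[key]
        seen.add(key)
        key = lineage.get(key)
    return None


print(resolve_description("gold.revenue_by_customer", "CustomerID"))
```

&lt;p&gt;A description written directly on a Silver or Gold column would take precedence here, because the walk stops at the first description it finds.&lt;/p&gt;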

&lt;h2&gt;
  
  
  How This Reduces Toil
&lt;/h2&gt;

&lt;p&gt;The impact on data team workload shows up task by task:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Documentation Task&lt;/th&gt;
&lt;th&gt;Manual Approach&lt;/th&gt;
&lt;th&gt;Self-Documenting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Column descriptions&lt;/td&gt;
&lt;td&gt;Write each by hand&lt;/td&gt;
&lt;td&gt;AI generates draft, human refines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance labels&lt;/td&gt;
&lt;td&gt;Manual tagging sprint&lt;/td&gt;
&lt;td&gt;AI suggests from data patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downstream view docs&lt;/td&gt;
&lt;td&gt;Re-write for each view&lt;/td&gt;
&lt;td&gt;Propagated from upstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema change updates&lt;/td&gt;
&lt;td&gt;Manually check and update&lt;/td&gt;
&lt;td&gt;AI re-scans and flags changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New table onboarding&lt;/td&gt;
&lt;td&gt;Create from scratch&lt;/td&gt;
&lt;td&gt;AI generates baseline immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The net effect: documentation coverage goes from 30% (what the team could manage manually) to 80-90% (AI baseline + human refinement). The team spends hours instead of weeks on documentation. And the documentation stays current because the AI can re-scan when schemas change — flagging outdated descriptions instead of waiting for someone to notice.&lt;/p&gt;
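
&lt;p&gt;The flagging step can be approximated without any AI at all by diffing two schema snapshots. The sketch below assumes schemas are available as simple column-to-type mappings; the function name and the sample columns are hypothetical.&lt;/p&gt;

```python
def flag_stale_descriptions(old_schema, new_schema, descriptions):
    """Compare schema snapshots and flag descriptions that may be outdated."""
    flagged = {}
    for column in descriptions:
        if column not in new_schema:
            flagged[column] = "column removed or renamed"
        elif old_schema.get(column) != new_schema[column]:
            flagged[column] = "type changed"
    # New columns with no description yet are candidates for a generated draft.
    undocumented = sorted(c for c in new_schema if c not in descriptions)
    return flagged, undocumented


old_schema = {"mrr": "FLOAT", "created_at": "DATE", "status": "INT"}
new_schema = {"mrr": "DECIMAL", "created_at": "DATE", "plan": "TEXT"}
descriptions = {
    "mrr": "Monthly Recurring Revenue in the account's base currency",
    "created_at": "Date the subscription was created",
    "status": "Subscription status code",
}

flagged, undocumented = flag_stale_descriptions(old_schema, new_schema, descriptions)
print(flagged)
print(undocumented)
```

&lt;p&gt;The flagged columns go back to a human for review, and the undocumented ones are where a generated first draft saves the most time.&lt;/p&gt;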

&lt;p&gt;For AI agents, this improvement is material. A richer, more accurate semantic layer means the AI generates better SQL, hallucinates less, and requires fewer corrections. Self-documentation isn't just a productivity feature. It's an AI accuracy feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do Next
&lt;/h2&gt;

&lt;p&gt;Pick your most-used table. Open it in your data platform. How many columns have descriptions? How many have governance labels? If the answer is "not many," calculate how long it would take to document the entire table manually. Then consider a platform that does 70% of that work for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>dataengineering</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Migrating to Apache Iceberg: Strategies for Every Source System</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 17:48:08 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/migrating-to-apache-iceberg-strategies-for-every-source-system-424j</link>
      <guid>https://dev.to/alexmercedcoder/migrating-to-apache-iceberg-strategies-for-every-source-system-424j</guid>
      <description>&lt;p&gt;This is Part 15, the final article of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Part 14&lt;/a&gt; covered hands-on Dremio Cloud. This article covers the three migration strategies and how to execute a zero-downtime migration using the view swap pattern.&lt;/p&gt;

&lt;p&gt;Most organizations do not start with Iceberg. They have years of data in Hive tables, data warehouses, CSV files, databases, and Parquet directories. Moving this data to Iceberg is not an all-or-nothing project. The best migrations happen incrementally, one dataset at a time, with no disruption to existing consumers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Three Migration Strategies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0iy5hxts82n31d0w2uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0iy5hxts82n31d0w2uv.png" alt="Three paths to Iceberg: in-place migration, full rewrite, and shadow migration" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. In-Place Migration (Metadata Only)
&lt;/h3&gt;

&lt;p&gt;In-place migration creates Iceberg metadata over existing Parquet or ORC files without copying or moving them. The data files stay exactly where they are; only new Iceberg metadata is created to track them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
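&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Iceberg procedures live under a catalog's system namespace; if that catalog is not the current one, qualify the call as catalog_name.system.migrate&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;migrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'db.existing_hive_table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;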
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This converts a Hive table to Iceberg by scanning its files and creating the Iceberg metadata tree (metadata.json, manifest list, manifest files) that references them. The Parquet files are untouched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Fast. No data movement. The table becomes queryable as Iceberg immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; The existing file layout (sizes, partitioning, sort order) is inherited. If the original files are poorly organized, you inherit those problems. Requires the original files to be in Parquet or ORC format.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Full Rewrite (CTAS)
&lt;/h3&gt;

&lt;p&gt;A full rewrite reads data from any source and writes it as a new Iceberg table with optimal partitioning and file sizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Spark&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;

&lt;span class="c1"&gt;-- Dremio&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;legacy_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Best result. Optimal file sizes, correct sort order, proper partitioning. The table is perfectly organized from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Requires reading and writing all data, which takes time and compute resources. The source system must be available during the migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Shadow Migration (Build and Swap)
&lt;/h3&gt;

&lt;p&gt;Shadow migration builds the Iceberg table alongside the existing source, then swaps consumers from old to new when ready:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new Iceberg table with the desired schema and partitioning&lt;/li&gt;
&lt;li&gt;Backfill historical data from the legacy source&lt;/li&gt;
&lt;li&gt;Set up incremental sync to keep the Iceberg table current&lt;/li&gt;
&lt;li&gt;Validate data quality between old and new&lt;/li&gt;
&lt;li&gt;Swap consumer views from legacy to Iceberg&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero downtime. Consumers never see a disruption. You can validate the migration before committing to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Temporarily doubles storage costs. Requires maintaining two copies during the transition.&lt;/p&gt;
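
&lt;p&gt;The five steps above can be walked through end to end in miniature. In this sketch SQLite stands in for both systems, a plain table named &lt;code&gt;iceberg_orders&lt;/code&gt; stands in for the Iceberg table, and the validation is a basic profile comparison (row count, distinct keys, summed amount). All names are hypothetical, and a production check would compare far more than this.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE legacy_orders (order_id INTEGER, amount REAL);
INSERT INTO legacy_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
-- Consumers only ever query this view.
CREATE VIEW orders AS SELECT order_id, amount FROM legacy_orders;

-- Step 1: create the new table (a plain table stands in for Iceberg).
CREATE TABLE iceberg_orders (order_id INTEGER, amount REAL);
-- Step 2: backfill historical data.
INSERT INTO iceberg_orders SELECT order_id, amount FROM legacy_orders;
""")

# Step 3: incremental sync. Copy rows that arrived after the backfill.
conn.execute("INSERT INTO legacy_orders VALUES (4, 40.0)")
conn.execute("""
INSERT INTO iceberg_orders
SELECT order_id, amount FROM legacy_orders
WHERE order_id NOT IN (SELECT order_id FROM iceberg_orders)
""")
conn.commit()


def profile(table):
    """Cheap fingerprint of a table: row count, distinct keys, summed amount."""
    # Table names come from this script, never from user input.
    return conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT order_id), TOTAL(amount) FROM " + table
    ).fetchone()


# Step 4: validate old against new before touching consumers.
validated = profile("legacy_orders") == profile("iceberg_orders")

# Step 5: swap the consumer view inside one transaction.
if validated:
    conn.executescript("""
    BEGIN;
    DROP VIEW orders;
    CREATE VIEW orders AS SELECT order_id, amount FROM iceberg_orders;
    COMMIT;
    """)

swapped_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(validated, swapped_rows)
```

&lt;p&gt;Consumers keep querying &lt;code&gt;orders&lt;/code&gt; throughout. The only thing that changes is which table the view reads from, and the swap only happens after validation passes.&lt;/p&gt;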

&lt;h2&gt;
  
  
  Choosing the Right Strategy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8bce4men7d57s0al4wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8bce4men7d57s0al4wv.png" alt="Decision tree for selecting the right migration strategy based on downtime tolerance and layout changes" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Recommended Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hive table (Parquet files)&lt;/td&gt;
&lt;td&gt;In-place migration, then compact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data warehouse (Snowflake, Redshift)&lt;/td&gt;
&lt;td&gt;Full rewrite via &lt;a href="https://www.dremio.com/platform/federation/" rel="noopener noreferrer"&gt;Dremio federation&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV/JSON files in S3&lt;/td&gt;
&lt;td&gt;Full rewrite with &lt;a href="https://www.dremio.com/blog/ingesting-data-into-apache-iceberg-tables-with-dremio/" rel="noopener noreferrer"&gt;COPY INTO&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL/MySQL&lt;/td&gt;
&lt;td&gt;Full rewrite or shadow migration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta Lake tables&lt;/td&gt;
&lt;td&gt;In-place conversion or rewrite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production system (no downtime)&lt;/td&gt;
&lt;td&gt;Shadow migration with view swap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
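
&lt;p&gt;The table and decision tree can be read as a small rule set. The sketch below is one hedged interpretation in Python, not an official rubric: the function name, its three boolean inputs, and the order in which the rules fire are all assumptions made for illustration.&lt;/p&gt;

```python
def choose_strategy(tolerates_downtime, files_already_parquet, needs_new_layout):
    """Pick a migration strategy from the trade-offs described above.

    All three parameters are hypothetical flags, not part of any Dremio or
    Iceberg API.
    """
    # No downtime allowed: build alongside the source and swap views later.
    if not tolerates_downtime:
        return "shadow migration with view swap"
    # Existing Parquet files with an acceptable layout can be adopted as-is.
    if files_already_parquet and not needs_new_layout:
        return "in-place migration, then compact"
    # Everything else is read from the source and rewritten into Iceberg.
    return "full rewrite"


# Hypothetical inputs, loosely matching rows of the table above.
hive_parquet = choose_strategy(True, True, False)
csv_in_s3 = choose_strategy(True, False, True)
production = choose_strategy(False, True, False)
```

&lt;p&gt;Real decisions weigh more than three flags (data volume, compute budget, source availability), so treat the result as a starting point for discussion.&lt;/p&gt;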

&lt;h2&gt;
  
  
  The View Swap Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hiub3xku09caj36l263.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hiub3xku09caj36l263.png" alt="The zero-downtime view swap pattern: views point to legacy first, then switch to Iceberg" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The view swap pattern is the recommended approach for production migrations. It uses &lt;a href="https://www.dremio.com/platform/semantic-layer/" rel="noopener noreferrer"&gt;Dremio's semantic layer&lt;/a&gt; to create an abstraction between consumers and the underlying data:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Federation
&lt;/h3&gt;

&lt;p&gt;Create views in Dremio that point to the legacy data source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;postgres_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All consumers (dashboards, reports, notebooks) query through these views. They do not know or care where the data physically lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Build Iceberg
&lt;/h3&gt;

&lt;p&gt;Create and populate the Iceberg table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create the Iceberg table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;-- Backfill from the legacy source&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;postgres_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 3: Validate
&lt;/h3&gt;

&lt;p&gt;Compare the two datasets to confirm data integrity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;postgres_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;legacy_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;iceberg_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beyond row counts, validate aggregates (total amounts, distinct customer counts) and spot-check individual records. A comprehensive validation script should compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total row count&lt;/li&gt;
&lt;li&gt;Column-level checksums or hash aggregates&lt;/li&gt;
&lt;li&gt;Distinct value counts for key columns&lt;/li&gt;
&lt;li&gt;Boundary values (MIN/MAX) for numeric and date columns&lt;/li&gt;
&lt;li&gt;Sample of specific records matched by primary key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only proceed to the swap after all validation checks pass.&lt;/p&gt;
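
&lt;p&gt;As a rough illustration of the checklist above, the following Python sketch profiles two tables with the same aggregate query and reports every check that differs. It is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for both the legacy source and the Iceberg table, the table names &lt;code&gt;legacy_orders&lt;/code&gt; and &lt;code&gt;iceberg_orders&lt;/code&gt; and the helper functions are hypothetical, and amounts are stored as integer cents to avoid floating-point noise. Against real systems you would run the equivalent query through each engine's own client.&lt;/p&gt;

```python
import sqlite3

# One aggregate query covering row count, a sum, a distinct count, and
# boundary values. Column names follow the article's orders example.
CHECKS_SQL = """
SELECT
    COUNT(*)                    AS row_count,
    SUM(amount)                 AS total_amount,
    COUNT(DISTINCT customer_id) AS distinct_customers,
    MIN(order_date)             AS min_order_date,
    MAX(order_date)             AS max_order_date
FROM {table}
"""


def profile(conn, table):
    """Return the aggregate profile of one table as a dict."""
    cur = conn.execute(CHECKS_SQL.format(table=table))
    names = [col[0] for col in cur.description]
    return dict(zip(names, cur.fetchone()))


def compare(conn, legacy, candidate):
    """Return {check: (legacy_value, candidate_value)} for every mismatch."""
    old, new = profile(conn, legacy), profile(conn, candidate)
    return {k: (old[k], new[k]) for k in old if old[k] != new[k]}


# Demo data: SQLite stands in for both systems (an assumption for runnability).
conn = sqlite3.connect(":memory:")
for name in ("legacy_orders", "iceberg_orders"):
    conn.execute(
        f"CREATE TABLE {name} "
        "(order_id INTEGER, customer_id INTEGER, order_date TEXT, amount INTEGER)"
    )
rows = [(1, 10, "2024-01-01", 500), (2, 11, "2024-01-02", 250), (3, 10, "2024-01-03", 125)]
conn.executemany("INSERT INTO legacy_orders VALUES (?, ?, ?, ?)", rows)
# Simulate an incomplete backfill: the last row never arrived.
conn.executemany("INSERT INTO iceberg_orders VALUES (?, ?, ?, ?)", rows[:2])

mismatches = compare(conn, "legacy_orders", "iceberg_orders")
for check, (old_value, new_value) in mismatches.items():
    print(f"{check}: legacy={old_value} iceberg={new_value}")
```

&lt;p&gt;An empty result from &lt;code&gt;compare&lt;/code&gt; means the aggregate checks agree. It does not replace spot-checking individual records by primary key.&lt;/p&gt;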

&lt;h3&gt;
  
  
  Phase 4: Swap
&lt;/h3&gt;

&lt;p&gt;Update the view to point to the Iceberg table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consumers notice nothing. The view name is the same. The query interface is the same. But now the data is served from Iceberg with all of its advantages: &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;time travel&lt;/a&gt;, &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;hidden partitioning&lt;/a&gt;, &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;metadata-driven pruning&lt;/a&gt;, and &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;automatic optimization&lt;/a&gt;.&lt;/p&gt;
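
&lt;p&gt;To see why consumers are unaffected, here is a small Python sketch of the swap using an in-memory SQLite database as a stand-in for Dremio. The table names &lt;code&gt;legacy_orders&lt;/code&gt; and &lt;code&gt;iceberg_orders&lt;/code&gt; and both helper functions are hypothetical. SQLite has no &lt;code&gt;CREATE OR REPLACE VIEW&lt;/code&gt;, so the sketch drops and re-creates the view inside one transaction instead.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE legacy_orders  (order_id INTEGER, amount INTEGER);
CREATE TABLE iceberg_orders (order_id INTEGER, amount INTEGER);
INSERT INTO legacy_orders  VALUES (1, 500), (2, 250);
INSERT INTO iceberg_orders VALUES (1, 500), (2, 250);
-- Phase 1: consumers only ever see the view name.
CREATE VIEW orders AS SELECT order_id, amount FROM legacy_orders;
""")


def consumer_query(conn):
    """What a dashboard would run; it never names a physical table."""
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()


def view_source(conn):
    """Return the stored definition of the orders view."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'orders'"
    ).fetchone()
    return row[0]


before = consumer_query(conn)

# Phase 4: repoint the view. The drop and re-create share one transaction
# so no reader observes a moment where the view is missing.
conn.executescript("""
BEGIN;
DROP VIEW orders;
CREATE VIEW orders AS SELECT order_id, amount FROM iceberg_orders;
COMMIT;
""")

after = consumer_query(conn)
print(before, after)
print(view_source(conn))
```

&lt;p&gt;The consumer query text is identical before and after the swap; only the view definition changes.&lt;/p&gt;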

&lt;h2&gt;
  
  
  Migrating One Table at a Time
&lt;/h2&gt;

&lt;p&gt;The view swap pattern enables incremental migration. You do not need to migrate everything at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1:&lt;/strong&gt; Migrate the highest-value table (e.g., orders)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2:&lt;/strong&gt; Migrate the next table (e.g., customers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continue&lt;/strong&gt; until all critical tables are on Iceberg&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;During the transition, &lt;a href="https://www.dremio.com/platform/federation/" rel="noopener noreferrer"&gt;Dremio's federation&lt;/a&gt; queries legacy and Iceberg tables together. A join between a PostgreSQL table and an Iceberg table works the same as a join between two Iceberg tables. The migration is invisible to consumers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Migration Checklist
&lt;/h2&gt;

&lt;p&gt;After migrating each table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;OPTIMIZE TABLE&lt;/a&gt; to ensure optimal file sizes&lt;/li&gt;
&lt;li&gt;Set up automatic optimization through &lt;a href="https://www.dremio.com/platform/open-catalog/" rel="noopener noreferrer"&gt;Dremio Open Catalog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Add wikis and tags for the &lt;a href="https://www.dremio.com/platform/ai/" rel="noopener noreferrer"&gt;AI agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Run health checks against the &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;metadata tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Decommission the legacy source after the retention period&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Migration Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Migrating without testing query performance:&lt;/strong&gt; Always benchmark critical queries against the new Iceberg table before switching production traffic. Iceberg's partition layout and file organization affect performance, and a migration can make some queries faster but others slower if the partition strategy is wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping the validation phase:&lt;/strong&gt; Data discrepancies between the old and new systems are more common than expected. Schema differences, timezone handling, null semantics, and data type precision can all cause subtle mismatches. Validate thoroughly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrating everything at once:&lt;/strong&gt; Large "big bang" migrations carry high risk. If something goes wrong, rolling back is complex and time-consuming. Migrate one table at a time, validate each one, and build confidence incrementally.&lt;/p&gt;

&lt;p&gt;This completes the Apache Iceberg Masterclass. The series covered table formats, metadata, performance, partitioning, writes, catalogs, maintenance, tooling, and migration. For hands-on practice, start a &lt;a href="https://www.dremio.com/get-started/" rel="noopener noreferrer"&gt;Dremio Cloud trial&lt;/a&gt; and follow the workflow in &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Part 14&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Hands-On with Apache Iceberg Using Dremio Cloud</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 17:19:31 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/hands-on-with-apache-iceberg-using-dremio-cloud-fa4</link>
      <guid>https://dev.to/alexmercedcoder/hands-on-with-apache-iceberg-using-dremio-cloud-fa4</guid>
      <description>&lt;p&gt;This is Part 14 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Part 13&lt;/a&gt; covered streaming approaches. This article is a practical walkthrough of working with Iceberg on &lt;a href="https://www.dremio.com/get-started/" rel="noopener noreferrer"&gt;Dremio Cloud&lt;/a&gt;, covering table creation, data ingestion, optimization, semantic layer construction, and AI-powered analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftleblo3ht3ld9kq447vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftleblo3ht3ld9kq447vn.png" alt="From zero to Iceberg in six steps on Dremio Cloud" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Sign Up and Connect Storage
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started/" rel="noopener noreferrer"&gt;Create a Dremio Cloud account&lt;/a&gt; (free trial available)&lt;/li&gt;
&lt;li&gt;Add a cloud storage source (S3, ADLS, or GCS) through the Sources panel&lt;/li&gt;
&lt;li&gt;Configure credentials and target bucket&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dremio creates an &lt;a href="https://www.dremio.com/platform/open-catalog/" rel="noopener noreferrer"&gt;Open Catalog&lt;/a&gt; for your Iceberg tables automatically. This Polaris-based catalog handles metadata management, access control, and automatic optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create Iceberg Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a table with &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;hidden partitioning&lt;/a&gt; by day. Users query on &lt;code&gt;order_date&lt;/code&gt; naturally; the engine handles partition pruning automatically.&lt;/p&gt;
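
&lt;p&gt;A rough way to picture what the &lt;code&gt;day&lt;/code&gt; transform does: the Iceberg spec defines it as the number of days since 1970-01-01, and a filter on &lt;code&gt;order_date&lt;/code&gt; can be mapped to a range of those values. The Python sketch below illustrates the idea only. The function names are hypothetical, and real engines prune using partition metadata in manifests, not a Python set.&lt;/p&gt;

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def day_transform(d):
    """Days since 1970-01-01, the value the day transform derives from a date."""
    return (d - EPOCH).days


def partitions_to_scan(partition_values, start, end):
    """Keep only partitions whose day value falls inside [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    return sorted(p for p in partition_values if p in range(lo, hi + 1))


# Partitions that exist because rows were written for these order dates.
written = {day_transform(date(2024, 1, d)) for d in (1, 2, 3, 15)}

# A filter such as order_date BETWEEN 2024-01-02 AND 2024-01-03 only needs
# the partitions for those two days.
selected = partitions_to_scan(written, date(2024, 1, 2), date(2024, 1, 3))
print(selected)
```

&lt;p&gt;The user never writes the transform in a query. The filter stays on &lt;code&gt;order_date&lt;/code&gt;, and the mapping to partition values happens inside the engine.&lt;/p&gt;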

&lt;h3&gt;
  
  
  Step 3: Ingest Data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From files in object storage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'@my_s3_source/raw/orders/'&lt;/span&gt;
&lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="s1"&gt;'parquet'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From another table or source:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;postgres_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.dremio.com/platform/federation/" rel="noopener noreferrer"&gt;Dremio's federation&lt;/a&gt; can query data in PostgreSQL, MySQL, Oracle, MongoDB, S3 files, and other sources directly. You can migrate data into Iceberg tables with a single INSERT...SELECT statement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dremio Platform
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio8ebk0dhghv203ygz2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio8ebk0dhghv203ygz2l.png" alt="Dremio Cloud features for Iceberg including Open Catalog, federation, semantic layer, and AI" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Columnar Cloud Cache
&lt;/h3&gt;

&lt;p&gt;Dremio's &lt;a href="https://www.dremio.com/blog/dremios-columnar-cloud-cache-c3/" rel="noopener noreferrer"&gt;Columnar Cloud Cache (C3)&lt;/a&gt; stores frequently accessed Iceberg data on local NVMe SSDs attached to the query engine nodes. When a query accesses data for the first time, Dremio caches the relevant columns locally. Subsequent queries against the same data read from local SSD instead of remote object storage, reducing latency from hundreds of milliseconds to single-digit milliseconds.&lt;/p&gt;

&lt;p&gt;C3 operates transparently. You do not need to configure which data to cache. Dremio tracks access patterns and caches the most-queried data automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting BI Tools
&lt;/h3&gt;

&lt;p&gt;Dremio exposes Iceberg data through ODBC, JDBC, and Arrow Flight endpoints. Any BI tool (Tableau, Power BI, Looker, Superset) can connect to Dremio and query Iceberg tables as if they were a traditional database. The semantic layer ensures consistent governance and naming across all connected tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Layer
&lt;/h3&gt;

&lt;p&gt;Dremio's &lt;a href="https://www.dremio.com/platform/semantic-layer/" rel="noopener noreferrer"&gt;semantic layer&lt;/a&gt; lets you create governed SQL views that serve as the interface between raw data and consumers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add wikis and tags to views and tables through the Dremio UI. These descriptions help other users find and understand data, and they power the &lt;a href="https://www.dremio.com/platform/ai/" rel="noopener noreferrer"&gt;AI agent's&lt;/a&gt; ability to generate accurate SQL from natural language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflections (Query Acceleration)
&lt;/h3&gt;

&lt;p&gt;Dremio Reflections are precomputed materializations that automatically accelerate queries without requiring changes to your SQL. When you create a reflection on a view or table, Dremio precomputes the results and stores them as optimized Iceberg tables on fast storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create an aggregation reflection for fast dashboard queries&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_orders&lt;/span&gt;
  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;AGGREGATE&lt;/span&gt; &lt;span class="n"&gt;REFLECTION&lt;/span&gt; &lt;span class="n"&gt;customer_orders_agg&lt;/span&gt;
  &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DIMENSIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;MEASURES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_spend&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a query matches the reflection's definition, Dremio serves it from the precomputed data instead of scanning the full table. Queries that take 30 seconds against raw data can complete in under 1 second with reflections. The query optimizer chooses the reflection transparently, so users and applications do not need to know reflections exist.&lt;/p&gt;
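
&lt;p&gt;The core idea behind an aggregate materialization can be sketched in a few lines. The Python example below uses an in-memory SQLite database and a hand-built table, &lt;code&gt;orders_agg&lt;/code&gt;, as a stand-in for the precomputed data. It is an illustration of the rollup principle only, with hypothetical table names, and says nothing about how Dremio stores or matches reflections internally.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, region TEXT, amount INTEGER);
INSERT INTO orders VALUES
    (10, 'east', 500), (10, 'east', 250), (11, 'east', 100), (12, 'west', 400);
-- Stand-in for the materialization: one row per (region, customer_id).
CREATE TABLE orders_agg AS
SELECT region, customer_id, SUM(amount) AS total_spend, COUNT(*) AS order_count
FROM orders
GROUP BY region, customer_id;
""")

# The dashboard question, answered by scanning the raw table.
from_raw = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()

# The same question answered from the smaller precomputed table. Partial
# sums add up to the total, and partial counts are summed, not re-counted.
from_agg = conn.execute(
    "SELECT region, SUM(total_spend), SUM(order_count) FROM orders_agg "
    "GROUP BY region ORDER BY region"
).fetchall()

print(from_raw)
print(from_agg)
```

&lt;p&gt;Both queries return the same rows, but the second reads three precomputed rows instead of four raw ones. On large tables that gap is where the speedup comes from.&lt;/p&gt;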

&lt;h3&gt;
  
  
  Data Governance
&lt;/h3&gt;

&lt;p&gt;Dremio provides column-level access control and row-level filtering directly in the &lt;a href="https://www.dremio.com/platform/semantic-layer/" rel="noopener noreferrer"&gt;semantic layer&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a view that masks PII for non-privileged users&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders_masked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;is_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'finance_team'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;
         &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'***MASKED***'&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Governance policies defined in the semantic layer apply consistently regardless of which tool (BI dashboard, Python notebook, AI agent) queries the data. This approach is more maintainable than duplicating access policies in every consuming application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Federation
&lt;/h3&gt;

&lt;p&gt;One of Dremio's core capabilities is query federation: joining Iceberg tables with data that lives in other systems in a single SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Join Iceberg table with a PostgreSQL table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;postgres_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payments&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the need to move all data into Iceberg before you can query it. You can &lt;a href="https://www.dremio.com/blog/the-journey-from-scattered-data-to-an-apache-iceberg-lakehouse-with-governed-agentic-analytics/" rel="noopener noreferrer"&gt;start with federation and migrate incrementally&lt;/a&gt;. Federation is especially useful during &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;migration&lt;/a&gt;: query legacy systems and Iceberg tables side by side, then swap the underlying source when you are ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Essential SQL Operations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7q55ptko8wzth7jowpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7q55ptko8wzth7jowpa.png" alt="Four essential Iceberg SQL operations on Dremio: CREATE, COPY INTO, OPTIMIZE, and time travel" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Compact small files&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;REWRITE&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;BIN_PACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Compact with sorting for better file skipping&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;REWRITE&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;SORT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Expire old snapshots (Dremio uses VACUUM TABLE for snapshot expiry)&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;EXPIRE&lt;/span&gt; &lt;span class="n"&gt;SNAPSHOTS&lt;/span&gt; &lt;span class="n"&gt;OLDER_THAN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-04-01 00:00:00'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For tables managed by &lt;a href="https://www.dremio.com/platform/open-catalog/" rel="noopener noreferrer"&gt;Open Catalog&lt;/a&gt;, Dremio runs &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;automatic table optimization&lt;/a&gt; in the background, handling compaction, expiry, and orphan cleanup without user intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Travel
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query the table as of a specific timestamp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-01 00:00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Compare current data to a previous snapshot&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;current_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;old_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;growth&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;current_data&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 00:00:00'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;old_data&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;current_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;old_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Metadata Inspection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check table health&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_size_in_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1048576&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;-- Review recent snapshots&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  AI-Powered Analytics
&lt;/h2&gt;

&lt;p&gt;Dremio's built-in &lt;a href="https://www.dremio.com/platform/ai/" rel="noopener noreferrer"&gt;AI agent&lt;/a&gt; converts natural language questions into SQL queries using the semantic layer's wikis and tags as context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Show me the top 10 customers by total spend this quarter"&lt;/li&gt;
&lt;li&gt;"What was the month-over-month revenue growth by region?"&lt;/li&gt;
&lt;li&gt;"Which products had the highest return rate last month?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI agent generates standard SQL, meaning the results are transparent and auditable. Users can see exactly what SQL was generated, verify it, and refine it. This is different from black-box AI analytics tools that hide the underlying logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Server for External AI Agents
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.dremio.com/blog/getting-started-with-the-dremio-mcp-server/" rel="noopener noreferrer"&gt;MCP Server&lt;/a&gt; extends Dremio's data access to external AI agents and tools through the Model Context Protocol. LLMs running in Claude, ChatGPT, or custom agent frameworks can query your Iceberg lakehouse through MCP, inheriting all the governance, semantic context, and optimization that Dremio provides.&lt;/p&gt;

&lt;p&gt;This positions Dremio as the data layer for &lt;a href="https://www.dremio.com/platform/ai/" rel="noopener noreferrer"&gt;agentic AI&lt;/a&gt; workflows: a user asks a question in natural language, the agent turns it into SQL and submits it through the MCP Server, and Dremio applies its governance policies and returns the results from optimized Iceberg tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Part 15&lt;/a&gt; covers strategies for migrating existing data into Iceberg.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloud</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Approaches to Streaming Data into Apache Iceberg Tables</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 16:53:14 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/approaches-to-streaming-data-into-apache-iceberg-tables-27k5</link>
      <guid>https://dev.to/alexmercedcoder/approaches-to-streaming-data-into-apache-iceberg-tables-27k5</guid>
      <description>&lt;p&gt;This is Part 13 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Part 12&lt;/a&gt; covered Python and MPP engines. This article covers the three primary approaches to streaming data into Iceberg tables and the operational trade-offs each creates.&lt;/p&gt;

&lt;p&gt;Iceberg was designed for batch analytics, but most production data arrives continuously. Streaming ingestion bridges this gap by committing data to Iceberg tables at regular intervals. The challenge is that frequent commits create the &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;small file problem&lt;/a&gt;, and managing that trade-off between data freshness and table health is the central concern of streaming to Iceberg.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Three Streaming Architectures
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa63co0ch1wsa1zcap1zs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa63co0ch1wsa1zcap1zs.png" alt="Three approaches to streaming data into Iceberg: Spark, Flink, and Kafka Connect" width="760" height="760"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark Structured Streaming
&lt;/h3&gt;

&lt;p&gt;Spark Structured Streaming processes data in micro-batches and commits to Iceberg at configurable intervals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
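&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# "broker:9092" is a placeholder Kafka address; replace it with your own&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \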
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://checkpoint/events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processingTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60 seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each trigger creates a new Iceberg commit with the accumulated data. A 60-second trigger produces 1,440 commits per day, each adding a small number of files.&lt;/p&gt;
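&lt;p&gt;The commit arithmetic above is easy to check for other trigger intervals. The following is a minimal sketch in plain Python; the function names and the &lt;code&gt;files_per_commit&lt;/code&gt; figure are illustrative assumptions, not values taken from Spark or Iceberg, and it assumes every trigger produces a commit:&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commits_per_day(trigger_seconds):
    """Commits in 24 hours, assuming every trigger produces one commit."""
    return SECONDS_PER_DAY // trigger_seconds


def files_per_day(trigger_seconds, files_per_commit):
    """Data files written per day; files_per_commit is a hypothetical average."""
    return commits_per_day(trigger_seconds) * files_per_commit


if __name__ == "__main__":
    # Compare a few candidate trigger intervals (in seconds).
    for interval in (10, 60, 300):
        print(interval, commits_per_day(interval), files_per_day(interval, 4))
```

&lt;p&gt;Running the numbers for a few intervals before choosing a trigger gives a rough sense of how much compaction work each freshness target implies.&lt;/p&gt;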

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Seconds to minutes (configurable via trigger interval).&lt;br&gt;
&lt;strong&gt;Small file impact:&lt;/strong&gt; Moderate. Longer trigger intervals produce fewer, larger files.&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using Spark for batch processing who want to add near-real-time ingestion.&lt;/p&gt;
&lt;h3&gt;
  
  
  Apache Flink Iceberg Sink
&lt;/h3&gt;

&lt;p&gt;Flink processes events continuously and commits to Iceberg at checkpoint intervals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Flink SQL&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;kafka_source&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flink's checkpointing mechanism determines commit frequency. A 30-second checkpoint interval produces commits every 30 seconds with whatever data has accumulated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exactly-once semantics:&lt;/strong&gt; Flink's checkpoint mechanism provides exactly-once delivery guarantees to Iceberg. If a Flink job crashes, it recovers from its last checkpoint and replays any data that was not yet committed to Iceberg. This means no duplicate records and no data loss, which is critical for financial and transactional data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partitioned writes:&lt;/strong&gt; Flink can route events to partitions dynamically based on partition transforms. Combined with Iceberg's &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;hidden partitioning&lt;/a&gt;, this means streaming data lands in the correct partition directory automatically without any special logic in the streaming application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upserts and CDC:&lt;/strong&gt; Flink supports changelog streams (insert, update, delete operations) and can write them to Iceberg as equality deletes and data files. This enables CDC (change data capture) patterns where a database's transaction log is streamed directly into an Iceberg table, maintaining a near-real-time copy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Seconds (tied to checkpoint interval).&lt;br&gt;
&lt;strong&gt;Small file impact:&lt;/strong&gt; High. Frequent checkpoints produce many small files.&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams needing the lowest-latency streaming with exactly-once semantics and CDC support.&lt;/p&gt;
&lt;h3&gt;
  
  
  Kafka Connect Iceberg Sink
&lt;/h3&gt;

&lt;p&gt;The Iceberg Sink Connector reads directly from Kafka topics and writes to Iceberg tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"iceberg-sink"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.iceberg.connect.IcebergSinkConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"events"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"iceberg.catalog.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"iceberg.catalog.uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://catalog.example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"iceberg.tables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"analytics.events"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Minutes (Kafka Connect batches records before committing).&lt;br&gt;
&lt;strong&gt;Small file impact:&lt;/strong&gt; Lower than Spark/Flink because commits are less frequent.&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Organizations with existing Kafka infrastructure that want a managed connector approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Iceberg Sink Connector:&lt;/strong&gt; The community-maintained Iceberg Sink Connector for Kafka Connect supports schema evolution driven by the schemas of incoming records (for example, schemas stored in a Schema Registry), automatic table creation, and partition routing. It reads records from Kafka topics, buffers them, and commits to Iceberg at a configurable interval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational simplicity:&lt;/strong&gt; Kafka Connect is a managed framework. You deploy the connector configuration, and Kafka Connect handles scaling, offset management, and fault recovery. There is no custom application code to write or maintain. For organizations that already run Kafka Connect for other sinks (databases, search indexes), adding an Iceberg sink is straightforward.&lt;/p&gt;
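&lt;p&gt;Deploying the connector configuration usually means sending it to the Kafka Connect REST API with a &lt;code&gt;POST /connectors&lt;/code&gt; request. The sketch below builds that request with Python's standard library, reusing the connector definition shown earlier in this section. The worker address &lt;code&gt;http://localhost:8083&lt;/code&gt; and the helper name &lt;code&gt;build_create_request&lt;/code&gt; are assumptions for illustration:&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical Kafka Connect worker address; replace it with your own.
CONNECT_URL = "http://localhost:8083"

# Same connector definition as the JSON shown earlier in this section.
CONNECTOR = {
    "name": "iceberg-sink",
    "config": {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "events",
        "iceberg.catalog.type": "rest",
        "iceberg.catalog.uri": "https://catalog.example.com",
        "iceberg.tables": "analytics.events",
    },
}


def build_create_request(connect_url, connector):
    """Build (but do not send) the POST /connectors request."""
    return urllib.request.Request(
        url=connect_url.rstrip("/") + "/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_request(CONNECT_URL, CONNECTOR)
    # Inspect the request locally; nothing is sent over the network here.
    print(request.get_method(), request.full_url)
```

&lt;p&gt;Passing the resulting request to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; would submit it to a running worker. Keeping construction separate from sending makes the payload easy to review or test before it reaches a production cluster.&lt;/p&gt;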

&lt;h2&gt;
  
  
  The Streaming + Compaction Cycle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8fmqfnobfo7enu7r3s8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8fmqfnobfo7enu7r3s8.png" alt="Why streaming creates small files and how compaction fixes them in a continuous cycle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every streaming approach shares the same fundamental problem: frequent commits produce small files. The solution is to pair streaming ingestion with aggressive &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;compaction&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A typical production pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stream data in&lt;/strong&gt; via Flink or Spark with 60-second commit intervals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run compaction&lt;/strong&gt; every hour to merge small files from the last hour into optimally sized files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expire snapshots&lt;/strong&gt; daily to clean up the accumulated snapshot metadata&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;Dremio's automatic table optimization&lt;/a&gt; handles this compaction automatically for tables managed by Open Catalog. AWS &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;S3 Tables&lt;/a&gt; also provides built-in compaction for streaming workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency vs. Maintenance Trade-off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekzj0ci8c0isrbcjyk57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekzj0ci8c0isrbcjyk57.png" alt="The spectrum from real-time to batch showing how latency affects small file production" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t3xkylhvahd6pe6groe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t3xkylhvahd6pe6groe.png" alt="The Latency vs. Maintenance Trade-off" width="626" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key insight: you do not always need sub-second latency. Most dashboards refresh every 5-15 minutes. If your consumers can tolerate 5-minute data freshness, a 5-minute trigger interval commits ten times less often than a 30-second trigger, which means roughly 90% fewer small files and far less compaction overhead.&lt;/p&gt;
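
&lt;p&gt;The arithmetic behind this trade-off is easy to check. The sketch below assumes that the number of small files grows in proportion to the number of commits, which is a simplification (real file counts also depend on writer parallelism and partitioning). The function names are made up for illustration.&lt;/p&gt;

```python
def commits_per_day(interval_seconds):
    """Number of commits a continuously running job makes in 24 hours."""
    return 86_400 // interval_seconds


def reduction_percent(fast_interval, slow_interval):
    """Percent fewer commits when moving from fast_interval to slow_interval."""
    fast = commits_per_day(fast_interval)
    slow = commits_per_day(slow_interval)
    return 100 * (fast - slow) / fast


for seconds in (30, 60, 300):
    print(seconds, commits_per_day(seconds))

print(reduction_percent(30, 300))  # 90.0
print(reduction_percent(60, 300))  # 80.0
```

&lt;p&gt;Moving from a 30-second to a 5-minute interval cuts commits from 2,880 to 288 per day. Starting from a 60-second interval, the same change is an 80% reduction.&lt;/p&gt;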

&lt;h2&gt;
  
  
  Production Streaming Architecture
&lt;/h2&gt;

&lt;p&gt;A production streaming-to-Iceberg pipeline typically includes four components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Message queue&lt;/strong&gt; (Kafka, Kinesis, Pulsar): Buffers events from source systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream processor&lt;/strong&gt; (Flink, Spark Streaming): Transforms and writes to Iceberg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction service&lt;/strong&gt; (&lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;Dremio auto-optimization&lt;/a&gt;, Spark scheduled jobs): Merges small files on a recurring schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt; (&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;metadata tables&lt;/a&gt;): Tracks file counts, sizes, and commit frequency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most common mistake in streaming Iceberg architectures is deploying the stream processor without the compaction service. Without compaction, query performance degrades within days. Always deploy both together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o54cxtyw3gdvawnkp8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o54cxtyw3gdvawnkp8r.png" alt="Choosing the Right Approach" width="667" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring Streaming Health
&lt;/h3&gt;

&lt;p&gt;After deploying a streaming pipeline, monitor these metrics daily using &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;metadata tables&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Commit frequency:&lt;/strong&gt; How many snapshots are being created per hour?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average file size:&lt;/strong&gt; Is the small file problem growing?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction lag:&lt;/strong&gt; Are compaction jobs keeping up with the write rate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end latency:&lt;/strong&gt; How long between an event occurring and it being queryable in Iceberg?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A well-tuned streaming pipeline commits every 1-5 minutes, produces files of 32-128 MB per commit, and has compaction running every 30-60 minutes to consolidate the small files into 256 MB targets.&lt;/p&gt;
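
&lt;p&gt;As a rough sketch of the "average file size" check, the helper below summarizes a list of data file sizes such as the &lt;code&gt;file_size_in_bytes&lt;/code&gt; values from a table's &lt;code&gt;files&lt;/code&gt; metadata table. The function name, the 32 MB threshold, and the sample sizes are assumptions for illustration.&lt;/p&gt;

```python
MB = 1024 * 1024


def file_health(file_sizes_bytes, small_threshold_mb=32):
    """Summarize data file sizes taken from a table's files metadata."""
    count = len(file_sizes_bytes)
    if count == 0:
        return {"file_count": 0, "avg_mb": 0.0, "small_file_ratio": 0.0}
    threshold = small_threshold_mb * MB
    # A file counts as "small" when it is below the threshold.
    small = sum(1 for size in file_sizes_bytes if threshold > size)
    return {
        "file_count": count,
        "avg_mb": sum(file_sizes_bytes) / count / MB,
        "small_file_ratio": small / count,
    }


# Hypothetical sizes: three fresh streaming files and one compacted file.
sizes = [8 * MB, 16 * MB, 40 * MB, 256 * MB]
report = file_health(sizes)
print(report)
```

&lt;p&gt;A small-file ratio that keeps rising between compaction runs is a sign that compaction is falling behind the write rate.&lt;/p&gt;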

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Part 14&lt;/a&gt; provides a hands-on walkthrough of Iceberg on Dremio Cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Using Apache Iceberg with Python and MPP Query Engines</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 16:35:48 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/using-apache-iceberg-with-python-and-mpp-query-engines-1d0</link>
      <guid>https://dev.to/alexmercedcoder/using-apache-iceberg-with-python-and-mpp-query-engines-1d0</guid>
      <description>&lt;p&gt;This is Part 12 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Part 11&lt;/a&gt; covered metadata tables. This article covers the two main ways to access Iceberg data: directly from Python libraries and through MPP (massively parallel processing) query engines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Python Ecosystem for Iceberg
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftp65rv53zkbxhq47xswk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftp65rv53zkbxhq47xswk.png" alt="How Python libraries and MPP engines connect to Iceberg tables" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  PyIceberg: Native Python Access
&lt;/h3&gt;

&lt;p&gt;PyIceberg is the official Python library for Apache Iceberg. It reads Iceberg metadata directly and can scan data files without an external query engine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_catalog&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to a REST catalog
&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://catalog.example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Load and scan a table
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics.orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyn6npc16xnj86i5xtb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyn6npc16xnj86i5xtb8.png" alt="The five-step PyIceberg workflow from catalog connection to analysis" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PyIceberg leverages Iceberg's &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;metadata-driven pruning&lt;/a&gt;: the &lt;code&gt;row_filter&lt;/code&gt; is pushed down to manifest evaluation, so only relevant data files are read. For reading subsets of large tables into Python for analysis or ML training, this is remarkably efficient.&lt;/p&gt;
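
&lt;p&gt;To make the pruning idea concrete, here is a deliberately simplified model of how per-file min/max statistics let a reader skip files for a filter like &lt;code&gt;amount &amp;gt; 100&lt;/code&gt;. It is not PyIceberg's internal code; the &lt;code&gt;FileStats&lt;/code&gt; class, file names, and values are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_amount: float
    max_amount: float


def files_for_amount_greater_than(files, threshold):
    """Keep only files whose max value could satisfy `amount > threshold`."""
    return [f.path for f in files if f.max_amount > threshold]


# Column bounds as they might be recorded in a manifest for three data files.
manifest = [
    FileStats("orders-a.parquet", 5.0, 80.0),
    FileStats("orders-b.parquet", 90.0, 450.0),
    FileStats("orders-c.parquet", 120.0, 900.0),
]
to_read = files_for_amount_greater_than(manifest, 100)
print(to_read)
```

&lt;p&gt;The first file is skipped without being opened because its maximum &lt;code&gt;amount&lt;/code&gt; is 80. The second file is still read, since some of its rows may match, and the row-level filter is applied afterwards.&lt;/p&gt;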

&lt;p&gt;PyIceberg also supports writes (appending data from Arrow tables), schema evolution, and table management operations. It connects to any catalog that implements the REST protocol, including &lt;a href="https://www.dremio.com/platform/open-catalog/" rel="noopener noreferrer"&gt;Dremio Open Catalog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  DuckDB: SQL-Based Python Analysis
&lt;/h3&gt;

&lt;p&gt;DuckDB can read Iceberg tables through its Iceberg extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSTALL iceberg; LOAD iceberg;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT customer_id, SUM(amount) as total
    FROM iceberg_scan(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://warehouse/orders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
    GROUP BY customer_id
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchdf&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DuckDB processes the query locally using its columnar execution engine, which is significantly faster than pandas for analytical queries. It supports Iceberg's partition pruning and column statistics for file skipping. DuckDB runs entirely in-process, so there is no separate server to manage. This makes it a strong choice for local analysis, CI/CD data validation, and notebooks where starting a Spark cluster would be overkill.&lt;/p&gt;

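&lt;p&gt;DuckDB's Iceberg extension also exposes table functions such as &lt;code&gt;iceberg_snapshots&lt;/code&gt; and &lt;code&gt;iceberg_metadata&lt;/code&gt; for inspecting snapshots and manifest entries, which means you can use it for &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;table health diagnostics&lt;/a&gt; without standing up a full query engine.&lt;/p&gt;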

&lt;h3&gt;
  
  
  Polars: High-Performance DataFrames
&lt;/h3&gt;

&lt;p&gt;Polars can read Iceberg tables through its &lt;code&gt;scan_iceberg&lt;/code&gt; function, which accepts either a PyIceberg table object or a path to the table's metadata and provides lazy evaluation and parallel processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_iceberg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://warehouse/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polars uses a lazy evaluation model: the &lt;code&gt;scan_iceberg&lt;/code&gt; call does not read data immediately. Instead, it builds an execution plan. When &lt;code&gt;collect()&lt;/code&gt; is called, Polars optimizes the plan (predicate pushdown, column pruning, parallel reads) and executes it. For large Iceberg tables, Polars can scan data several times faster than pandas because it uses all available CPU cores and processes data in Apache Arrow columnar format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing from Python
&lt;/h3&gt;

&lt;p&gt;PyIceberg supports writes through Apache Arrow tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;

&lt;span class="c1"&gt;# Create an Arrow table with new data
&lt;/span&gt;&lt;span class="n"&gt;new_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1003&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;150.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;275.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;89.99&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-03-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-03-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-03-16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Append to the Iceberg table
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a new Iceberg &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;commit&lt;/a&gt; with the data files, manifests, and metadata. PyIceberg handles the entire write lifecycle, including partition assignment based on the table's &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;partition spec&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For small and medium batch writes from Python, PyIceberg with Arrow is often simpler than setting up Spark. However, PyIceberg runs on a single machine and the Arrow table has to fit in memory, so it is not suitable for writing terabyte-scale datasets. For that, use an MPP engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  MPP Query Engines
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14lyzx7w9f5kr31doaux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14lyzx7w9f5kr31doaux.png" alt="Comparison of MPP engines for Iceberg workloads showing read, write, and maintenance capabilities" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For production workloads at scale, Python libraries running on a single machine are not sufficient. MPP engines distribute query execution across multiple nodes, handling petabyte-scale tables with sub-minute response times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dremio
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; provides full Iceberg support with several unique capabilities: &lt;a href="https://www.dremio.com/platform/federation/" rel="noopener noreferrer"&gt;query federation&lt;/a&gt; across Iceberg and non-Iceberg sources, &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;automatic table optimization&lt;/a&gt; through Open Catalog, a &lt;a href="https://www.dremio.com/platform/semantic-layer/" rel="noopener noreferrer"&gt;semantic layer&lt;/a&gt; for governed access, and &lt;a href="https://www.dremio.com/platform/ai/" rel="noopener noreferrer"&gt;AI-powered analytics&lt;/a&gt; through its built-in agent and MCP server.&lt;/p&gt;

&lt;p&gt;For Python users, Dremio exposes data through Apache Arrow Flight, a high-performance data transfer protocol. Arrow Flight sends data in columnar Arrow format directly to the client, avoiding the row-by-row serialization overhead of JDBC/ODBC. For large result sets, this can be 10-100x faster than traditional database connectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dremio_simple_query&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DremioConnection&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DremioConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-dremio.cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM analytics.orders WHERE amount &amp;gt; 100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a pandas DataFrame populated via Arrow Flight. Because the data stays columnar end-to-end (Parquet files in Iceberg, Arrow buffers inside Dremio and over Arrow Flight, then a single conversion to pandas on the client), there is no row-by-row serialization bottleneck.&lt;/p&gt;

&lt;p&gt;Dremio also provides a &lt;a href="https://www.dremio.com/blog/dremios-columnar-cloud-cache-c3/" rel="noopener noreferrer"&gt;Columnar Cloud Cache&lt;/a&gt; that stores frequently accessed data on local NVMe drives, making subsequent queries against the same Iceberg data dramatically faster without requiring reflections or materialized views.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark
&lt;/h3&gt;

&lt;p&gt;Apache Spark is the most mature Iceberg engine for both reads and writes. It handles batch ETL, streaming ingestion (&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Part 13&lt;/a&gt;), and all &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;maintenance operations&lt;/a&gt;. Most Iceberg production pipelines use Spark for data ingestion because of its extensive connector ecosystem (Kafka, JDBC, file formats) and its ability to process large volumes across a distributed cluster.&lt;/p&gt;

&lt;p&gt;Spark supports all Iceberg operations: CREATE, INSERT, MERGE, DELETE, UPDATE, schema evolution, partition evolution, and every maintenance procedure (compaction, snapshot expiry, orphan cleanup).&lt;/p&gt;

&lt;h3&gt;
  
  
  Trino
&lt;/h3&gt;

&lt;p&gt;Trino (formerly PrestoSQL) is optimized for interactive, ad-hoc queries with low latency. It reads and writes Iceberg tables and supports the REST catalog protocol. Trino is popular for exploration and dashboarding workloads where sub-second response times matter and data is being read rather than written. Its architecture keeps no persistent state, making it easy to scale up and down based on query demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Engines
&lt;/h3&gt;

&lt;p&gt;Several other engines provide Iceberg support: Amazon Athena (serverless, AWS-native), Snowflake (read-only for external Iceberg tables), StarRocks (sub-second analytics), and Apache Doris (real-time analytics). The Iceberg community maintains a &lt;a href="https://iceberg.apache.org/multi-engine-support/" rel="noopener noreferrer"&gt;compatibility matrix&lt;/a&gt; showing which engines support which operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the Right Approach
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2xuol1ttm6qrz5npe3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2xuol1ttm6qrz5npe3h.png" alt="Choosing the Right Approach" width="492" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key takeaway: Python libraries (PyIceberg, DuckDB, Polars) are best for local analysis and development. MPP engines (Dremio, Spark, Trino) are necessary for production-scale analytics. Many teams use both: PyIceberg for data science experimentation, and Dremio for production dashboards and governed access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Part 13&lt;/a&gt; covers how to stream data into Iceberg tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Apache Iceberg Metadata Tables: Querying the Internals</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 15:45:10 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/apache-iceberg-metadata-tables-querying-the-internals-jgb</link>
      <guid>https://dev.to/alexmercedcoder/apache-iceberg-metadata-tables-querying-the-internals-jgb</guid>
      <description>&lt;p&gt;This is Part 11 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Part 10&lt;/a&gt; covered maintenance operations. This article covers the metadata tables that let you inspect Iceberg table internals using standard SQL.&lt;/p&gt;

&lt;p&gt;Iceberg exposes its internal metadata as queryable virtual tables. You can use them to check table health, debug performance issues, audit changes, and build monitoring dashboards. No special tools required, just SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Seven Metadata Tables
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczw4r6hczsi0zftnscuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczw4r6hczsi0zftnscuh.png" alt="The seven Iceberg metadata tables and what each reveals about your table" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Snapshots
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;$snapshots&lt;/code&gt; table lists every snapshot in the table's history. Each row represents a committed transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Dremio syntax&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;-- Spark syntax&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;snapshots&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key columns: &lt;code&gt;snapshot_id&lt;/code&gt;, &lt;code&gt;committed_at&lt;/code&gt;, &lt;code&gt;operation&lt;/code&gt; (append, overwrite, delete), &lt;code&gt;summary&lt;/code&gt; (files added/removed counts).&lt;/p&gt;

&lt;h3&gt;
  
  
  History
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;$history&lt;/code&gt; table shows the timeline of which snapshot was current at each point in time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Files
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;$files&lt;/code&gt; table lists every data file in the current snapshot with detailed statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_size_in_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the primary diagnostic table for checking &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;file sizes&lt;/a&gt; and identifying the small file problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manifests
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;$manifests&lt;/code&gt; table lists the manifest files for the current snapshot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;added_data_files_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existing_data_files_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_manifests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Partitions
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;$partitions&lt;/code&gt; table provides statistics per partition: row counts, file counts, and size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1hmfuikrc808eo5wgqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1hmfuikrc808eo5wgqq.png" alt="Three categories of metadata table use cases: monitoring, debugging, and auditing" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring: Average File Size
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_size_in_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1048576&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_file_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_size_in_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1048576&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_file_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_files&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;avg_file_mb&lt;/code&gt; drops below 64, schedule compaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging: Files Per Partition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitions with hundreds of files are compaction candidates. Use this query as a daily health check and pipe the results into your monitoring system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging: Sort Order Effectiveness
&lt;/h3&gt;

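&lt;p&gt;Column statistics in the files table reveal whether your sort order is effective. The &lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt; columns hold the minimum and maximum value of each column in each file. The Iceberg spec keys these maps by column field ID, so depending on your engine you may need to look up a bound by field ID rather than by column name:&lt;br&gt;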
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;lower_bounds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;upper_bounds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_customer_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the min/max ranges overlap heavily across files, the sort order has decayed and compaction with sorting (&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Part 10&lt;/a&gt;) will restore effectiveness.&lt;/p&gt;
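
&lt;p&gt;To turn "overlap heavily" into a number you can track, here is a minimal Python sketch. It assumes you have already pulled each file's min and max for the sort column into a list of pairs; the function name &lt;code&gt;overlap_fraction&lt;/code&gt; and the sample values are illustrative, not part of any Iceberg or Dremio API.&lt;/p&gt;

```python
from itertools import accumulate
from operator import le


def overlap_fraction(bounds):
    """Return the share of files whose range overlaps an earlier file's range.

    `bounds` is a list of (min_value, max_value) pairs, one per data file.
    Ranges that merely touch (one file's max equals another file's min)
    are counted as overlapping.
    """
    ordered = sorted(bounds)                    # sort files by lower bound
    lows = [low for low, _ in ordered[1:]]      # lower bound of every file but the first
    if not lows:
        return 0.0                              # zero or one file: nothing to compare
    # Highest upper bound seen so far, up to and including each file.
    running_max = accumulate((high for _, high in ordered), max)
    # File i+1 overlaps if it starts at or before the running max of files 0..i.
    overlapping = sum(map(le, lows, running_max))
    return overlapping / len(lows)


# Hypothetical customer_id ranges for three files each.
well_sorted = [(1, 100), (101, 200), (201, 300)]
decayed = [(1, 250), (40, 180), (90, 300)]

print(overlap_fraction(well_sorted))
print(overlap_fraction(decayed))
```

&lt;p&gt;A value near 0 suggests files cover distinct ranges and min/max pruning can skip most of them; a value near 1 suggests nearly every file has to be opened for a point lookup. Where you draw the line for triggering a sorted compaction is a judgment call for your workload.&lt;/p&gt;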

&lt;h3&gt;
  
  
  Monitoring: Commit Velocity
&lt;/h3&gt;

&lt;p&gt;Track how frequently the table is being written to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'added-data-files'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;files_added&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24'&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;High commit velocity (hundreds of commits per hour) indicates a &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;streaming workload&lt;/a&gt; that needs aggressive compaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auditing: Recent Changes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the last 10 commits, each with its operation type and a summary of how many files it added or removed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Travel
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy2rylqzbvha2fhyeeio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy2rylqzbvha2fhyeeio.png" alt="How snapshots enable querying the table at any point in its history" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metadata tables make time travel practical. Use the snapshot list to find the snapshot ID that was current at a specific point in time, then query the table at that snapshot (Dremio syntax shown):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query the table as it existed on February 15&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="n"&gt;SNAPSHOT&lt;/span&gt; &lt;span class="s1"&gt;'1234567890123456789'&lt;/span&gt;

&lt;span class="c1"&gt;-- Or by timestamp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-15 00:00:00'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time travel is useful for debugging data issues ("what did this table look like before yesterday's pipeline ran?"), auditing ("what was the account balance at end-of-quarter?"), and reproducible analysis ("run this report against last month's data").&lt;/p&gt;
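
&lt;p&gt;The lookup step, finding which snapshot was current at a given moment, can be sketched in a few lines of Python. This assumes you have loaded &lt;code&gt;committed_at&lt;/code&gt; and &lt;code&gt;snapshot_id&lt;/code&gt; from the snapshots table into a list of pairs; the helper name &lt;code&gt;snapshot_as_of&lt;/code&gt; and the sample IDs are made up for illustration.&lt;/p&gt;

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_as_of(snapshots, as_of):
    """Return the ID of the latest snapshot committed at or before `as_of`.

    `snapshots` is a list of (committed_at, snapshot_id) pairs.
    Returns None when the table had no snapshot yet at that time.
    """
    ordered = sorted(snapshots)
    commit_times = [committed_at for committed_at, _ in ordered]
    # Number of snapshots committed at or before the requested time.
    position = bisect_right(commit_times, as_of)
    if position == 0:
        return None
    return ordered[position - 1][1]


# Hypothetical snapshot list; the IDs and timestamps are invented.
snapshots = [
    (datetime(2024, 2, 14, 6, 0), 1001),
    (datetime(2024, 2, 14, 18, 0), 1002),
    (datetime(2024, 2, 15, 6, 0), 1003),
]

print(snapshot_as_of(snapshots, datetime(2024, 2, 15, 0, 0)))
```

&lt;p&gt;The returned ID is what you would pass to a snapshot-based time travel query. Engines that accept a timestamp directly do an equivalent lookup for you.&lt;/p&gt;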

&lt;h3&gt;
  
  
  Incremental Reads
&lt;/h3&gt;

&lt;p&gt;Metadata tables also enable incremental processing. By comparing two snapshots, you can identify which files were added between them and process only the new data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find files added since snapshot 1234567890&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="n"&gt;SNAPSHOT&lt;/span&gt; &lt;span class="s1"&gt;'1234567890'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern is the foundation for CDC (Change Data Capture) on Iceberg tables: read only what changed since the last processing run, rather than re-scanning the entire table.&lt;/p&gt;
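
&lt;p&gt;If you fetch the file lists for two snapshots into application code, the comparison is a plain set difference. The sketch below assumes each list holds &lt;code&gt;file_path&lt;/code&gt; values from the files table; the function names and the paths are hypothetical.&lt;/p&gt;

```python
def files_added(previous_files, current_files):
    """Return paths present in the current snapshot but not in the previous one."""
    return sorted(set(current_files) - set(previous_files))


def files_removed(previous_files, current_files):
    """Return paths present in the previous snapshot but gone from the current one."""
    return sorted(set(previous_files) - set(current_files))


# Hypothetical file lists for two snapshots of the same table.
previous = [
    "s3://example-bucket/orders/data/a.parquet",
    "s3://example-bucket/orders/data/b.parquet",
]
current = [
    "s3://example-bucket/orders/data/b.parquet",
    "s3://example-bucket/orders/data/c.parquet",
    "s3://example-bucket/orders/data/d.parquet",
]

print(files_added(previous, current))
print(files_removed(previous, current))
```

&lt;p&gt;Checking removals as well as additions matters because a compaction or overwrite between the two snapshots shows up as both: old files disappear and rewritten files appear, even though the logical rows may not have changed.&lt;/p&gt;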

&lt;h3&gt;
  
  
  Rollback
&lt;/h3&gt;

&lt;p&gt;If a bad write corrupts your table, use the snapshot list to roll back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find the last good snapshot&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;committed_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Roll back to it (Spark; the Iceberg docs call procedures as catalog_name.system.*)&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rollback_to_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1234567890&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rollback does not delete data. It simply moves the current snapshot pointer to an earlier snapshot, making the table appear as it was at that point. The data files written by the rolled-back snapshots remain in storage until snapshot expiration removes them, so the change can still be recovered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.dremio.com/cloud/sonar/query-manage/querying-metadata/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; supports the metadata table queries shown here through its TABLE() function syntax and provides time travel in both SQL and its semantic layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Health Dashboard
&lt;/h2&gt;

&lt;p&gt;Combine metadata table queries into a scheduled monitoring job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Table health summary&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_size_in_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1048576&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_manifests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;manifests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set alerts when the snapshot count exceeds 1,000, the average file size drops below 64 MB, or the manifest count exceeds 500.&lt;/p&gt;
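
&lt;p&gt;One way to apply those thresholds is to keep them as data and evaluate each result row against them. The sketch below is a hedged illustration in Python: it assumes the summary query's row arrives as a dictionary keyed by the column aliases above (&lt;code&gt;snapshots&lt;/code&gt;, &lt;code&gt;files&lt;/code&gt;, &lt;code&gt;avg_mb&lt;/code&gt;, &lt;code&gt;manifests&lt;/code&gt;), and the names &lt;code&gt;ALERT_RULES&lt;/code&gt; and &lt;code&gt;health_alerts&lt;/code&gt; are invented for this example.&lt;/p&gt;

```python
from operator import gt, lt

# Thresholds taken from the text; tune them for your own tables.
# Each rule: (column alias, comparison that signals a breach, threshold, message).
ALERT_RULES = [
    ("snapshots", gt, 1000, "snapshot count above 1,000"),
    ("avg_mb", lt, 64, "average file size below 64 MB"),
    ("manifests", gt, 500, "manifest count above 500"),
]


def health_alerts(metrics):
    """Return a message for every rule that the metrics row breaches."""
    return [
        message
        for column, breached, threshold, message in ALERT_RULES
        if breached(metrics[column], threshold)
    ]


# Hypothetical result row from the health summary query.
row = {"snapshots": 1450, "files": 9200, "avg_mb": 12.5, "manifests": 210}

for alert in health_alerts(row):
    print(alert)
```

&lt;p&gt;Keeping the rules in a list means adding a new check is a one-line change, and the same function can be run over the rows for many tables in a scheduled job.&lt;/p&gt;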

&lt;h3&gt;
  
  
  Engine Syntax Variations
&lt;/h3&gt;

&lt;p&gt;Different engines use different syntax for metadata tables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv232pj62lbi1dz7y1k23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv232pj62lbi1dz7y1k23.png" alt="Engine Syntax Variations" width="365" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The underlying data is identical; only the SQL syntax differs. Regardless of which engine you use, these metadata tables are the key diagnostic tool for understanding and maintaining Iceberg table health.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automating Decisions with Metadata
&lt;/h3&gt;

&lt;p&gt;You can use metadata table queries to drive automated maintenance decisions. For example, a scheduler can check whether compaction is needed before running it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Only compact if average file size is below threshold&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_size_in_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1048576&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'COMPACT_NEEDED'&lt;/span&gt;
  &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'HEALTHY'&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids running compaction on tables that are already well-organized, saving compute costs and preventing unnecessary data rewrites.&lt;/p&gt;
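
&lt;p&gt;The same pattern extends to the other alert thresholds. As a minimal sketch, assuming the &lt;code&gt;table_manifests&lt;/code&gt; table function used earlier and the 500-manifest alert threshold (the status labels are illustrative), a scheduler could gate manifest rewriting in the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Only rewrite manifests if the manifest count is above threshold
-- Assumption: 500 is the alert threshold suggested earlier; tune per table
SELECT CASE
  WHEN COUNT(*) &amp;gt; 500 THEN 'REWRITE_MANIFESTS_NEEDED'
  ELSE 'HEALTHY'
END AS manifest_status
FROM TABLE(table_manifests('analytics.orders'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;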

&lt;p&gt;For production environments, integrate these checks into your orchestration tool (such as Airflow, Dagster, or Prefect). Schedule a daily metadata scan across all tables, collect the health metrics, and trigger maintenance jobs only for the tables that need them. This approach scales to hundreds of tables without manual oversight. &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;Dremio's autonomous optimization&lt;/a&gt; automates this entire workflow for tables managed by Open Catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Part 12&lt;/a&gt; covers using Iceberg from Python and MPP query engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 22 May 2026 15:33:50 +0000</pubDate>
      <link>https://dev.to/alexmercedcoder/maintaining-apache-iceberg-tables-compaction-expiry-and-cleanup-12en</link>
      <guid>https://dev.to/alexmercedcoder/maintaining-apache-iceberg-tables-compaction-expiry-and-cleanup-12en</guid>
      <description>&lt;p&gt;This is Part 10 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;Part 9&lt;/a&gt; covered how tables degrade. This article covers the four maintenance operations that keep Iceberg tables healthy and the three approaches to running them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Four Maintenance Operations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhve8wtydyv40g5wkqis3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhve8wtydyv40g5wkqis3.png" alt="The four Iceberg maintenance operations: compaction, snapshot expiry, orphan cleanup, and manifest rewriting" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Compaction (File Rewriting)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd3kjkvucr69s0jfh30b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd3kjkvucr69s0jfh30b.png" alt="Compaction merging 500 small files into 2 large files with identical data" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compaction reads small files, merges them into optimally sized files (typically 128-512 MB), and optionally re-sorts the data. It is the most impactful maintenance operation because it directly addresses the &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;small file problem&lt;/a&gt; and restores sort order effectiveness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Spark:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_data_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In &lt;a href="https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;REWRITE&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;BIN_PACK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compaction with sorting rewrites files so that rows are ordered by the chosen columns, which tightens each file's min/max statistics and makes &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;file skipping&lt;/a&gt; far more effective. In Dremio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;REWRITE&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;SORT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
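
&lt;p&gt;The target file size can also be set explicitly instead of relying on defaults. A minimal sketch for Spark, assuming the procedure's &lt;code&gt;options&lt;/code&gt; map with the Iceberg rewrite options &lt;code&gt;target-file-size-bytes&lt;/code&gt; and &lt;code&gt;min-input-files&lt;/code&gt;; the 256 MB target and the 5-file minimum are illustrative values, not recommendations from this series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Spark bin-pack compaction with an explicit target size
-- Assumption: 268435456 bytes = 256 MB, inside the 128-512 MB range above
CALL system.rewrite_data_files(
  table =&amp;gt; 'analytics.orders',
  strategy =&amp;gt; 'binpack',
  options =&amp;gt; map(
    'target-file-size-bytes', '268435456',
    'min-input-files', '5'
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;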



&lt;h3&gt;
  
  
  2. Snapshot Expiry
&lt;/h3&gt;

&lt;p&gt;Snapshot expiry removes old snapshots from the metadata. After expiry, the snapshot and its exclusive data files are eligible for cleanup. You typically retain snapshots for a window (e.g., 7 days) to support time travel, then expire everything older.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Spark&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expire_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-04-22 00:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Dremio&lt;/span&gt;
&lt;span class="n"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;EXPIRE&lt;/span&gt; &lt;span class="n"&gt;SNAPSHOTS&lt;/span&gt; &lt;span class="n"&gt;OLDER_THAN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-04-22 00:00:00.000'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
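
&lt;p&gt;A fixed cutoff can remove more history than intended on a table that commits rarely. As a sketch, assuming the Spark procedure's named arguments &lt;code&gt;older_than&lt;/code&gt; and &lt;code&gt;retain_last&lt;/code&gt; (the value 10 is illustrative), you can keep a minimum number of snapshots regardless of their age:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Spark: expire by age but always keep the 10 most recent snapshots
-- Assumption: a retain_last value of 10 is illustrative
CALL system.expire_snapshots(
  table =&amp;gt; 'analytics.orders',
  older_than =&amp;gt; TIMESTAMP '2024-04-22 00:00:00',
  retain_last =&amp;gt; 10
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;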



&lt;h3&gt;
  
  
  3. Orphan File Cleanup
&lt;/h3&gt;

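&lt;p&gt;Orphan files are files in the table's storage location that no snapshot references. They are typically left behind by failed or interrupted writes, or by expiry runs that dropped snapshot metadata without deleting the underlying files. Orphan cleanup scans the storage directory, compares the files it finds against the current metadata, and deletes those that no snapshot references.&lt;br&gt;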
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Spark&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove_orphan_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation should run after snapshot expiry and with a safety delay (e.g., only files older than 3 days) to avoid deleting files from in-progress writes.&lt;/p&gt;

&lt;p&gt;Files written by a job that has not committed yet are not referenced by any snapshot, so an overly aggressive cleanup can delete them out from under a long-running write. A 3-day safety window gives such writes time to commit before their files are treated as orphans.&lt;/p&gt;
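
&lt;p&gt;The safety window can be expressed directly in the call. A minimal sketch, assuming the Spark procedure's &lt;code&gt;older_than&lt;/code&gt; and &lt;code&gt;dry_run&lt;/code&gt; arguments; the cutoff below is simply 3 days before the expiry timestamp used earlier, and the dry run lists candidate files without deleting them so the result can be reviewed first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Spark: preview orphan files older than the 3-day safety window
-- Assumption: the cutoff is illustrative; drop dry_run to actually delete
CALL system.remove_orphan_files(
  table =&amp;gt; 'analytics.orders',
  older_than =&amp;gt; TIMESTAMP '2024-04-19 00:00:00',
  dry_run =&amp;gt; true
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;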

&lt;h3&gt;
  
  
  4. Manifest Rewriting
&lt;/h3&gt;

&lt;p&gt;Over many commits, manifests accumulate. A single snapshot's manifest list might reference hundreds of small manifests from individual commits. Manifest rewriting consolidates them into fewer, larger manifests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Spark&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_manifests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This speeds up scan planning because the engine reads fewer manifest files. Each manifest file requires a separate I/O operation to read, so reducing the count from 500 to 20 eliminates 480 I/O round trips during query planning.&lt;/p&gt;
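
&lt;p&gt;To judge whether a rewrite is worthwhile, count the manifests before and after. A sketch in Spark, assuming the table's &lt;code&gt;manifests&lt;/code&gt; metadata table is queryable as &lt;code&gt;analytics.orders.manifests&lt;/code&gt; under the current catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Spark: number of manifests referenced by the current snapshot
-- Run once before and once after rewrite_manifests to compare
SELECT COUNT(*) AS manifest_count
FROM analytics.orders.manifests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;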

&lt;h3&gt;
  
  
  Sort-Order Compaction
&lt;/h3&gt;

&lt;p&gt;Standard compaction (BIN_PACK) merges small files without changing the data order. Sort-order compaction rewrites files with data sorted by specified columns, which tightens the min/max statistics and makes &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;file skipping&lt;/a&gt; more effective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Dremio sort-order compaction&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;REWRITE&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;SORT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Spark sort-order compaction&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_data_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'analytics.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'sort'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sort_order&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'order_date ASC NULLS LAST, customer_id ASC NULLS LAST'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sort-order compaction is more expensive than BIN_PACK because it reads, sorts, and rewrites all data. However, the payoff for queries that filter on the sorted columns is often substantial: when the sort columns match common query filters, file skipping can eliminate 90% or more of the data files from a scan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Retention Policies
&lt;/h3&gt;

&lt;p&gt;Decide how long to keep historical data accessible through time travel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqxm6xmxtrjq0i9hubhz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqxm6xmxtrjq0i9hubhz.png" alt="Data Retention Policies" width="568" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Longer retention means more snapshots, more metadata, and more storage consumed by old data files. Shorter retention reduces costs but limits time travel capabilities.&lt;/p&gt;
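
&lt;p&gt;Retention can also be recorded on the table itself. A minimal sketch in Spark, assuming the Iceberg table properties &lt;code&gt;history.expire.max-snapshot-age-ms&lt;/code&gt; and &lt;code&gt;history.expire.min-snapshots-to-keep&lt;/code&gt;, which snapshot expiry uses as defaults when no explicit cutoff is passed; the 7-day and 10-snapshot values are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Spark: default retention of 7 days (7 * 24 * 60 * 60 * 1000 = 604800000 ms)
-- Assumption: values are illustrative; derive them from your retention policy
ALTER TABLE analytics.orders SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000',
  'history.expire.min-snapshots-to-keep' = '10'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;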

&lt;h2&gt;
  
  
  Three Approaches to Maintenance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l4tvdgzb4v7m6zokysr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l4tvdgzb4v7m6zokysr.png" alt="Comparison of automated versus manual maintenance approaches" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual (Scheduled Jobs)
&lt;/h3&gt;

&lt;p&gt;Run maintenance operations on a schedule using Spark, Trino, or Dremio. A typical pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run compaction daily for heavily written tables&lt;/li&gt;
&lt;li&gt;Expire snapshots older than 7 days&lt;/li&gt;
&lt;li&gt;Remove orphan files older than 3 days&lt;/li&gt;
&lt;li&gt;Rewrite manifests monthly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Full control over timing and configuration. &lt;strong&gt;Cons:&lt;/strong&gt; Requires operational effort; forgotten or broken jobs lead to degradation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semi-Automated (Scheduled with Monitoring)
&lt;/h3&gt;

&lt;p&gt;Build a monitoring layer that checks table health metrics (&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;Part 9&lt;/a&gt; diagnostics) and triggers maintenance only when thresholds are exceeded (e.g., average file size drops below 64 MB).&lt;/p&gt;
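
&lt;p&gt;As a sketch of such a check in Spark, assuming the &lt;code&gt;files&lt;/code&gt; metadata table is queryable as &lt;code&gt;analytics.orders.files&lt;/code&gt; and using the 64 MB threshold mentioned above (the status labels are illustrative), a monitoring job could run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Spark: flag the table for compaction when the average data file is small
-- Assumption: 64 MB threshold from the text; 1048576 bytes = 1 MB
SELECT
  COUNT(*) AS data_file_count,
  AVG(file_size_in_bytes) / 1048576 AS avg_file_mb,
  CASE
    WHEN AVG(file_size_in_bytes) / 1048576 &amp;lt; 64 THEN 'COMPACT_NEEDED'
    ELSE 'HEALTHY'
  END AS table_status
FROM analytics.orders.files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The scheduler then submits a compaction job only when &lt;code&gt;table_status&lt;/code&gt; comes back as &lt;code&gt;COMPACT_NEEDED&lt;/code&gt;.&lt;/p&gt;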

&lt;h3&gt;
  
  
  Fully Automated
&lt;/h3&gt;

&lt;p&gt;Use a platform that handles maintenance autonomously. &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;Dremio's automatic table optimization&lt;/a&gt; runs compaction, expiry, and cleanup for tables managed by Open Catalog without any user configuration. AWS &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;S3 Tables&lt;/a&gt; provides built-in compaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi2pm42ypemost169549.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi2pm42ypemost169549.png" alt="Fully Automated" width="594" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Maintenance Schedule
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dvr3pkny120tpzkprtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dvr3pkny120tpzkprtn.png" alt="Recommended Maintenance Schedule" width="668" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For most teams, starting with &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Dremio's autonomous optimization&lt;/a&gt; and only adding manual jobs for tables with unusual requirements is the most practical approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Maintenance Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Running compaction during peak query hours:&lt;/strong&gt; Compaction reads and rewrites data files, which competes with analytical queries for I/O bandwidth. Schedule compaction during off-peak hours, or use a separate compute cluster (for example, Spark on EMR) that does not share resources with your query engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expiring snapshots too aggressively:&lt;/strong&gt; If you expire snapshots while a long-running query is using one of them, the query can fail because the data files it needs might be cleaned up. Always keep snapshots for at least as long as your longest-running query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting orphan cleanup:&lt;/strong&gt; Many teams run compaction and snapshot expiry but forget orphan cleanup. Without it, unreferenced files left behind by failed writes or incomplete expiry runs accumulate indefinitely. Set up orphan cleanup as a weekly job with a 3-day safety window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not monitoring after migration:&lt;/strong&gt; Tables migrated from Hive or other formats (&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Part 15&lt;/a&gt;) often inherit poor file layouts. Run an immediate compaction pass after any in-place migration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Part 11&lt;/a&gt; covers how to query the metadata tables that power diagnostics.&lt;/p&gt;


</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
