Alex Merced

Posted on Jun 9

Apache Iceberg v4: The Current State, the Proposals, and Why They Matter

A few years ago the question about Apache Iceberg was whether open table formats could replace proprietary warehouses. That question is closed. Iceberg won. The new question is sharper and more interesting. What do we do with it next?

That is the question driving Iceberg v4.

At Iceberg Summit 2026 in San Francisco, more than 600 people gathered for two days and over 70 sessions. Not one talk tried to convince the room to adopt Iceberg. Every session assumed you already run it in production. The energy went somewhere else. It went to the limitations that success created, and to the spec changes that fix them.

This post walks through the state of v4 as of June 2026. It covers each major proposal, how the proposal works at a technical level, and why it matters for the people who run Iceberg at scale. It also covers the live debates, since v4 is not finished and the arguments on the dev list tell you as much as the design documents do.

Where v4 stands today

Start with the honest status. Iceberg v4 is not released. It is not finalized. It exists as design documents, GitHub issues, Iceberg Enhancement Proposals, and long threads on the dev mailing list. The current stable release is 1.10.0 from September 2025, and that release sits firmly in the v3 era.

The practical guidance has not changed. Treat v3 as the production target. Treat v4 as the horizon worth watching. Build on what is stable and tested rather than waiting on features that have no committed ship date.

That said, v4 is no longer a vague wish list. The Summit made that clear. The proposals presented there were not academic. They were direct answers to operational pain that real teams hit at scale. And the people shaping them are the people who feel that pain most. Engineers from Google, Apple, Snowflake, Databricks, Microsoft, Netflix, and LinkedIn sit in the same design discussions and review the same pull requests. That is part of why the community trusts the direction even before the vote.

You can already see fragments of v4 leaking into the official spec text. The spec now describes behavior for "v4 and later" when it talks about how a table location is handled. That is a small detail, but it signals that the spec authors have started writing v4 semantics into the document itself.

How Iceberg metadata works today, in plain terms

To understand the proposals, you need a quick mental model of how Iceberg tracks a table right now. Skip this section if you already know it cold.

Iceberg replaced the old Hive approach of tracking data by directory. Hive mapped each partition to a folder and treated every file in that folder as part of the table. That worked on HDFS where directory listings were fast. It broke on object storage like S3, where listing millions of files across nested partitions got slow and expensive, and where request-rate throttling caused real outages.

Iceberg fixed this by tracking individual files through a tree of metadata. The tree has a few layers.

Data files hold the actual rows, usually in Parquet. Manifest files list groups of data files along with per-file statistics like row counts and the min and max value of each column. A manifest list collects all the manifests that make up one snapshot. A metadata file, written as JSON, points to the current snapshot and stores table-level details like schema, partition spec, sort orders, and snapshot history.

Every commit produces a new immutable snapshot. Readers get a consistent point-in-time view. Writers add data through atomic swaps of the metadata pointer. This is what gives Iceberg time travel, rollback, and snapshot isolation on cheap object storage.
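The atomic swap is easier to see in code. Here is a minimal Python sketch of the compare-and-swap idea behind a commit. The `ToyCatalog` class, its method names, and the file names are hypothetical and exist only for illustration. A real catalog does this against a database or a REST service, not an in-memory lock.

```python
import threading


class ToyCatalog:
    """Hypothetical catalog that holds one table's current metadata pointer."""

    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()

    def current(self):
        return self._pointer

    def compare_and_swap(self, expected, new):
        """Swap the pointer only if nobody committed since `expected` was read."""
        with self._lock:
            if self._pointer != expected:
                return False  # someone else committed first; caller must retry
            self._pointer = new
            return True


catalog = ToyCatalog("metadata/v1.json")

# Two writers read the same base version before either commits.
base_a = catalog.current()
base_b = catalog.current()

ok_a = catalog.compare_and_swap(base_a, "metadata/v2-a.json")  # succeeds
ok_b = catalog.compare_and_swap(base_b, "metadata/v2-b.json")  # stale base, rejected

print(ok_a, ok_b, catalog.current())
```

The second writer loses the race and has to rebase on the new pointer before trying again. Readers that already hold `metadata/v1.json` keep a consistent view the whole time, since nothing they point at was modified.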

The payoff of this tree shows up at query time. An engine reads the metadata, checks the per-file statistics, and skips any file whose min and max values cannot match the query filter. It does this without listing directories or opening data files. Scan planning becomes a metadata lookup rather than a full scan of the storage layout. A single table can hold tens of petabytes, and an engine can still plan a query against it quickly, since it reads metadata instead of crawling files. That property is the core architectural advantage of Iceberg, and every v4 proposal is careful to protect it.
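A rough sketch of that pruning step, in Python. The `DataFileEntry` class, its fields, and the sample paths are hypothetical stand-ins for manifest entries, not the Iceberg API. The point is only the overlap check.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """Hypothetical stand-in for one manifest entry (not the Iceberg API)."""
    path: str
    lower: int  # min value of the filtered column in this file
    upper: int  # max value of the filtered column in this file


def prune(entries, lo, hi):
    """Keep only files whose [lower, upper] range can overlap [lo, hi]."""
    return [e for e in entries if e.upper >= lo and e.lower <= hi]


entries = [
    DataFileEntry("data/a.parquet", 1, 100),
    DataFileEntry("data/b.parquet", 101, 200),
    DataFileEntry("data/c.parquet", 201, 300),
]

# Filter: WHERE id BETWEEN 150 AND 180
survivors = prune(entries, 150, 180)
print([e.path for e in survivors])  # ['data/b.parquet']
```

Two of the three files are skipped without a single read against the data itself. Planning touched only the statistics.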

The spec has grown in clear stages. V1 set the foundation with immutable data files, snapshots, hidden partitioning, and safe schema evolution. V2 added delete files, which let engines mark rows for removal without rewriting whole data files. That made row-level updates and merge-on-read practical, and it powered change data capture and GDPR deletions. V3, shipped across the 1.8 through 1.10 releases in 2025, added binary deletion vectors, the variant type for semi-structured data, native geometry and geography types, nanosecond timestamps, row lineage, default column values, multi-argument partition transforms, and table encryption keys.

Each version solved real problems. And each version exposed the next set of problems. That brings us to v4.

The pattern behind v4 is consistent. Iceberg was built for large, slow-moving analytical tables. The workloads people run on it now are anything but slow-moving. Streaming pipelines commit every few seconds. Machine learning feature tables carry thousands of columns. Disaster recovery plans demand that a table can move between buckets and regions. The metadata design that served batch analytics well becomes the bottleneck under these new patterns. V4 attacks that bottleneck from several angles at once.

Proposal one: adaptive metadata trees and single-file commits

This is the headline proposal, and the most ambitious one.

Look at what a commit costs today. Even a tiny write produces a new metadata.json, a new manifest list, and one or more new manifest files. The change might touch one data file. The metadata work touches several files anyway. This is write amplification, and it shows up as commit latency.

For a batch job that runs once an hour, the cost is invisible. For a streaming job that commits every few seconds, the cost is fatal. The metadata writing dominates, the small files pile up, and object storage starts throttling requests against the shared prefix. Delete operations make it worse. Under copy-on-write, a delete can trigger a full manifest rewrite. Caching manifests across commits gets hard, since the files keep getting replaced.

The v4 answer is a restructured metadata tree built around a Root Manifest. The Root Manifest replaces the old manifest list and serves as the single entry point for each snapshot. The hierarchy collapses into a clean two-level shape.

Root Manifest -> Data Manifests / Delete Manifests / Files

The key behavior is that a commit modifies only what changed. Metadata growth becomes proportional to the size of the operation, not the size of the table. A one-file write produces a one-file change. The benefits land immediately. Commits get faster. Rewrites get rarer. Query planning improves too, since the Root Manifest can aggregate file-level metrics from its children, which lets engines prune earlier in planning.

The word "adaptive" is the important part of the design. The proposal does not force every write to be a single-file commit. Small writes can be inlined directly into the root for low latency. As the root fills up, background maintenance rebalances entries down into leaf manifests. Writers can also choose to pay the rebalancing cost at a moment that suits them. The structure adapts to the workload. A streaming table keeps its hot writes near the root for speed. A batch table behaves more like the classic layout. One spec, two operating modes, chosen by the shape of the work.
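Here is a toy Python model of that behavior. Everything in it is an assumption made for illustration: the `AdaptiveRoot` class, the `max_inline` threshold, and the flush-everything policy are not taken from the proposal text, which is still settling those details.

```python
class AdaptiveRoot:
    """Toy model of a root that inlines small writes, then rebalances.

    Hypothetical: the class name, the max_inline threshold, and the flush
    policy are illustrative assumptions, not part of the v4 proposal.
    """

    def __init__(self, max_inline=4):
        self.max_inline = max_inline
        self.inlined = []         # file entries stored directly in the root
        self.leaf_manifests = []  # each leaf is a list of file entries

    def commit(self, data_file):
        """A small write touches only the root."""
        self.inlined.append(data_file)

    def rebalance(self):
        """Maintenance step: once the root is full, push entries into a leaf."""
        if len(self.inlined) >= self.max_inline:
            self.leaf_manifests.append(self.inlined)
            self.inlined = []

    def plan(self):
        """A reader sees inlined entries plus everything in the leaves."""
        return self.inlined + [f for leaf in self.leaf_manifests for f in leaf]


root = AdaptiveRoot(max_inline=4)
for i in range(6):
    root.commit(f"data/file-{i}.parquet")
    root.rebalance()

print(len(root.inlined), len(root.leaf_manifests), len(root.plan()))  # 2 1 6
```

Six small commits leave two entries inline and one leaf manifest holding four. A larger `max_inline` favors write latency. A smaller one behaves more like the classic layout.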

This is the proposal that enables low-latency writes without giving up read performance on huge tables. That combination is the whole point. Streaming wants fast small commits. Analytics wants fast pruning over petabytes. The adaptive tree tries to serve both from one structure.

Make this concrete. Picture a Flink job pulling from Kafka and committing to an Iceberg table every five seconds. Under v3, each of those commits writes a fresh metadata.json, a fresh manifest list, and at least one new manifest, even when the commit added a single small data file. Over an hour that is more than 700 commits, each one multiplying a tiny data write into several metadata writes. The small files pile up against one storage prefix and trigger throttling. Teams work around this with frequent compaction jobs that fight the ingestion they are trying to support. Under the v4 adaptive tree, those same 700 commits inline their tiny changes near the root and rebalance in the background. The write path stops multiplying. The compaction pressure drops. The streaming job and the table maintenance stop competing.
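The arithmetic behind that example is short. This sketch assumes the minimum described above, one metadata.json, one manifest list, and one manifest per commit. Real jobs can write more.

```python
COMMIT_INTERVAL_SECONDS = 5
SECONDS_PER_HOUR = 3600

# Assumption: the minimum per v3 commit described above
# (one metadata.json, one manifest list, one manifest file).
METADATA_FILES_PER_COMMIT = 3

commits_per_hour = SECONDS_PER_HOUR // COMMIT_INTERVAL_SECONDS
metadata_files_per_hour = commits_per_hour * METADATA_FILES_PER_COMMIT

print(commits_per_hour, metadata_files_per_hour)  # 720 2160
```

That is at least 2,160 metadata objects an hour to record 720 small data files, before any compaction runs.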

Now the honest part. This proposal carries the liveliest debate on the dev list, and the questions are good ones.

If small commit entries get inlined into the root, then a reader has to scan those inlined entries to plan a query. People asked whether the spec accepts a linear scan cost as the price of write throughput, or whether there is a pre-index mechanism that avoids decoding data pages for every sub-second query. People also asked about the catalog. A REST catalog under high concurrency might have to perform partial Parquet decodes on hundreds of inlined entries per request. That risks turning the catalog into a mini query engine just to do basic partition pruning. And there is a circular risk. If the fix for scan cost is to flush entries to leaf manifests more often, then you reintroduce the frequent-small-file problem and the object storage throttling that single-file commits were meant to solve in the first place.

None of these questions are fatal. They are the normal tension between write speed and read speed, played out in a new structure. But they are the reason v4 is still a proposal and not a release. The community is working through the amortized cost analysis. How big should the root buffer be? How often should rebalancing run? How do different workloads at different scale factors change the answer? These are the details that get settled over months before a vote.

Proposal two: storing metadata in Parquet instead of Avro

Since the early versions, Iceberg has stored its metadata files in Apache Avro. Avro is row-based. That choice was sensible when manifests were small and engines read them as whole records.

Tables grew. Manifests grew with them. A wide table can carry hundreds of columns, and each manifest entry then carries hundreds of per-column statistics. The problem is that Avro forces an engine to deserialize an entire record even when it needs only a sliver of it. During query planning, an engine often wants just the file path and the min and max of a single column. With Avro it pays to read everything.

The v4 proposal moves metadata to a columnar format using Apache Parquet. This is the same format that already stores the data in most Iceberg tables. The win is direct. An engine can read only the columns of metadata it needs. Column pruning and predicate pushdown, the same tricks that make Parquet fast for data, now apply to metadata queries too. Memory use drops. Planning gets faster on wide tables.
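A toy comparison shows the difference in work. This is plain Python that counts values touched. It is not a real Avro or Parquet reader, and the field names are hypothetical.

```python
# Three hypothetical manifest entries, each with a path and per-column bounds.
rows = [
    {"path": "a.parquet", "min_id": 1, "max_id": 100, "min_ts": 10, "max_ts": 19},
    {"path": "b.parquet", "min_id": 101, "max_id": 200, "min_ts": 20, "max_ts": 29},
    {"path": "c.parquet", "min_id": 201, "max_id": 300, "min_ts": 30, "max_ts": 39},
]

# Row-oriented (Avro-like): every field of every record gets decoded.
row_values_decoded = sum(len(r) for r in rows)

# Column-oriented (Parquet-like): the same data pivoted into columns.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Planning a filter on `id` needs only three of the five columns.
needed = ["path", "min_id", "max_id"]
projected = {name: columns[name] for name in needed}
col_values_decoded = sum(len(values) for values in projected.values())

print(row_values_decoded, col_values_decoded)  # 15 9
```

With five fields the saving is modest. With hundreds of per-column statistics per entry, the projected read stays the same size while the row-oriented read keeps growing.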

There is a pleasing symmetry here. Metadata storage starts to look like data storage. The same engine machinery that scans Parquet data files can scan Parquet metadata files. And this proposal pairs naturally with the adaptive metadata tree. As the metadata gets richer and more expressive, columnar reads keep planning fast. You get more detail in the metadata without paying to read all of it on every query.

The change does raise a compatibility question that the community has to handle with care. Every existing engine reads Avro metadata today. A move to Parquet metadata means every reader and writer needs to learn the new format, and tables written under v4 with Parquet metadata will not open in an engine that only knows the older layout. This is the normal cost of a format version bump, and it is why v4 is a new spec version rather than a patch. Engines will add v4 support over a period of months, the same way v3 support rolled out across the ecosystem during 2025. The reward is worth the transition. Metadata reads stop being a tax that grows with table width.

The dependency runs both ways with the column statistics rework, which is the next proposal. Columnar metadata is the container. Better-typed statistics are part of what fills it.

Proposal three: reworking column statistics into first-class data

This proposal sounds small. It is not. It quietly opens the door to a class of workloads Iceberg was never designed for.

Look at how stats work today. For each column, Iceberg stores lower and upper bounds, null counts, and value sizes as a generic map from a field ID to a value. The map is flexible, and it functions, but it has three weaknesses. It is inefficient for wide tables, since you carry a big map per file. It loses type information during serialization, so an engine cannot always trust the physical and logical type of a bound. And it makes it hard to project only the specific stats you want, since the map is opaque.

The v4 proposal introduces a typed, structured representation of column statistics. Each field's stats get stored with their logical and physical types preserved. That makes them reliable across schema evolution, where types and IDs shift over a table's life. Engines can read individual stats, like just the lower bounds for three columns, without loading the whole stats payload into memory.
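The contrast can be sketched in Python. Both shapes below are simplified illustrations. The `FieldStats` class and its field names are hypothetical and do not come from the proposal.

```python
from dataclasses import dataclass
from typing import Any

# Roughly today's shape: generic maps keyed by field ID. Bounds are opaque
# bytes, so the reader must already know each field's type to decode them.
generic_stats = {
    "lower_bounds": {1: (5).to_bytes(4, "little"), 2: b"apple"},
    "upper_bounds": {1: (90).to_bytes(4, "little"), 2: b"pear"},
    "null_counts": {1: 0, 2: 3},
}


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical typed stats record; names are illustrative only."""
    field_id: int
    logical_type: str
    lower: Any
    upper: Any
    null_count: int


typed_stats = {
    1: FieldStats(1, "int", 5, 90, 0),
    2: FieldStats(2, "string", "apple", "pear", 3),
}

# Decoding the generic map requires outside knowledge that field 1 is an int.
decoded_lower = int.from_bytes(generic_stats["lower_bounds"][1], "little")

# The typed record carries its type, and a reader can pick one stat directly.
wanted = {1}
lower_bounds = {fid: s.lower for fid, s in typed_stats.items() if fid in wanted}

print(decoded_lower, lower_bounds, typed_stats[2].logical_type)
```

The typed version is also where extra per-field metrics would attach, since each record is a structure that can grow new fields rather than a flat map entry.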

The part that matters most is extensibility. A typed, structured stats model lets developers attach richer per-field metrics. For a variant column you might attach stats that describe its nested fields. For a geometry column you might attach a bounding box. And the structure can hold entirely new kinds of metrics. This is where vector search enters the conversation.

Approximate nearest neighbor search, the operation at the heart of vector databases and retrieval for AI, needs index structures that the current stats map simply cannot express. By rebuilding column statistics for flexibility, v4 opens the door to new index types that support these queries. An Iceberg table could carry the metadata needed to prune candidates for a similarity search the same way it prunes files for a range filter today. That turns Iceberg into a more serious home for the feature and embedding tables that AI workloads generate.

The chain of dependencies is now visible. Columnar Parquet metadata gives you a container that supports column pruning. Typed statistics give you the structured, extensible content to put in that container. The adaptive tree keeps commits cheap so you can write and update all of this without write amplification. The three proposals are not independent. They are one coordinated redesign of the metadata layer, split into pieces that can be reviewed and voted on.

Proposal four: relative paths and relocatable tables

This proposal fixes an operational headache that has annoyed teams for years.

Iceberg stores file references as absolute URIs. Every manifest and metadata file embeds the full path to the files it points at, including the bucket and region. That was a deliberate early decision. Absolute paths solved real consistency problems on eventually-consistent object stores, where a stale or ambiguous reference could corrupt a read.

The cost shows up the moment you need to move a table. Copy a table to a new bucket, a new region, or a different storage system, and every embedded path is now wrong. You have to rewrite the metadata to point at the new location. For a large table with deep metadata, that rewrite is slow and expensive. It turns routine operations into projects. Replication, disaster recovery backups, and multi-region deployments all run into this wall.

The v4 proposal adds support for relative paths inside table metadata. References get stored relative to the table root rather than as absolute URIs. Move the table root, and the internal relationships between metadata and data files stay valid without a rewrite. Copy the whole directory tree somewhere else, and it just works. Absolute paths remain available where you still need them, such as references to external data that lives outside the table root.
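A small Python sketch of the resolution idea. The `resolve` function, its rule for spotting absolute references, and every bucket and file name here are hypothetical. The actual resolution rules belong to the spec.

```python
def resolve(table_root: str, ref: str) -> str:
    """Resolve a file reference against the table root.

    Hypothetical rule for illustration: anything with a URI scheme is
    treated as absolute and returned unchanged. Everything else is
    taken as relative to the table root.
    """
    if "://" in ref:
        return ref
    return table_root.rstrip("/") + "/" + ref.lstrip("/")


# References as they might be stored inside table metadata.
refs = [
    "metadata/manifest-1.parquet",
    "data/part-0.parquet",
    "s3://shared-bucket/external/ref.parquet",  # external data stays absolute
]

old_root = "s3://bucket-us-east/warehouse/db/events"
new_root = "s3://bucket-eu-west/dr/db/events"

before = [resolve(old_root, r) for r in refs]
after = [resolve(new_root, r) for r in refs]

for path in after:
    print(path)
```

The stored references never change. Only the root handed to `resolve` changes, which is why a copy to a new bucket needs no metadata rewrite.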

The payoff is portability. A table becomes a self-contained, relocatable unit. You can replicate it to another region for disaster recovery and not pay a metadata rewrite tax. You can clone it for testing. You can migrate it between storage systems during a cloud transition. The Summit framing put it plainly. Relative paths eliminate entire categories of expensive metadata rewrites.

This is the proposal that is furthest along in the spec text. The spec already describes how table location works for "v4 and later," and the model assumes a catalog will provide the table's location rather than baking it into every file reference. That is a clean separation. The catalog knows where the table lives. The metadata describes the table's internal structure in terms relative to that location.

Proposal five: column families and efficient column updates for wide AI tables

This proposal targets the workload that did not exist when Iceberg was designed. Wide tables for machine learning.

Picture a feature table with 200 columns, or an embedding table where each row holds a large vector. Now picture a daily job that recomputes 5 or 10 of those features and leaves the rest untouched. Or a job that refreshes prediction scores after a model retrains. Or one that regenerates vector embeddings after a new embedding model ships.

In Iceberg today, all of these jobs pay the same brutal price. Updating any column means rewriting the entire row. A small update to a handful of features forces a full rewrite of files that hold all 200 columns. At petabyte scale this is cost-prohibitive. The write amplification is enormous. You touch 5 percent of the data and rewrite 100 percent of the files.

The proposal, tracked in GitHub issue 15146 as "Efficient column updates in Iceberg," attacks this directly. The idea is to write only the updated columns to separate column files and leave the unchanged columns sitting in the original base files. At read time, the engine stitches the column files together with the base files to materialize complete rows. You update the embedding column by writing a new embedding column file. The other 199 columns never move.
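The read-time stitch can be sketched in a few lines of Python. This is a toy under stated assumptions: columns are plain lists, the column file covers every row of the base file, and rows line up by position. The `stitch` function is hypothetical and says nothing about how an engine would actually implement the merge.

```python
# Base file: all columns as originally written (3 rows of toy data).
base_file = {
    "id": [1, 2, 3],
    "feature_a": [0.1, 0.2, 0.3],
    "embedding": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
}

# Column file from a later update: only the refreshed column, covering
# every row of the base file in the same row order (an assumption here).
column_file = {"embedding": [[0.5, 0.1], [0.4, 0.9], [0.3, 0.7]]}


def stitch(base, updates):
    """Overlay updated columns on the base columns, positionally by row."""
    n_rows = len(next(iter(base.values())))
    for name, values in updates.items():
        if len(values) != n_rows:
            raise ValueError(f"column {name!r} does not cover all rows")
    merged = {**base, **updates}  # updated columns win over base columns
    return [dict(zip(merged, row)) for row in zip(*merged.values())]


rows = stitch(base_file, column_file)
print(rows[0])
```

The write touched one column. The `id` and `feature_a` values come straight from the untouched base file, and the reader still gets complete rows.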

This is the column families pattern, and the Summit described it as first-class support for wide tables. Column groups get stored and evolved independently. New features can be backfilled into a table without touching the rest of it. A team can add a column family of fresh features and write only that family.

The use cases the proposal calls out map exactly to AI pipelines. Model score updates after retraining. Embedding refresh, which today triggers a full row rewrite. Incremental feature computation, where a daily batch touches a tiny fraction of a wide table's columns. These are not edge cases for AI teams. They are the daily routine.

This proposal leans hard on the others. It builds on single-file commits and on the column statistics rework. The design notes that explicitly. You need cheap commits to write column updates without amplification, and you need good per-column stats to keep reads fast once the data is split across base files and column files. The current draft scopes itself to updates that touch a column across all rows. Partial updates that touch a subset of rows are left for later work.

The design debate here is genuinely interesting, and it is not settled. Several contributors asked whether this belongs in Iceberg at all, or whether the right fix lives in Parquet. Parquet has a long-running effort to make its footer cheaper to read, including a proposal to replace the footer with FlatBuffers for dramatically faster reads. Parquet could introduce a concept of logical and physical files to manage a column-to-file mapping. The counterargument is that a column-to-file mapping inside Parquet starts to look like another manifest, which duplicates the job Iceberg already does. Other contributors pointed at how Lance, Hudi, and Paimon handle partial updates and column groups, and asked what Iceberg should borrow. One useful observation from the thread is that splitting a wide table into independently updated column families also reduces commit conflicts, since separate writers update separate families instead of serializing writes against one table.

This is the proposal that most clearly signals where Iceberg is heading. The format is being shaped to treat AI and machine learning data as a first-class workload, not a batch analytics afterthought.

Other proposals in the conversation

The five proposals above carry the most momentum, but they are not the whole v4 conversation. Several other ideas show up in the design documents and the dev list, and they are worth knowing about even if they are earlier in the process.

Multi-table transactions and catalog-level semantics come up often. Today an Iceberg commit is atomic for a single table. A pipeline that writes to several tables and needs all of them to commit together, or none of them, has to build that coordination itself. Many teams want a way to commit across tables atomically, so that a fact table and its related dimension tables move as one unit. This kind of catalog-level transaction would be transformative for complex pipelines, and it has been flagged as one of the most-watched horizon features. It is also one of the hardest to design, since it pushes transactional guarantees up from the table into the catalog, and the REST catalog spec would have to carry the new semantics. Expect this one to take time.

Refinements to the v3 types also continue. The variant type, added in v3 for semi-structured data, has room for richer operations and better statistics, and the column statistics rework feeds directly into making variant queries faster. The geospatial types added in v3 invite extended capabilities for spatial indexing and filtering. Row lineage, the feature that gives each row a persistent identity across commits, has open discussion about making incremental processing even cheaper. None of these are headline rewrites of the format. They are the steady tightening that happens once a feature ships and real workloads reveal the rough edges.

There is also ongoing work at the file-format layer that v4 depends on, even though it lives outside the Iceberg spec. The Parquet community is working to make the footer cheaper to read, including a proposal to replace it with FlatBuffers for faster metadata access. Parquet and Arrow are evolving for the AI era in parallel with Iceberg. The Summit paired the Iceberg metadata talks with sessions on evolving Parquet and Arrow for what comes next, since the table format and the file format have to move together. A faster Parquet footer makes columnar Iceberg metadata faster to read. Better Parquet support for column-level updates makes the column families proposal cleaner. The layers are coupled, and the communities coordinate.

Keep the maturity levels straight when you read about these. Single-file commits, Parquet metadata, typed statistics, relative paths, and column families have concrete design documents and active pull requests. Multi-table transactions and the type refinements are real conversations with less settled design. Treat the first group as the likely core of v4 and the second group as candidates that may land in v4 or may slip to a later version.

The convergence question: Iceberg v4 and Delta 5.0

No discussion of v4 is complete without the Databricks angle, since it reframes the whole conversation.

In the run-up to the Summit, Databricks announced that Iceberg v4 will rethink the core metadata structure with an adaptive metadata tree, and that Databricks is proposing Delta Lake 5.0 adopt the same structure. The pitch is convergence. One metadata layout that both Delta and Iceberg read and write directly. No translation layer like UniForm. No conversion tools like XTable. The two formats would sit on a shared on-disk foundation.

The technical claim is that Delta and Iceberg have already converged on the same ideas. Both have moved, or under v4 are moving, to columnar metadata for efficient pruning. Both use manifest-style trees for scalability. Both adopted deletion vectors for fast updates. Yet today each maintains its own separate metadata structure, which duplicates effort and forces translation when you want to read one format from the other format's engine. Databricks proposes that Delta 5.0 adopt the Iceberg v4 metadata tree as its native content metadata. The result would be a single structure that clients of either format read and write with no conversion overhead.

If this lands, the practical effect is large. The word you pick, Delta or Iceberg, would describe history rather than architecture. Switching formats would cost nothing at the metadata layer. That changes the competitive picture for every vendor that built a business on format choice.

The context behind this matters. In June 2024, Databricks paid more than a billion dollars for Tabular, the company founded by the original creators of Iceberg. The revenue multiple was indefensible on paper. The strategic logic was exact. The acquisition brought the architects of Iceberg inside Databricks. Two years later, the people who built the open format that was positioned as the alternative to Databricks now help steer how Databricks governs that format. The firm shaping the convergence narrative is the firm that bought the right to shape it.

Here is the part to hold onto. Convergence is a proposal, not a decision. The Iceberg community has to accept the direction, and that acceptance is an open conversation. It is the kind of debate that plays out over months on the dev list, the same way the single-file commit details are being argued. A proposal from one large vendor, even a vendor that employs the format's creators, still has to win the community vote. The governance model is the whole point of Iceberg. No single vendor can unilaterally change the spec in ways that disadvantage the others. That is what makes the format trustworthy for long-term architecture decisions. The convergence idea will be tested against that model.

Why this is happening now: streaming, AI, and a maturing ecosystem

Step back and the pattern across all five proposals is one story. Iceberg outgrew its original design assumptions, and v4 is the format catching up to its own success.

The workloads tell the story. Streaming pipelines commit every few seconds, and the old metadata tree cannot tolerate that commit latency. The adaptive tree and single-file commits answer streaming. Machine learning produces tables with thousands of columns and constant small updates, and the old layout forces full rewrites. Column families and efficient column updates answer ML. AI retrieval needs index structures the old stats map cannot hold, and the column statistics rework answers vector search. Disaster recovery and cloud migration need portable tables, and relative paths answer portability. Each proposal maps to a workload that was rare or nonexistent when v1 shipped.

The ecosystem reached the maturity to support this push. A spec is only as useful as the tools that implement it, and Iceberg's tooling crossed a threshold. The REST catalog turned from a convenience into the connective tissue of the open lakehouse. Any engine, JVM-based or not, can work with Iceberg tables through one common interface. Apache Polaris graduated to an Apache top-level project on February 18, 2026, after incubating for 18 months with contributions from Google, Microsoft, Confluent, and many others. The catalog is becoming the control plane for governance, security, and multi-tenant access.

Iceberg is also no longer a JVM-only project. The Rust implementation now powers the native scan operator in DataFusion-Comet, bypassing Spark's JVM overhead. A C++ implementation is emerging for engines that need predictable memory and SIMD-optimized execution. PyIceberg crossed 500,000 daily downloads on PyPI, and teams run it in production without ever spinning up Spark. These are production-grade implementations, and they widen who can build on Iceberg and where it can run.

Multi-engine access became routine rather than aspirational. Spark handles ingestion while Snowflake, Trino, DuckDB, or Flink serve queries, and teams describe this as established architecture. The interoperability promise Iceberg made years ago is now operational reality across cloud boundaries. The net effect is that adopting Iceberg no longer demands a single monolithic technology choice. You pick the catalog that fits your governance model, the engine that fits your latency needs, and the language that fits your team, and the spec keeps them composable.

V4 is the format growing to match that reality. The proposals support AI and streaming workloads as first-class citizens, not as workarounds bolted onto a batch design.

What practitioners should do about v4 right now

The temptation with a horizon spec is to either ignore it or to over-anticipate it. Both are mistakes. Here is a grounded way to think about it.

Run v3 in production. It is the current standard, and it carries the features most teams actually need today, including deletion vectors, the variant type, geospatial types, and row lineage. Build new tables on v3 and get comfortable with its capabilities. Do not wait on v4 features that have no committed timeline.

Watch the proposals that map to your pain. If you run streaming ingestion and you fight commit latency and small files, the adaptive metadata tree is the proposal to track. If you run wide ML feature tables and you burn money rewriting rows to update a few columns, follow the efficient column updates work in issue 15146. If you operate across regions and you dread table migrations, relative paths will change your operational life. Knowing which proposal solves your specific problem tells you which dev list threads to read.

Pay attention to the catalog decision now, since it does not wait on v4. The catalog has become a Tier-1 architecture choice. It is the control plane for governance and the thing that determines whether your data can be governed, optimized, and shared consistently across engines. Pick the catalog that fits your governance model and keep your governance boundary clear. That decision compounds over years, and a wrong choice creates operational debt that grows with every table you add.

Follow the source, not the summaries. The authoritative status of any v4 feature lives in the Apache Iceberg GitHub repository, the design documents linked from the issues, and the dev mailing list. Blog posts and conference recaps, including this one, are a starting point. The vote happens in the open, and the spec is the final word.

Keep perspective on convergence. The Iceberg v4 and Delta 5.0 shared metadata story is real and worth understanding, but it is a proposal under community review, not a settled fact. Treat it as a direction to watch rather than a plan to build on.

The shape of what comes next

Iceberg v4 is not one feature. It is a coordinated redesign of the metadata layer, broken into proposals that each solve a concrete operational problem. The adaptive metadata tree makes commits cheap and fast. Parquet metadata makes planning fast as metadata gets richer. Typed statistics make stats reliable and extensible, and they open the door to vector search. Relative paths make tables portable. Column families make wide AI tables practical to update. The Delta convergence proposal asks whether two formats can share one foundation.

These proposals reinforce each other. Cheap commits enable column updates. Columnar metadata holds typed stats. The pieces fit because they came from the same insight. Iceberg succeeded so completely that people now push it far past its original design, and the format has to evolve to hold that weight.

The debates are not noise. They are the system working. The questions about scan cost in the adaptive tree, about whether column families belong in Parquet or Iceberg, and about whether the community accepts convergence are the conversations that turn a good proposal into a durable spec. V4 will arrive after those arguments resolve, not before. That is slower than a single vendor shipping a feature, and it is exactly why the result will be worth building on.

For now, the practical advice holds. Run v3. Watch v4. Choose your catalog with care. And follow the work in the open, since the people building it are doing it where everyone can see.

Go deeper

If you want to understand the data lakehouse and the AI workloads reshaping it at the level this post only gestures at, the best next step is to read the books that cover it end to end. Alex Merced has written multiple hands-on books on Apache Iceberg, the agentic lakehouse, modern data architecture, and AI-assisted data work. They take you from the metadata internals through to building and operating real systems.

Pick them up at books.alexmerced.com and turn the concepts in this post into working knowledge.
