Alex Merced

Posted on Jun 8

The State of Apache Iceberg Catalogs in June 2026

#architecture #data #database #dataengineering

The table format question is settled. Apache Iceberg won. Snowflake, Databricks, AWS, Google, and Microsoft all read and write it, and the open source engines treat it as the default. The interesting fight moved up one layer. The catalog is now the part of the stack that decides whether your lakehouse is governed, interoperable, and ready for the wave of AI agents that want to query it without a human in the loop.

This is not a small detail. The catalog resolves metadata, controls access, vends credentials, sequences commits, and acts as the single API boundary between every engine and every byte of data you own. Pick the wrong one and you inherit operational debt that grows with each table. Pick well and you get engine freedom, one governance model, and a clean path as the spec evolves.

June 2026 is a useful moment to take stock. Apache Polaris graduated to a top-level project in February. Snowflake Summit just wrapped with Iceberg v3 going generally available and a Polaris-powered governance layer at the center of the keynote. Databricks set the stage for its own summit with a blunt claim that Unity Catalog is the most interoperable Iceberg catalog on the market. A two-year-old Iceberg operations startup got acquired by a security company valued at nine billion dollars. The pieces are moving fast, so here is a clear-eyed map of where every catalog stands, what it does well, where it falls short, and what shipped recently.

What an Iceberg Catalog Actually Does

An Iceberg table is a pile of Parquet files, metadata files, and manifest lists sitting in object storage. On its own it is inert. The catalog answers the one question that makes it queryable: where is the current metadata.json for this table? Without that pointer, no engine reads or writes anything.
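To make the pointer idea concrete, here is a minimal sketch in Python. Everything in it is a stand-in: the object store is a dict, the catalog is a dict, and the metadata layout is flattened to two hypothetical fields instead of the real Iceberg metadata schema.

```python
import json

# Toy object store: path -> file contents. Both metadata versions exist in
# storage, and nothing in storage says which one is current.
object_store = {
    "s3://lake/orders/metadata/v1.metadata.json": json.dumps(
        {"current-snapshot-id": 1, "manifest-list": "s3://lake/orders/metadata/snap-1.avro"}
    ),
    "s3://lake/orders/metadata/v2.metadata.json": json.dumps(
        {"current-snapshot-id": 2, "manifest-list": "s3://lake/orders/metadata/snap-2.avro"}
    ),
}

# The catalog holds the one fact storage cannot: the current metadata file.
catalog = {"sales.orders": "s3://lake/orders/metadata/v2.metadata.json"}


def load_table(identifier):
    """Resolve the pointer, then read the metadata file it names."""
    location = catalog[identifier]
    return json.loads(object_store[location])


table = load_table("sales.orders")
```

Delete the catalog entry and both metadata files are still sitting in storage, but no engine can tell which one describes the live table.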

Modern catalogs do far more than resolve pointers. They enforce who can read, write, or administer each table and namespace. They vend short-lived, table-scoped storage tokens so engines never hold long-lived cloud keys. They sequence concurrent writers with server-side deconflicting instead of fragile client-side locking. They organize tables into namespaces, track view definitions, and serve as the single point for lineage and audit. The catalog is where governance lives. Everything an engine does passes through it.

The reason this got interesting in 2026 is the Iceberg REST Catalog specification. Before REST, every engine needed a dedicated connector for every catalog. Spark talked to Hive Metastore one way, Trino talked to Glue another way, and custom tooling talked to an internal catalog a third way. Adding an engine or a catalog meant writing integration code for every pairing. REST collapses that. Implement the REST client once per engine, implement the REST server once per catalog, and the whole thing interoperates over plain HTTP.
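As a rough illustration of what "implement the REST client once" means, the sketch below builds two request paths the way I read the Iceberg REST OpenAPI spec: a config route, and a load-table route where multi-level namespaces are joined with the unit separator character and URL-encoded. The helper names and the prefix, namespace, and table values are made up for the example, so check the paths against the spec version your catalog serves.

```python
from urllib.parse import quote

NAMESPACE_SEPARATOR = "\x1f"  # unit separator between namespace levels


def config_path():
    """Route a client calls first to discover catalog defaults and overrides."""
    return "/v1/config"


def load_table_path(prefix, namespace, table):
    """Route for loading a table; namespace is a list of levels."""
    ns = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    return f"/v1/{quote(prefix, safe='')}/namespaces/{ns}/tables/{quote(table, safe='')}"


path = load_table_path("prod", ["sales", "emea"], "orders")
```

The arithmetic behind the collapse is simple. With four engines and five catalogs, dedicated connectors mean 4 x 5 = 20 integrations, while REST means 4 clients plus 5 servers, or 9.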

The protocol also opened the door to server-side capabilities the old Thrift-based approach made impossible. Credential vending limits the blast radius of a leaked token to one table for a few minutes. Remote signing goes further, so the engine never touches credentials at all and the catalog pre-signs each file access. Server-side commit deconflicting retries conflicts on the server. Multi-table commits give atomic visibility across several tables at once. The newest addition is scan planning. The Iceberg 1.11 release added a REST scan planning client, which lets the catalog plan a scan on the server and hand back a filtered plan. That single feature is the foundation for cross-engine access control, because the catalog can apply row filters and column masks during planning and return a plan that covers only the data an engine is allowed to see.
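Multi-table commits are the easiest of these to picture in code. The sketch below is a toy, assuming a plain dict of table pointers and a hypothetical commit_transaction helper. A real catalog would run the same check inside a lock or a database transaction.

```python
def commit_transaction(pointers, changes):
    """Apply every (table, expected, new) pointer change, or none of them."""
    # Check every requirement before touching anything, so one stale
    # expectation rejects the whole batch.
    for table, expected, _new in changes:
        if pointers.get(table) != expected:
            return False
    for table, _expected, new in changes:
        pointers[table] = new
    return True


pointers = {"orders": "orders-v1", "payments": "payments-v1"}

ok = commit_transaction(pointers, [
    ("orders", "orders-v1", "orders-v2"),
    ("payments", "payments-v1", "payments-v2"),
])
rejected = commit_transaction(pointers, [
    ("orders", "orders-v2", "orders-v3"),
    ("payments", "payments-v1", "payments-v3"),  # stale: payments is already at v2
])
```

The second batch fails as a unit. The orders change was valid on its own, but it does not land, because a reader must never see new orders next to old payments.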

Scan planning is the feature to watch this year, so it is worth slowing down on. In the old model, an engine asked the catalog for a table’s metadata, then planned the scan itself by reading manifest files and deciding which data files to touch. The engine saw everything. Server-side scan planning flips that. The engine asks the catalog to plan the scan, and the catalog reads the metadata, applies the row filters and column masks the policy defines for this caller, and returns a plan that points only at authorized data. The engine never sees what it is not permitted to see, because the filtering happened before the plan existed. That is how a single set of policies, defined once in the catalog, gets enforced across Spark, Trino, DuckDB, and anything else that implements the client. It also offloads expensive planning work from the engine to the catalog, which caches it. Gravitino, Databricks, and Snowflake all built features on this in the last few months, and it is the technical backbone of cross-engine governance.
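Here is a deliberately simplified sketch of that flow. It assumes a policy that is just a set of allowed partition values plus a set of hidden columns, and the DataFile, Policy, and plan_scan names are invented for the example. Real row filters work below the partition level, and real column masks usually transform values instead of dropping the column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    region: str  # partition value, used here as the row-filter column


@dataclass(frozen=True)
class Policy:
    allowed_regions: frozenset
    masked_columns: frozenset


def plan_scan(files, columns, policy):
    """Return (file paths, projected columns) this caller may read."""
    visible_files = [f.path for f in files if f.region in policy.allowed_regions]
    visible_columns = [c for c in columns if c not in policy.masked_columns]
    return visible_files, visible_columns


files = [
    DataFile("s3://lake/orders/region=eu/a.parquet", "eu"),
    DataFile("s3://lake/orders/region=us/b.parquet", "us"),
]
policy = Policy(allowed_regions=frozenset({"eu"}), masked_columns=frozenset({"email"}))

plan = plan_scan(files, ["order_id", "email", "total"], policy)
```

The engine receives only the plan. The US file and the email column never appear in it, so there is nothing for the engine to enforce or get wrong.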

Remote signing deserves the same attention for sensitive data. With credential vending, the catalog hands the engine a short-lived token scoped to a table. With remote signing, the engine gets no token at all. Every individual file read is pre-signed by the catalog, scoped to one file and one operation. For regulated data where even a few minutes of broad access is unacceptable, that difference matters, and the catalogs that support it, Polaris, Lakekeeper, and others, are starting to align on the Iceberg 1.11 signer endpoint properties so engines configure it the same way everywhere.
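The difference in scope shows up clearly in a toy model. The sketch below uses an HMAC with a made-up secret to stand in for both mechanisms. It is not how any cloud provider or catalog signs requests, and every function name here is hypothetical.

```python
import hashlib
import hmac

SECRET = b"catalog-only-secret"  # hypothetical key that never leaves the catalog


def _sign(message):
    return hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()


def vend_token(table_prefix, expires_at):
    """Credential vending: one token valid for every file under a table prefix."""
    return {"prefix": table_prefix, "expires_at": expires_at,
            "sig": _sign(f"{table_prefix}|{expires_at}")}


def token_allows(token, path, now):
    valid = hmac.compare_digest(token["sig"], _sign(f"{token['prefix']}|{token['expires_at']}"))
    return valid and now < token["expires_at"] and path.startswith(token["prefix"])


def sign_request(method, path):
    """Remote signing: a signature bound to one file and one operation."""
    return _sign(f"{method}|{path}")


def signature_allows(signature, method, path):
    return hmac.compare_digest(signature, _sign(f"{method}|{path}"))


token = vend_token("s3://lake/orders/", expires_at=1000)
sig = sign_request("GET", "s3://lake/orders/data/a.parquet")
```

Until it expires, the vended token opens every file under the table prefix. The signature opens one file for one verb, and is useless for a different file or for a write.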

Every catalog released after 2023 either speaks REST or is racing to add it. The question is no longer whether to use the protocol. The question is which REST implementation fits your stack, and that is what the rest of this piece works through.

Iceberg v3 Lands, and v4 Is Already on the Whiteboard

Two format milestones frame the catalog story this year.

Iceberg v3 reached general availability across the major platforms in the first half of 2026. It adds deletion vectors, which speed up updates, merges, and deletes by marking deleted rows instead of rewriting files. It adds row tracking, which makes incremental processing far cheaper. It adds the VARIANT type, a standard way to store semi-structured data so JSON-shaped payloads stop forcing awkward workarounds. Snowflake, Databricks, and Amazon S3 Tables all confirmed v3 support as generally available, and the catalogs that store the metadata followed. This matters for catalogs because v3 features ride through the catalog API, and not every catalog supports creating v3 tables yet. AWS Glue, for example, still cannot create v3 tables through its REST CreateTable path even though EMR and Glue ETL can work with them.
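Deletion vectors are easier to reason about with a toy model. The sketch below assumes a data file is a list of rows and the deletion vector is a set of row positions. The real format stores a compressed bitmap alongside the data files, which this example does not attempt to reproduce.

```python
def read_with_deletion_vector(rows, deleted_positions):
    """Return the rows of an immutable data file, skipping positions marked deleted."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


data_file = ["order-1", "order-2", "order-3", "order-4"]  # never rewritten
deletion_vector = set()

# A DELETE marks positions instead of rewriting the file.
deletion_vector.add(1)
deletion_vector.add(3)

live_rows = read_with_deletion_vector(data_file, deletion_vector)
```

The write cost of the delete is two small position entries. The four-row file stays untouched until a later maintenance job decides a rewrite is worth it.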

The next frontier is already public. Databricks used its pre-summit blog to announce that Iceberg v4 will rethink the core metadata structure with an adaptive metadata tree, and that it is proposing Delta 5.0 adopt the same structure. The pitch is convergence: one metadata layout that both Delta and Iceberg share, ending the long trade-off between interoperability and production-grade performance. Whether the Iceberg community accepts that direction is an open conversation, and it is the kind of debate that plays out over months on the dev list. For now, treat v3 as the production target and v4 as the horizon worth watching.

Snowflake Summit 2026: Horizon Catalog, Powered by Polaris

Snowflake Summit 2026 ran the first week of June, and the catalog news sat at the center of the keynote rather than buried in a breakout.

The headline is that Horizon Catalog, Snowflake’s governance and discovery layer, now runs its interoperability on Apache Polaris and enables bi-directional read and write access to Snowflake-managed Iceberg tables from outside engines. That is a real shift. For years, “open” often meant external engines could read Snowflake data but not write it. The bi-directional write path closes that gap. An external Spark or Trino job can now write to a Snowflake-managed Iceberg table through Polaris-implemented open APIs, with Snowflake’s governance applied through the Iceberg REST Scan Plan API so fine-grained protections travel across compatible engines.

It helps to keep two Snowflake products straight, because the naming confuses people. Snowflake Open Catalog is the managed Apache Polaris service for externally managed Iceberg tables, aimed at cross-engine interoperability with zero self-hosting. Snowflake Horizon Catalog is the governance and discovery layer for Snowflake-managed assets, and its interoperability layer is now built on the same Polaris engine. Snowflake has been explicit that it runs the same Polaris backbone the community downloads, not a stripped-down fork. That is a meaningful commitment in a space where “open” has been used loosely.

Around the catalog, Snowflake added Horizon Context for an AI and BI context layer, Semantic Studio and Semantic View Autopilot for building shared business logic, and Adaptive Compute for matching resources to AI workloads. It also folded its Natoma acquisition into a set of agent identity and security features. The analyst read from Constellation Research was sharp: Iceberg v3 is table stakes, and the real story is read and write interoperability plus governance, trust, and context for agents. The format war is over, so the platforms are competing on meaning and control instead.

Databricks Sets the Stage for Its Own Summit

Databricks holds its Data + AI Summit from June 15 to 18, so the biggest stage-show announcements land the week after this writing. The company did not wait, though. It published a detailed Unity Catalog and Iceberg post on May 28 that reads like a marker planted firmly in the ground.

The claim is direct: Unity Catalog is the most complete and interoperable Iceberg catalog available, and the proof is a batch of capabilities moving to general availability. Managed Iceberg is GA, so you create, read, write, optimize, govern, and share Iceberg tables directly in Unity Catalog with Predictive Optimization and Liquid Clustering handling the tuning. Iceberg v3 is GA, with deletion vectors, row tracking, and VARIANT across managed, foreign, and UniForm-enabled tables. Foreign Iceberg is GA, along with credential vending for foreign Iceberg, so Unity governs and securely queries tables that live in other catalogs. External sharing to Iceberg clients is GA through the open Delta Sharing protocol, with foreign Iceberg sharing in public preview.

Databricks framed the pitch around five requirements it says define a real Iceberg catalog: open APIs with credential vending, federation across external estates, cross-engine governance, secure and open sharing, and continuous performance and format innovation. The cross-engine governance piece is the technically interesting one. Cross-engine attribute-based access control is in beta, and it works by enforcing column masks and row filters during server-side scan planning through the Iceberg REST scan APIs. Any engine that implements the scan planning client from Iceberg 1.11, such as Spark or DuckDB, gets the same policies applied without a Databricks runtime. New federation connectors in preview extend Unity beyond Glue, Snowflake Horizon, Hive Metastore, and Salesforce Data Cloud to include Google Cloud Lakehouse and Palantir.

The honest read on Databricks is the same as it has been. The managed Unity Catalog is excellent and deeply tied to the Databricks platform. The open source Unity Catalog under Linux Foundation governance is a separate, slower-moving project with a real feature gap, and you should not assume parity between the two.

Apache Polaris: The Community Standard Comes of Age

Apache Polaris is the catalog that gained the most ground in the last year, and the trajectory is worth laying out.

Snowflake and Dremio co-created Polaris and donated it to the Apache Software Foundation in August 2024. It incubated for 18 months with contributions from Google, Microsoft, Confluent, and dozens of other organizations, and it graduated to an Apache top-level project on February 18, 2026. The 1.0 release shipped in October 2025 with external identity provider support for Okta and Google, a persistent policy store for things like compaction and snapshot expiration, and a downloadable binary plus Helm chart. The 1.4 release in April 2026 was the first post-graduation drop, and it pushed hard on production hardening: storage-scoped AWS credentials, AWS STS session tags so CloudTrail can correlate access, S3 KMS encryption support, CockroachDB as a persistence backend, and Iceberg metrics persistence to the database.

What Polaris does well is the core a vendor-neutral catalog needs. It implements the Iceberg REST spec fully, including credential vending, server-side deconflicting, multi-table commits, and OAuth2. Its access model uses a clean hierarchy of principals, principal roles, and catalog roles, which decouples identity from permissions and enforces security at the catalog layer no matter which engine runs the query. A single Polaris server manages many logical catalogs, each with its own storage and keys. Catalog federation lets one Polaris instance route to Hive Metastore, Glue, and other Iceberg REST endpoints, so you adopt it incrementally instead of doing a big-bang metadata migration. Generic Tables register non-Iceberg assets like Delta and Hudi alongside Iceberg tables in the same namespace, and the same feature opens a path to storing semantic assets like metric definitions in the catalog itself. Open Policy Agent integration is maturing for teams that want external authorization.
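The three-level access hierarchy can be sketched as three lookups. This is an illustration of the shape only: the role names, the privilege strings, and the is_allowed helper are invented, and Polaris defines its own privilege vocabulary and grant model.

```python
# principal -> principal roles
PRINCIPAL_ROLES = {"spark_etl": {"data_engineer"}, "bi_tool": {"analyst"}}
# principal role -> catalog roles
CATALOG_ROLES = {"data_engineer": {"sales_writer"}, "analyst": {"sales_reader"}}
# catalog role -> privileges on securables (names are illustrative)
GRANTS = {
    "sales_reader": {("sales.orders", "read")},
    "sales_writer": {("sales.orders", "read"), ("sales.orders", "write")},
}


def is_allowed(principal, table, action):
    """Walk principal -> principal role -> catalog role -> grant."""
    for principal_role in PRINCIPAL_ROLES.get(principal, ()):
        for catalog_role in CATALOG_ROLES.get(principal_role, ()):
            if (table, action) in GRANTS.get(catalog_role, ()):
                return True
    return False
```

Because identity lives in the first map and permissions in the last, you can move a service account to a different team by editing one entry, without touching a single table grant.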

The recent pull request activity shows where the project is putting its energy. In early June the community merged a credential vending refactor in core, added support for access delegation in registerTable, and moved event listeners onto a dedicated thread pool so the audit and change-event path does not block commits. There was also cleanup that says a lot on its own: a fix removing the incubator path segment from binary distribution URLs, one of the small chores that follow graduation. The forward work the community keeps discussing is the Table Sources proposal, which aims to turn Polaris into a registry for every lakehouse asset, not just tables and views but functions, metrics, and models. If that ships, the catalog becomes the single place every team and every agent looks for governed, semantically rich data.

The honest limits are real. Polaris is a Quarkus-based JVM service, so the open source path means you run and scale it yourself along with a PostgreSQL, MySQL, or CockroachDB backend. It has no Git-style branching the way Nessie does. And the line between the Apache project and Snowflake’s commercial Open Catalog can blur, so feature parity between the two is not guaranteed.

Project Nessie: Git for Your Catalog

Project Nessie, created by Dremio, takes a different angle that nothing else on this list matches. It brings Git-like semantics to catalog metadata. You create branches, tags, and commits over the entire catalog state, which lets you run isolated experiments, build CI/CD workflows for data, and roll the whole catalog back to a previous commit.

The branching is the point. You spin up dev, staging, and feature branches of your catalog, write to a branch in isolation, then merge when the work is ready. That is genuinely useful for testing schema changes, validating a backfill, or doing feature engineering against production data without touching live tables. Catalog-level time travel gives you a global undo across every table at once, not just per-table snapshots. Merges provide atomic visibility, and cherry-pick works exactly like it does in Git. Nessie implements the Iceberg REST interface, so engines connect over the standard protocol, and the 0.107.5 release in April 2026 added Spark SQL 4.0 extensions for branch and tag management.
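A toy model shows why catalog-level branching differs from per-table snapshots. The BranchingCatalog class below is hypothetical and far simpler than Nessie: it keeps no commit history, detects no conflicts, and merges by letting the source branch win.

```python
class BranchingCatalog:
    """Toy model: each branch maps table names to metadata pointers."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A branch starts as a snapshot of the source branch's whole catalog state.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, pointer):
        self.branches[branch][table] = pointer

    def merge(self, source, target="main"):
        # Build the merged state first, then swap it in, so readers of the
        # target see either none or all of the source's changes.
        merged = dict(self.branches[target])
        merged.update(self.branches[source])
        self.branches[target] = merged


catalog = BranchingCatalog()
catalog.commit("main", "orders", "v1")
catalog.create_branch("backfill")
catalog.commit("backfill", "orders", "v2")
catalog.commit("backfill", "customers", "v1")
main_before_merge = dict(catalog.branches["main"])
catalog.merge("backfill")
```

Until the merge, main still shows the old orders table and no customers table at all. After it, both changes appear together.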

The limits keep Nessie in a specialist role rather than a default. It has no built-in fine-grained access control, so production deployments pair it with Polaris, an OPA layer, or a custom authorization service. It does not vend credentials, so engines bring their own storage access. And the branching itself is only worth the operational overhead if your workflows actually benefit from data CI/CD. For a team that just needs metadata resolution and access control, branch management is complexity without payoff. The merges also provide atomic visibility rather than true multi-statement ACID, which is a distinction worth understanding before you design around it.

Apache Gravitino: The Federated Metadata Lake

Apache Gravitino is the most ambitious project in this group, and it frames itself as more than an Iceberg catalog. It calls itself a federated metadata lake, a single layer for tables, files, models, Kafka topics, and UDFs across many backend systems. It graduated to an Apache top-level project in June 2025, shipped 1.0, and reached 1.2.0 on March 13, 2026.

The breadth is the selling point. Gravitino connects to Hive, MySQL, PostgreSQL, HDFS, S3, Iceberg, Hudi, Paimon, ClickHouse, StarRocks, OceanBase, and more through one API, with changes reflected through direct connectors instead of ETL-based metadata sync. It runs a native Iceberg REST endpoint so any REST-compatible engine treats it as an Iceberg catalog. The 1.2.0 release added a Table Maintenance Service that schedules table health work proactively, a ClickHouse catalog for governing real-time analytics next to the lakehouse, end-to-end UDF management, authorization for Iceberg view operations, a redesigned web UI, and scan planning offload so engines like DuckDB and Spark delegate planning to Gravitino’s IRC server. The project also leaned into AI-native metadata in 2025 with a Model Catalog, an MCP server to connect agents to data context, and a Lance REST service for vector data.

The recent pull requests reinforce the federation-first identity. In early June the community merged Flink connector view support for Iceberg and Paimon catalogs, a Glue catalog UI in the new web console, support for complex types in Iceberg tables managed through Glue, and REST catalog backend HTTP timeout configs. These are the connector and integration fixes a project ships when its job is to sit in front of many systems at once.

The limits follow from the ambition. Documentation lags the feature set, especially around production hardening. Running Gravitino means operating a JVM server, its connector layer, and the federation topology, which is a large configuration surface. Engine integration is most mature for Trino, with Spark and Flink progressing but not at parity. And if you only need an Iceberg catalog, Gravitino is more machine than the job requires.

Lakekeeper: The Lightweight Rust Option

Lakekeeper is the youngest catalog here and the most opinionated about staying small. It is written entirely in Rust and ships as a single binary with no JVM and no Python. Point it at a PostgreSQL database and it serves REST requests in milliseconds, which makes it a natural fit for containers and Kubernetes.

It implements the full Iceberg REST spec, including multi-table commits, server-side deconflicting, and table and view statistics. Storage access uses vended credentials and remote signing across S3, GCS, ADLS, and on-premise S3-compatible stores. Authorization runs on OpenFGA by default with an OPA bridge for Trino, and authentication accepts any OIDC provider plus native Kubernetes service account auth. A single deployment serves many isolated projects and warehouses, and built-in CloudEvents emission lets you react to table changes by triggering compaction or feeding a CDC pipeline. The 0.12.0 release in April 2026 concentrated on authorization, adding an audit event handler with exactly-once guarantees, OPA batch optimization, Trino custom rule extensions, configurable admin users, and better role lifecycle management.

The recent pull requests show the same focus sharpening. In early June the project added a role-membership backend with role-in-role nesting and bounded nesting depth at write time, published support for Cedar policies including a global_role_ids requirement, and started emitting the Iceberg 1.11 signer.uri and signer.endpoint properties so remote signing lines up with the latest spec. There was also a fix to retry transient failures when acquiring storage OAuth tokens, the kind of reliability work that matters at scale.

The limits are mostly about maturity and scope. It is a young project with a smaller community, so production deployment stories are still accumulating. It has no branching. PostgreSQL is the backing store unless you implement the storage trait yourself. And it has been validated most with Spark, PyIceberg, Trino, and StarRocks, with Flink and Hive less proven. For teams that want a fast, dependency-light catalog with strong authorization, though, it is a solid pick. A commercial Lakekeeper Plus edition from Vakamo adds enterprise maintenance and snapshot management, and Red Hat certified it for OpenShift.

Unity Catalog Open Source: The Other Half of the Story

The managed Unity Catalog is a Databricks product, but the open source Unity Catalog is its own project under Linux Foundation governance, and it deserves a separate look because the two move at different speeds.

The open source pull request activity in late May and early June tells you the project is converging on Delta-first managed tables while keeping the Iceberg REST path. Recent merges made the Delta REST API enabled by default, enabled managed tables by default with server.managed-table.enabled=true, added support for column default values, enforced case-insensitive Delta column names, and turned on credential-scoped filesystem access by default in the Spark connector. A run of changes renamed and tightened the Delta API contract. The direction is a more opinionated, batteries-included server that works out of the box rather than requiring deep configuration.

The takeaway holds steady. If you run Databricks, the managed Unity Catalog is the natural and often mandatory choice, with Predictive Optimization, Liquid Clustering, and AI asset governance you do not get elsewhere. If you run the open source version off-platform, expect a real feature gap and plan around it.

The Managed and Cloud-Native Catalogs

Self-hosting is not the only path, and for many teams it is the wrong one. The cloud providers all ship managed catalog services that trade portability for zero operations.

Snowflake Open Catalog is the managed Apache Polaris service. You get the same REST API, RBAC, and credential vending as the open source project with nothing to host. It is generally available and free today, with pay-per-request billing planned for later in 2026. For teams that want Polaris without operating a JVM service, it is the path of least friction, and it stays vendor-neutral because the underlying project is.

AWS gives you two related options. The AWS Glue Data Catalog is the long-standing managed, serverless metadata service, deeply tied to IAM, Lake Formation, Athena, EMR, and Redshift. It added an Iceberg REST endpoint in late 2024, so external engines connect without Glue-specific SDKs. The limits are well known: it is AWS-only with no built-in cross-cloud federation, it supports a single level of namespace nesting, it has no branching or multi-table commits, and its REST surface has gaps. UpdateTable is not supported for Iceberg tables through the REST API, v3 tables cannot be created through the REST CreateTable path, and the REST endpoint does not vend credentials. The newer option is Amazon S3 Tables, which are first-class AWS resources that expose the Iceberg REST Catalog API and deliver up to ten times higher transactions per second than Iceberg tables in general-purpose buckets. S3 Tables now support Iceberg v3, include table-level access control and built-in maintenance, and integrate with SageMaker Lakehouse for unified governance and fine-grained access control. The open source S3 Tables Catalog client library bridges the control-plane operations to engines like Spark.

Google BigLake Metastore is a serverless, managed Iceberg REST catalog on GCP. It supports interoperability between Spark, Trino, and BigQuery on the same tables in Cloud Storage, and it includes BigQuery federation so a table created in Spark is queryable in BigQuery without a copy. Microsoft Fabric OneLake Catalog manages metadata for tables across Fabric workspaces with Delta and Iceberg support, tightly bound to the Fabric platform.

Streaming sources are part of this picture too, and they are easy to forget. Confluent’s Tableflow materializes Kafka topics directly as Iceberg tables and registers them in a catalog, so the data an application produces lands in the lakehouse as a governed Iceberg table without a separate batch pipeline. Confluent was one of the original Polaris contributors, and the pattern matters because it means the catalog is no longer fed only by batch ETL. Real-time data writes straight into it. Any catalog you choose has to handle a write path that includes streaming ingestion, not just nightly jobs, and the ones with server-side commit deconflicting handle the concurrent writes that streaming produces far better than the ones without it.

Dremio also offers a managed Polaris-based catalog as part of its platform, called Open Catalog. It gets its own section below, because the changes there over the last six months are substantial enough to treat on their own.

For completeness, the Iceberg project also ships a JDBC catalog that stores metadata pointers in any JDBC-compatible database. A SQLite-backed JDBC catalog is excellent for local development, unit tests, and CI because it needs no cloud services. A PostgreSQL-backed one works for single-writer or moderate-concurrency production. It is not a REST catalog, though, so engines need JDBC drivers on the classpath, and you get no credential vending, no server-side deconflicting, and no multi-table commits. Treat it as a stepping stone, not a destination.
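The core of a database-backed catalog fits in a few lines of standard-library Python. The sketch below uses SQLite and a hypothetical one-table schema, not the schema the Iceberg JDBC catalog actually creates, to show how a conditional UPDATE acts as the commit check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_pointers ("
    "  table_name TEXT PRIMARY KEY,"
    "  metadata_location TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO table_pointers VALUES (?, ?)",
    ("sales.orders", "file:///tmp/orders/v1.metadata.json"),
)


def commit(conn, table_name, expected, new_location):
    # The WHERE clause is the concurrency check: the row only changes if the
    # pointer still matches what this writer read.
    cursor = conn.execute(
        "UPDATE table_pointers SET metadata_location = ? "
        "WHERE table_name = ? AND metadata_location = ?",
        (new_location, table_name, expected),
    )
    conn.commit()
    return cursor.rowcount == 1


first = commit(conn, "sales.orders", "file:///tmp/orders/v1.metadata.json",
               "file:///tmp/orders/v2.metadata.json")
stale = commit(conn, "sales.orders", "file:///tmp/orders/v1.metadata.json",
               "file:///tmp/orders/v3.metadata.json")
```

The second writer loses and has to retry on the client side. That client-side retry is the work a REST catalog with server-side deconflicting takes off the engine.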

Dremio: The Agentic Lakehouse Built on Polaris

Dremio sits in an unusual spot in this map. It co-created Apache Polaris and Apache Arrow, it is one of the most active Polaris contributors, and its Open Catalog uses Polaris at the core rather than a separate fork. So when you adopt Dremio’s catalog, you adopt the same open standard the community governs, with Dremio’s platform built around it. That framing matters for what changed over the last six months, because Dremio spent the period turning its catalog from a managed metadata service into the center of an autonomous, agent-first platform.

The repositioning came at the Subsurface conference in November 2025, when Dremio relaunched Dremio Cloud as “the Agentic Lakehouse,” described as built for agents and managed by agents. The pitch puts AI agents as a first-class operator of the platform rather than a copilot bolted onto the side, and the catalog is the foundation it all sits on. Through the first half of 2026 the company shipped the pieces that back the claim.

Start with the catalog itself. Open Catalog is managed Polaris, provisioned the moment you start, so you get RBAC, credential vending, and the Iceberg REST spec without operating a JVM service. Dremio extends it with fine-grained access control through UDFs, which adds row-level security and column masking that travel with the data across every access path, not just inside one engine. Its query federation engine connects databases, warehouses, and external catalogs such as PostgreSQL, Snowflake, BigQuery, Glue, and Unity Catalog into the same governed namespace, so the catalog governs more than Iceberg tables. On top, the AI Semantic Layer lets teams build curated SQL views in Bronze, Silver, and Gold tiers with wikis, tags, and AI-generated metadata, which is the business context an agent needs to turn a vague question into a correct query.

The autonomous side is where the last six months added the most. Dremio Cloud now runs an active metadata system that watches query patterns, data relationships, and usage trends to make optimization decisions on its own. It automatically builds performance materializations through Reflections and rewrites incoming SQL in real time to hit sub-second response times. It reorganizes physical data layouts through automated clustering based on access patterns. And it runs compaction and table maintenance on the Iceberg tables in the catalog without a human scheduling the jobs. This is the same operational layer the rest of this piece keeps pointing at, the work catalogs historically do not do, folded directly into the platform.

Two open-standard milestones in the window reinforced the position. Polaris graduated to a top-level Apache project in February 2026, which hardened the open core under Dremio’s Open Catalog, and Dremio used the moment to highlight new community appointments and its continued contribution pace. In April 2026, Dremio brought Iceberg v3 support to general availability in Dremio Cloud, putting deletion vectors, row tracking, and VARIANT in reach for its users at the same time the other major platforms shipped v3. The company also leaned on its own research, a 2026 State of the Data Lakehouse and AI report, where 65 percent of organizations named agentic analytics a top priority for the year and 70 percent pointed to siloed data and weak governance as the main obstacles to getting value from AI. That data is the argument for the whole agentic pitch.

The agent connectivity story is worth calling out on its own. Dremio Cloud natively supports the Model Context Protocol, so any MCP-enabled agent from Anthropic, OpenAI, or Google connects to the catalog and semantic layer through a standard interface. It also ships its own AI Agent for business users and analysts to ask questions and get answers and visualizations directly. Both paths read the same governed catalog and the same semantic definitions, which is the point of putting meaning in the catalog rather than in each tool.

The honest framing is the same one that applies to every managed platform here. Dremio’s value is the integration: catalog, federation, semantic layer, autonomous optimization, and agent access in one place, so you do not assemble five tools and wire them together. The trade is platform coupling. The mitigating factor specific to Dremio is that the catalog core is open Polaris and the tables are open Iceberg, so the lock-in is lighter than a proprietary catalog and you can point other engines at the same data. For teams that want the autonomous and agentic capabilities without building them, that integration is the draw. For teams that want only a bare catalog, Open Catalog is more platform than the job needs, and self-hosted Polaris is the leaner path.

Here is the thing almost every catalog comparison skips. None of these catalogs tell you whether a table is healthy. They resolve pointers and enforce access. They do not track orphan files piling up, manifests that need consolidation, snapshot history eating storage, or a compaction schedule falling behind ingestion. A catalog tells you where the data is and who can touch it. It does not keep the data fast.
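To show what a table health signal can look like, here is a minimal sketch of one check, the share of small data files. The 32 MiB threshold and the 50 percent trigger are arbitrary assumptions for illustration, not values taken from any catalog or vendor.

```python
def small_file_ratio(file_sizes_bytes, threshold_bytes=32 * 1024 * 1024):
    """Share of data files below a size threshold (32 MiB here, an arbitrary choice)."""
    if not file_sizes_bytes:
        return 0.0
    small = sum(1 for size in file_sizes_bytes if size < threshold_bytes)
    return small / len(file_sizes_bytes)


MIB = 1024 * 1024
sizes = [4 * MIB, 8 * MIB, 2 * MIB, 256 * MIB]

ratio = small_file_ratio(sizes)
needs_compaction = ratio > 0.5  # hypothetical trigger
```

The file sizes already sit in the table's manifests. The point is that a catalog that only resolves pointers never looks at them.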

That gap is closing from two directions, and watching how is one of the clearest signals about where the market is going.

The first direction is catalogs absorbing maintenance. Gravitino 1.2.0 shipped a Table Maintenance Service. Databricks built Predictive Optimization and Liquid Clustering into Unity Catalog so maintenance runs based on access patterns. AWS S3 Tables include automatic compaction. Polaris added a policy store for compaction and snapshot expiration in 1.0. The catalog is slowly becoming the place where table health gets managed, not just where metadata lives.

The second direction is a dedicated operational tier that sits next to the catalog. This is where the year’s most telling acquisition comes in.

What the Ryft Acquisition Signals

On April 23, 2026, Cyera acquired Ryft. The price was not disclosed, but Israeli press put it between 100 and 130 million dollars, a strong return on Ryft’s eight million dollar seed round and a notable outcome for a company founded only in 2024.

Ryft built an automated Iceberg management platform. It monitored an entire Iceberg lakehouse, detected tables with too many small files or partition schemes that forced wasteful scans, and ran compaction and layout optimization based on actual usage patterns, with claims of cutting query times and costs by up to a factor of ten. It also handled snapshot lifecycle policies, automated data retention, and GDPR-style compliance deletion, the operational chores that keep a lake healthy and audit-ready. In early 2026 it added a Lakehouse Context Layer that turned the signals it already collected (schema, query patterns across engines, freshness, and statistics) into agent-readable context for every table.

Cyera is not a data analytics company. It is an AI security platform valued at nine billion dollars after a recent 400 million dollar Series F, focused on data security posture management for the age of autonomous agents. It bought Ryft to extend its control plane into the data lake layer, where agents increasingly operate, and Ryft’s CEO is now leading AI security efforts at Cyera. Read that again. A security vendor paid nine figures for an Iceberg operations startup so it could give AI agents traceable, governed, secure access to lakehouse data.

That tells you two things. Iceberg table operations, the compaction and lifecycle work catalogs do not handle, are now valuable enough that a security giant pays a premium for them. And the reason is agents. The lakehouse is becoming the place agents read and write, and whoever controls the operational and security layer around the catalog controls how safely that happens. Independent operational vendors like LakeOps make the same bet from a different angle, connecting to existing catalogs and adding autonomous maintenance on top. The catalog resolves metadata and access. Something else has to keep the tables healthy and keep the agents honest. That layer is now contested ground.

The Catalog Is Becoming the AI Control Plane

Step back from the individual products and a pattern is obvious. Every catalog roadmap in 2026 is bending toward AI agents, and the bending is reshaping what a catalog is.

Start with what agents need. A human analyst who writes a wrong query notices the result looks off and fixes it. An agent querying tables at high frequency, without review, does not. It needs the catalog to carry enough context that a generic question produces a correct answer: what a metric means, how a table is joined, which rows a caller is allowed to read. That pushes three things into the catalog that used to live elsewhere.

The first is semantics. Polaris stores Iceberg SQL view definitions, so the meaning of “active customer” lives in the catalog and every engine reads the same definition. Its Generic Tables feature lets teams register metric definitions, ownership, and lineage as governed assets next to the data. The Table Sources proposal aims to extend that to functions, metrics, and models. Snowflake added Horizon Context and Semantic Studio for the same reason. The catalog is turning into the place business meaning is stored, not just table locations.

The second is machine-readable access. Gravitino shipped an MCP server in 2025 so agents connect to data context through the Model Context Protocol, along with a Model Catalog and a Lance REST service for vector data. The acquired Ryft platform built a Lakehouse Context Layer that turned table usage signals into agent-readable context. The direction is the same across vendors: the catalog should expose itself to an agent the way it exposes itself to a query engine, through a standard interface that carries context, not just metadata.

The third is governance that holds when the caller is not a person. Cross-engine attribute-based access control through scan planning is the clearest example. When an agent shifts identity based on the task and the chain of delegation, as Cyera described when it bought Ryft, the old model of trusting the engine breaks down. Enforcing row filters and column masks during server-side planning means the policy holds no matter which agent or engine asks. That is why a security company paid nine figures for an Iceberg operations startup. The catalog and the layer around it are becoming the control plane for how agents touch enterprise data, and whoever owns that owns a lot.
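A toy model helps show why enforcing policy on the server side matters. The sketch below is not the Iceberg REST scan planning API and not any vendor's implementation. Real scan planning hands back file scan tasks, not rows. The `Policy` class, the `POLICIES` table, the role names, and `plan_scan` are all hypothetical, and the data is in memory. The one thing it illustrates is that the filter and the mask are applied before anything reaches the caller, so the result is the same whichever engine or agent asks.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]


@dataclass
class Policy:
    """A row filter plus a set of columns to mask (assumed shape)."""
    row_filter: Callable[[Row], bool]
    masked_columns: set[str] = field(default_factory=set)


# Hypothetical policies, keyed by the caller's role and stored with the catalog.
POLICIES = {
    "emea_analyst": Policy(
        row_filter=lambda row: row["region"] == "EMEA",
        masked_columns={"email"},
    ),
    "admin": Policy(row_filter=lambda row: True),
}


def plan_scan(role: str, rows: list[Row]) -> list[Row]:
    """Apply the caller's policy on the server side, before returning data."""
    policy = POLICIES.get(role)
    if policy is None:
        return []  # deny by default: an unknown caller sees nothing
    visible = []
    for row in rows:
        if policy.row_filter(row):
            visible.append({
                column: ("***" if column in policy.masked_columns else value)
                for column, value in row.items()
            })
    return visible


customers = [
    {"id": 1, "region": "EMEA", "email": "a@example.com"},
    {"id": 2, "region": "APAC", "email": "b@example.com"},
]
print(plan_scan("emea_analyst", customers))
print(plan_scan("unknown_agent", customers))
```

The caller never gets a chance to skip the policy, because it never sees the unfiltered rows. That is the property that stops depending on trusting the engine.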

This is the real reason the catalog question got urgent. A catalog used to be plumbing. In an agent-driven lakehouse it is the place trust, meaning, and access all converge, and the products are racing to become that convergence point.

For all the progress, three hard problems sit unsolved across the field.

The first is governance portability. Access control policies live in the catalog, and there is no industry standard for sharing them across catalogs. Set up row-level security in Unity Catalog and that policy does not transfer to Polaris. Define namespace grants in Polaris and they do not apply when the same table is read through Glue. The practical answer most architects reach is to pick one catalog as the governance boundary and route every engine through it, rather than running several catalogs with duplicated and inevitably inconsistent rules. Federation features in Polaris, Unity, and Gravitino help by centralizing the access layer even when metadata lives in distributed backends, and the Iceberg REST scan planning APIs are starting to make cross-engine policy enforcement real. But there is still no portable policy format, and until there is, multi-catalog governance stays a manual, error-prone job.

The second is the gap between open and managed. Every major vendor now ships an open source catalog and a managed one, and the managed version is consistently more capable. Unity Catalog open source trails the Databricks version. Snowflake’s and Dremio’s Open Catalogs track Apache Polaris closely, which is the healthiest case, but the surrounding Horizon Catalog features are Snowflake’s own. The word “open” carries weight in this space, and the careful move is to check whether the open project is the same code the vendor runs in production or a slower sibling. Polaris graduating to a top-level project with Snowflake stating it runs the same backbone is the strongest version of that promise so far. It is also the exception worth holding others against.

The third is operational reliability, and it is the one teams underestimate until it bites. The catalog is a Tier-1 dependency. If it goes down, no engine resolves metadata, and every read and write across the lake stops at once. That is a different blast radius than a single failed query. The catalogs vary widely in how ready they are for this. The managed services handle availability for you, which is most of why teams pick them. The self-hosted options put it on you: run the JVM service or the Rust binary with replication, back up the persistence layer, monitor P99 latency with a target under half a second, and plan failover before you need it. The newer projects have fewer battle-tested deployment stories, which is a real consideration for a service this central. Whatever you choose, treat the catalog with the same seriousness you treat a production database, because functionally that is what it is.
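The P99 target is easy to state and easy to get wrong in a dashboard, so here is a minimal sketch of the check. It assumes you already collect catalog request latencies in seconds from your own metrics pipeline. The nearest-rank percentile method and the helper names are choices made for this example, not something any catalog prescribes.

```python
import math

P99_TARGET_SECONDS = 0.5  # the half-second target mentioned above


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p99_within_target(samples: list[float],
                      target: float = P99_TARGET_SECONDS) -> bool:
    """True when the 99th percentile latency is under the target."""
    return percentile(samples, 99) < target


# 100 requests: most are fast, one is slow-ish, one is a 2 second outlier.
latencies = [0.05] * 98 + [0.3, 2.0]
print(percentile(latencies, 99), p99_within_target(latencies))
```

A single outlier in a hundred requests does not break the target, but a second one does. That is why an average tells you very little about a service every engine waits on.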

How to Choose in 2026

There is no single right answer, and anyone who tells you otherwise is selling something. The choice comes down to your constraints, your existing stack, and which trade-offs you accept.

If you live entirely on AWS and want zero operations, Glue or S3 Tables is the path of least resistance, and you accept the cloud coupling. If you want a vendor-neutral, multi-engine, multi-cloud catalog and you are willing to run a JVM service or use a managed Polaris offering, Apache Polaris is the community standard, available self-hosted, through Snowflake Open Catalog, or as the core of Dremio’s Open Catalog. If your workflows need branch isolation and data CI/CD, Nessie is the only option for Git-style version control, and you pair it with a policy layer for production security. If you are a Databricks shop, Unity Catalog is the natural and usually mandatory choice. If you have a heterogeneous platform with Hive here, PostgreSQL there, and Kafka somewhere else, Gravitino unifies the metadata under one API. If you want a fast, dependency-light catalog on Kubernetes with strong authorization, Lakekeeper is the cleanest pick. On GCP, BigLake Metastore is the managed default. And for local development, the SQLite JDBC catalog costs nothing and runs anywhere.

For most organizations the realistic path is not one catalog forever. You run Glue for existing AWS workloads, add Polaris for multi-engine access, and use Nessie for a development environment that needs branch isolation. The REST protocol makes that coexistence practical, and federation in Polaris, Unity, and Gravitino makes it manageable.

If there is one position worth holding firmly, it is this: bet on a REST-compatible implementation. Start with REST and you can swap catalog backends later without touching engine configuration. Start with the old Thrift-based Hive Metastore and you inherit a migration the day you outgrow it. That flexibility is worth more than any single feature on any single vendor’s slide.
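Here is a small sketch of what that flexibility looks like from the engine side, using the Spark catalog properties Iceberg documents for a REST catalog. The catalog name `lake`, the warehouse name, and both URLs are hypothetical. The sketch also assumes the new backend exposes the same warehouse name, which will not always hold. Under those assumptions the keys stay identical across backends and only the endpoint value moves. Put that endpoint behind a stable DNS name and the engine configuration need not change at all.

```python
def rest_catalog_conf(name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties for an Iceberg REST catalog named `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


def changed_keys(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List the property keys whose values differ between two configs."""
    keys = before.keys() | after.keys()
    return sorted(key for key in keys if before.get(key) != after.get(key))


# Hypothetical endpoints for two different REST catalog backends.
on_polaris = rest_catalog_conf(
    "lake", "https://polaris.example.com/api/catalog", "analytics")
on_other = rest_catalog_conf(
    "lake", "https://catalog.example.com/iceberg", "analytics")

print(changed_keys(on_polaris, on_other))
```

Moving off a Thrift-based Hive Metastore does not diff this cleanly, because the catalog type and its connection properties change along with the endpoint.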

The format war ended. The catalog war is just getting good. By the time Databricks finishes its summit on June 18, the v3 wave will be fully GA, the v4 and Delta 5.0 convergence debate will be in full swing, and agents will be querying more of these tables than people are. The teams that win the next two years are the ones who treat the catalog as the Tier-1 decision it has become, keep their governance boundary clear, and remember that resolving metadata is only half the job. Keeping the tables healthy and the agents accountable is the other half, and that half is still up for grabs.
