Aki for AWS Community Builders

What Is Apache Polaris? Why Open Data Catalogs Matter and How to Use Them with AWS

Original Japanese article: Apache Polarisとは何か?オープンなデータカタログが求められる理由とAWSとの組み合わせ方を整理する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In recent years, lakehouse architectures centered around Apache Iceberg have been rapidly expanding.

By placing Iceberg tables on object storage such as S3, it has become possible to query the same data from multiple engines such as Athena, Snowflake, Spark, Trino, and Dremio.
As a result, the discussion has shifted from “Where should data be placed, and which engine should be used for analysis?” to “Where should data ownership reside, and which catalog should be used to unify governance?”

Amid this trend, Apache Polaris has been attracting attention in recent years.
Apache Polaris is an open-source implementation of the Iceberg REST Catalog, led by Snowflake and donated to the Apache Software Foundation.

Multiple vendors—including Dremio, AWS, Google, Microsoft, and Confluent—are contributing to it, and it is positioned as an “open catalog” that enables cross-platform management of Iceberg tables while avoiding vendor lock-in.

In this article, I would like to think through the following:

  • What Apache Polaris is
  • Why open data catalogs are required
  • Differences from AWS Glue Data Catalog
  • Differences from Snowflake Horizon Catalog
  • How responsibilities should be divided when combining with AWS

In conclusion, Apache Polaris does not compete with AWS Glue Data Catalog or Snowflake Horizon Catalog; rather, the three are catalogs that operate at different layers.

It may be easier to understand Apache Polaris as a component that enables an architecture such as:
“The data itself resides in AWS, the catalog is open, and analysis engines are selected based on use cases.”


What is Apache Polaris?

Apache Polaris is an open-source catalog implementation compliant with the Apache Iceberg REST Catalog specification.
It was announced by Snowflake in 2024 and later became an incubation project under the Apache Software Foundation.

Official site:
https://polaris.apache.org/

What Polaris aims to achieve is a common metadata and governance foundation in a lakehouse centered around Iceberg tables.

A major characteristic is that it is not tied to any specific query engine or cloud vendor: any client can access it through the same REST API specification.
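To make "the same specification via REST" concrete, here is a small illustrative sketch (not an official client) of how request paths are laid out by the Iceberg REST Catalog OpenAPI spec that Polaris implements. The host and catalog names are placeholders.

```python
# Illustrative sketch: building Iceberg REST Catalog request paths.
# The base URL and names below are hypothetical placeholders.
from urllib.parse import quote

BASE = "https://polaris.example.com/api/catalog"  # hypothetical Polaris endpoint

def namespace_path(levels):
    # Multi-level namespaces are joined with the 0x1F unit separator,
    # which appears percent-encoded as %1F in the URL path.
    return quote("\x1f".join(levels), safe="")

def table_url(prefix, namespace_levels, table):
    # A GET on this URL returns the table's metadata (schema, snapshots, ...).
    return f"{BASE}/v1/{prefix}/namespaces/{namespace_path(namespace_levels)}/tables/{table}"

print(table_url("my_catalog", ["analytics", "sales"], "orders"))
# https://polaris.example.com/api/catalog/v1/my_catalog/namespaces/analytics%1Fsales/tables/orders
```

Because every compliant engine speaks these same paths, Spark, Trino, Snowflake, and Dremio can all resolve the same table through one catalog.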


Key Features of Apache Polaris

| Feature | Description |
| --- | --- |
| Implementation of the Iceberg REST Catalog | Accessible via standardized REST APIs; can be used directly from engines such as Spark, Trino, Flink, Snowflake, and Dremio |
| Multi-catalog architecture | Multiple catalogs can be defined within a single Polaris instance, enabling separation and management by team or business domain |
| RBAC (role-based access control) | Provides a permission model combining principals, principal roles, and catalog roles |
| External catalog integration | Can connect to other catalogs compliant with the Iceberg REST specification (e.g., Nessie, Gravitino) |
| OSS / managed support | Can be self-hosted as OSS, or used via managed offerings such as Snowflake Open Catalog or Dremio Catalog |

What Apache Polaris Solves

As Apache Iceberg has become more widely adopted, multiple Iceberg-compatible catalogs have emerged, including Hive Metastore, JDBC, Nessie, AWS Glue, and Snowflake.

Since each has its own client libraries and interfaces, the following challenges have arisen:

  • The need to implement catalog clients for each programming language
  • Inconsistent access control specifications across catalogs
  • Difficulty enforcing governance across multiple catalogs
  • As a result, the overall architecture becomes constrained by the chosen catalog

To solve these challenges, the Iceberg REST Catalog specification was introduced.
Apache Polaris is an open-source implementation of that specification, further enhanced with multi-catalog support and RBAC.

In other words, you can think of it as an open catalog for Apache Iceberg.


Polaris Security Model

The Polaris security model can be organized into the following three concepts:

  • Principal: An entity representing a user or service. Accesses Polaris via client ID/secret, etc.
  • Principal Role: A grouping of multiple catalog roles. Assigned to principals
  • Catalog Role: A set of permissions within a specific catalog. Includes permissions such as TABLE_READ_DATA, TABLE_CREATE, and NAMESPACE_LIST

For example, you can design it such that:

  • The data_engineer principal role is assigned both write access to prod_catalog and administrative access to dev_catalog
  • The data_analyst principal role is assigned only read access to prod_catalog

An important point is that RBAC is centralized on the catalog side, eliminating the need to implement access control separately for each engine.
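The chain of principal → principal role → catalog role → privilege can be sketched in a few lines. This is a minimal model for intuition, not Polaris's implementation; the role and privilege names follow the article's examples.

```python
# Minimal sketch of the Polaris RBAC chain: a principal holds principal roles,
# each principal role groups catalog roles, and each catalog role grants
# privileges within one catalog.
from dataclasses import dataclass

@dataclass
class CatalogRole:
    catalog: str
    privileges: set  # e.g. {"TABLE_READ_DATA", "TABLE_CREATE"}

@dataclass
class PrincipalRole:
    name: str
    catalog_roles: list

def can(principal_roles, catalog, privilege):
    # Access is allowed if any catalog role reachable through the
    # principal's roles grants the privilege on that catalog.
    return any(
        cr.catalog == catalog and privilege in cr.privileges
        for pr in principal_roles
        for cr in pr.catalog_roles
    )

analyst = PrincipalRole("data_analyst", [
    CatalogRole("prod_catalog", {"TABLE_READ_DATA"}),
])
engineer = PrincipalRole("data_engineer", [
    CatalogRole("prod_catalog", {"TABLE_READ_DATA", "TABLE_WRITE_DATA", "TABLE_CREATE"}),
    CatalogRole("dev_catalog", {"TABLE_CREATE", "NAMESPACE_LIST", "TABLE_READ_DATA"}),
])

print(can([analyst], "prod_catalog", "TABLE_CREATE"))   # False: read-only role
print(can([engineer], "prod_catalog", "TABLE_CREATE"))  # True
```

Since this evaluation happens on the catalog side, every engine that connects through Polaris sees the same answer.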


Why Open Data Catalogs Are Required

Let us first consider why open data catalogs are required in the first place.


Separation of Data and Engines Has Become a Premise

The greatest value of open table formats such as Apache Iceberg is the ability to separate data storage from query engines.

It has become possible to freely choose engines such as Athena, Glue, Spark, Snowflake, Dremio, and DuckDB depending on the use case when querying Iceberg tables on S3.

As a result, the key question in data platforms has shifted from “Which product should we use?” to “Where should data ownership reside, and who should be responsible for governance at which layer?”

However, while engines can now be freely selected, the remaining challenge is the catalog.


What Happens When Catalogs Are Tied to Engines

When using catalogs tightly coupled with query engines, the following situations tend to occur:

  • The data itself is open (S3 + Iceberg), but the catalog is tied to a specific engine
  • You want to reference the same table from another engine, but the catalog does not support it
  • Access control is fragmented across engines, making governance difficult
  • Every time the catalog is changed, all engine-side configurations must be redone

In other words, even if storage and formats are open, a closed catalog significantly reduces the benefits of a lakehouse.

Especially in today’s environments where multi-cloud, multiple products, and multiple engines are commonly combined, how to unify catalogs becomes a key challenge.


Requirements for an Open Catalog

Based on this background, lakehouse catalogs are expected to meet the following requirements:

| Requirement | Description |
| --- | --- |
| Compliance with standard APIs | Support vendor-neutral APIs such as the Iceberg REST Catalog specification |
| Multi-engine support | Usable across engines such as Spark, Trino, Flink, Snowflake, and Dremio |
| Centralized RBAC | Define permissions at the catalog level and apply consistent governance across all engines |
| Multi-cloud / hybrid | Not dependent on a specific cloud; capable of running on-premises when necessary |
| OSS sustainability | Not subject to discontinuation by a single vendor's decision; continuously developed by the community |

Apache Polaris is a catalog designed to satisfy these requirements.


Differences from AWS Glue Data Catalog

When building on AWS, AWS Glue Data Catalog is often positioned as the central data catalog.
Here, we will organize the differences between AWS Glue Data Catalog and Apache Polaris.


Positioning of AWS Glue Data Catalog

AWS Glue Data Catalog is a core metadata management service in AWS.

It is natively integrated with AWS analytics services such as Athena, Glue, Redshift Spectrum, and EMR, and plays the role of managing data on S3 as a catalog.

As discussed in previous articles, Glue Data Catalog is an excellent technical catalog used by data platforms.

Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies


Functional Comparison

| Aspect | AWS Glue Data Catalog | Apache Polaris |
| --- | --- | --- |
| Offering | AWS-managed (closed) | OSS / managed (Snowflake Open Catalog, Dremio Catalog, etc.) |
| API | AWS proprietary API (recently also provides Iceberg REST compatibility) | Iceberg REST Catalog specification (open) |
| Cloud support | AWS | Multi-cloud / on-prem |
| Engines | Athena, Glue, Redshift, EMR, Spark | Spark, Trino, Flink, Snowflake, Dremio, StarRocks, DuckDB |
| Multi-catalog | Account-level (logical separation via Lake Formation) | Native support for multiple catalogs within a single instance |
| Access control | IAM + Lake Formation | Built-in RBAC (principal / principal role / catalog role) |
| External catalog integration | Limited | Can integrate with Iceberg REST-compliant catalogs (Nessie, Gravitino, etc.) |
| Non-Iceberg formats | Supports Hive, JSON, CSV, Parquet, etc. | Currently Iceberg-centric (Generic Table support on roadmap) |

How to Interpret the Difference

Rather than being in a competitive relationship, it is easier to understand them as catalogs with different roles.

  • AWS Glue Data Catalog: Strong integration with AWS services, making it the primary choice for workloads completed within AWS. It supports a wide range of data lake formats beyond Iceberg and features such as S3 crawling.
  • Apache Polaris: A catalog that enables governance across multiple engines and clouds based on the industry-standard Iceberg REST API. It is effective when you want to enforce consistent RBAC across engines outside AWS (e.g., Snowflake, Dremio).

In summary:

  • If your use case is AWS-contained and includes formats beyond Iceberg, Glue Data Catalog is a practical choice
  • If you want common management of Iceberg across multiple engines and a vendor-neutral catalog layer, Polaris is suitable

Differences from Snowflake Horizon Catalog

This is often confused, so let’s clarify the difference between Snowflake Horizon Catalog and Apache Polaris.
Note that it is different from “Snowflake Open Catalog,” despite the similar name.


What is Snowflake Horizon Catalog?

Snowflake Horizon Catalog is a data governance and discovery suite provided by Snowflake.

For data managed within Snowflake (Snowflake-managed tables, stages, views, shared data, etc.), it provides:

  • Data discovery (search, tagging, descriptions)
  • Lineage
  • Data quality monitoring
  • Masking policies and row access policies
  • Automatic classification of sensitive data
  • Compliance management

In terms of positioning, it is similar to Amazon DataZone + Lake Formation + Glue Data Quality in AWS.

In other words, it is the layer responsible for cataloging and governance so that people can discover, understand, and trust data.


What is Snowflake Open Catalog (Relation to Polaris)

On the other hand, Snowflake Open Catalog is a managed offering of Apache Polaris.

Although the name is confusing, this is the lakehouse catalog that serves as an Iceberg REST Catalog.

In Snowflake’s model:

  • Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake-managed data
  • Snowflake Open Catalog (= Apache Polaris): Lakehouse catalog layer for open table formats such as Iceberg

Functional Comparison

| Aspect | Snowflake Horizon Catalog | Apache Polaris |
| --- | --- | --- |
| Primary target | Data in Snowflake (internal tables, shared data, etc.) | Iceberg tables (Generic Table support for other formats is planned) |
| Layer | Business catalog / governance layer | Lakehouse catalog layer (technical catalog) |
| Offering | Built into Snowflake (closed) | OSS / managed |
| API | Snowflake proprietary | Iceberg REST Catalog specification (open) |
| Data location | Snowflake internal storage or registered external data | Iceberg tables on cloud storage |
| Scope | Within Snowflake organizations | Across multiple engines and clouds |

How to Interpret the Difference

Again, these are not in opposition but complementary.

  • Snowflake Horizon Catalog: Upper layer that provides data to business users, handling discovery, quality, masking, etc.
  • Apache Polaris: Lower layer (metadata foundation) that exposes Iceberg tables to multiple engines

Conceptually, the structure looks like this:

```
┌──────────────────────────────────────────────┐
│  Business Catalog / Governance Layer         │ ← Snowflake Horizon Catalog
│  (Discovery / Lineage / Quality / Masking)   │   Amazon DataZone, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Lakehouse Catalog Layer                     │ ← Apache Polaris
│  (Iceberg REST Catalog / RBAC)               │   AWS Glue Data Catalog, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Data Lake (S3 / GCS / Azure Blob)           │
│  Iceberg / Parquet                           │
└──────────────────────────────────────────────┘
```

If you think of Snowflake Horizon Catalog and Apache Polaris as “choosing one or the other,” it feels unnatural, but when organized as different layers, the division of responsibilities becomes clear.


How to Combine with AWS

From here, we will consider cases where Apache Polaris is introduced into an AWS environment.
Since AWS already has a powerful catalog called Glue Data Catalog, it is important to clarify how Polaris should be positioned and who is responsible for what.


Expected Architecture

Representative configurations can be organized into the following three patterns.


Pattern 1: AWS-only (Glue Data Catalog-centered)

This is the simplest configuration.
It is a typical setup using S3 + Iceberg + Glue Data Catalog, along with Athena / Glue / Redshift Spectrum.

  • Catalog: AWS Glue Data Catalog
  • Governance: IAM + Lake Formation
  • Query engines: Athena, Redshift Spectrum, Glue ETL, EMR

If everything is completed within AWS and there is no strong need to share with external engines, this configuration remains the most practical.
There is no need to forcibly introduce Apache Polaris.


Pattern 2: AWS + Snowflake (Using Polaris as a shared catalog foundation)

This configuration is effective when you want to reference the same Iceberg tables from both AWS (e.g., Athena) and Snowflake.

  • Data storage: S3 + Iceberg
  • Catalog: Apache Polaris (OSS self-hosted or Snowflake Open Catalog)
  • AWS side: Reference Polaris as an Iceberg REST Catalog (via Spark or third-party tools)
  • Snowflake side: Connect to Polaris using External Volume and Catalog Integration (CATALOG_SOURCE = POLARIS)
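On the AWS/Spark side, Polaris can be registered with the standard Iceberg REST catalog settings. The following is a sketch with placeholder values for host, warehouse, and credentials; property names follow the Iceberg Spark runtime's catalog configuration.

```properties
# Spark conf (placeholder values) registering Polaris as an Iceberg REST catalog
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog
spark.sql.catalog.polaris.warehouse=<your_polaris_catalog>
spark.sql.catalog.polaris.credential=<polaris_client_id>:<polaris_client_secret>
spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL
```

With this in place, `SELECT * FROM polaris.<namespace>.<table>` in Spark SQL resolves tables through the same catalog that Snowflake connects to below.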

From the Snowflake side, Polaris can be referenced directly as follows:

```sql
CREATE OR REPLACE CATALOG INTEGRATION polaris_catalog_int
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://<polaris-host>/api/catalog'
    CATALOG_NAME = '<your_polaris_catalog>'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<polaris_client_id>'
    OAUTH_CLIENT_SECRET = '<polaris_client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
```
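Once the catalog integration exists, a Polaris-managed table can then be surfaced as an externally managed Iceberg table in Snowflake. This is a sketch with placeholder names (the external volume `my_s3_vol`, namespace, and table are hypothetical):

```sql
CREATE OR REPLACE ICEBERG TABLE my_orders
  EXTERNAL_VOLUME    = 'my_s3_vol'            -- points at the S3 location
  CATALOG            = 'polaris_catalog_int'  -- the integration defined earlier
  CATALOG_NAMESPACE  = 'analytics'
  CATALOG_TABLE_NAME = 'orders';
```

Snowflake then reads the same Iceberg metadata that Athena or Spark resolve through Polaris, without copying the data.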

Pattern 3: Multi-engine / multi-cloud configuration

In addition to Snowflake, this configuration includes multiple engines such as Dremio, Databricks, Trino, and Flink.

In this case, all engines reference Polaris as a common Iceberg REST Catalog.

  • Data storage: S3 (and other cloud storage if needed)
  • Catalog: Apache Polaris (center of governance)
  • Query engines: Snowflake, Dremio, Spark, Trino, Flink, etc.
  • Governance: Polaris provides unified RBAC across all engines

How to Think About Responsibility Separation

This is the key point.
When combining Polaris, AWS, Snowflake, and others, it is important to clearly define who is responsible for which layer.

| Layer | Primary owner | Notes |
| --- | --- | --- |
| Data storage (files) | AWS (S3) | Storage location of the data; the single source of truth |
| Storage access control | AWS (IAM) | Access permissions to S3 buckets/prefixes are defined on the AWS side |
| Table metadata | Apache Polaris | Source of truth for Iceberg metadata such as schemas, snapshots, and partitions |
| Table-level RBAC | Apache Polaris | Applies consistent permission rules across engines |
| ETL / pipelines | AWS Glue / Lambda / EMR / Spark | Responsible for ingestion and transformation |
| Query execution | Athena / Snowflake / Dremio / Spark | Engines selected per use case |
| Business catalog / discovery | Snowflake Horizon Catalog / Amazon DataZone | Higher-layer features for search, lineage, and quality for users |
| Data quality | Glue Data Quality / Snowflake DMF | Implemented at the engine or quality-service layer |

What is especially important is the three-layer separation:

Data resides in AWS, the catalog is Polaris, and usage is handled by each engine

By making this separation explicit:

  • AWS can focus on storage and IAM management
  • Polaris can focus on metadata and access control
  • Each query engine can focus on its strengths

Considerations When Adopting Polaris

Polaris is powerful, but there are also important considerations:

  • Operational cost when self-hosting OSS: Running on EKS or EC2 requires a metastore (e.g., PostgreSQL), authentication infrastructure, monitoring, and upgrade handling
  • Managed services are often more practical: Using Snowflake Open Catalog or Dremio Catalog significantly reduces operational burden
  • Less seamless integration with AWS services compared to Glue: For AWS-native services such as Athena, Redshift, and QuickSight, using Glue Data Catalog is far more straightforward
  • Need to avoid double governance: If IAM policies on S3 and RBAC in Polaris are inconsistent, troubleshooting becomes complex

In other words, when deciding whether to adopt Apache Polaris in an AWS environment, it is realistic to evaluate based on:

  • Whether multi-engine requirements exist
  • The organization’s stance on vendor lock-in
  • Whether operational cost is acceptable (or managed services can be used)

A Practical Approach

Personally, when considering Polaris in an AWS environment, the following phased approach is practical:

  1. Build a lakehouse within AWS using Glue Data Catalog + Iceberg
  2. When integration with other engines such as Snowflake becomes necessary, consider introducing an Iceberg REST layer
  3. At that point, compare “Glue Iceberg REST endpoint,” “Apache Polaris OSS,” and “Snowflake Open Catalog” based on requirements
  4. If multi-engine / multi-cloud requirements become clear, redesign with Polaris (especially managed) at the center

Rather than designing with Polaris from the beginning, it is often more practical to replace the catalog layer with an open one when requirements mature.


Conclusion

In this article, we organized the key points around Apache Polaris.

In the world of data platforms, storage and formats have become open, but a closed catalog halves the benefits of a lakehouse.

Therefore, there is a need for an open catalog that complies with the Iceberg REST Catalog specification and enables unified governance across multiple engines and clouds.
Apache Polaris is designed to fulfill exactly that role.

However, it is important to think not in terms of “which one to choose” among Polaris, AWS Glue Data Catalog, and Snowflake Horizon Catalog, but rather which layer each is responsible for:

  • AWS Glue Data Catalog: Technical catalog within AWS (still the primary choice for AWS-only workloads)
  • Apache Polaris: Lakehouse catalog centered on Iceberg, shared across multiple engines
  • Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake users

Even when combining with AWS, by consciously separating responsibilities as
“data in AWS, catalog in Polaris, analytics in engines, business catalog in another layer”,
you can design an architecture that leverages the strengths of each.

Going forward, lakehouse architectures are expected to increasingly adopt vendor-neutral designs.
Apache Polaris is likely to become an important component supporting that openness.

I hope this article will be helpful for those considering Apache Polaris or designing lakehouse architectures across multiple platforms such as AWS and Snowflake.
