Prithvi S

Posted on Jun 18 • Edited on Jul 4

How Polaris Locks Down Cloud Storage: IAM, Trust, and the Anatomy of a Secure Table Request

#polaris #security #api #cloud

Most data catalogs handle the easy part: tracking tables, columns, and schemas. The hard part is making sure that when Spark, Trino, or Flink asks to read a table, it gets exactly the right credentials for exactly the right files and nothing else. No long-lived keys floating around. No blanket access to entire buckets. No hoping that someone rotated the access keys last quarter.

Apache Polaris takes a fundamentally different approach. Instead of handing out persistent credentials to every engine and hoping for the best, Polaris mints short-lived, scoped credentials on demand. But before a single credential gets vended, Polaris has already done substantial security work: establishing trust relationships with your cloud provider, validating storage locations, enforcing a two-tier RBAC model, and making sure every request passes through multiple authorization layers.

Let me walk you through how Polaris secures cloud storage from the ground up, from the moment you create a catalog to the moment a query engine reads a single Parquet file.

The Setup: Establishing Trust Before the First Request

Before Polaris can vend credentials for S3, GCS, or Azure, it needs to know who it is talking to and prove that it has the right to request access on your behalf. This happens during catalog creation, and the details vary by cloud provider.

S3: ARN, External ID, and the Trust Triangle

For AWS, Polaris uses a cross-account IAM role assumption pattern. When you configure an S3-backed catalog, you provide an IAM Role ARN. Polaris then uses AWS STS AssumeRole to request temporary credentials for that role. But AWS does not hand out credentials blindly. The trust policy on your IAM role must explicitly allow Polaris to assume it, and Polaris supports an external ID for additional protection against the confused deputy problem.

Here is the trust flow:

1. You create an IAM role in your AWS account with a trust policy that allows Polaris to assume it
2. You provide the role ARN and optional external ID to Polaris during catalog creation
3. Polaris stores this configuration in the catalog metadata
4. When an engine requests table access, Polaris calls sts:AssumeRole with that ARN
5. AWS returns temporary credentials scoped to the role's permissions
6. Polaris further restricts those credentials to the specific table path before handing them to the engine

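The external ID is a subtle but critical detail. A Polaris deployment may assume roles on behalf of many catalogs, so without an external ID, someone who knows your role ARN could register it in their own catalog configuration and get Polaris to assume your role for them. That is the confused deputy problem. When the role's trust policy requires an external ID, the AssumeRole call succeeds only if Polaris presents the value tied to your catalog. Note that AWS does not treat the external ID as a secret: it is an identifier that binds the role to one specific catalog, not a password.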
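The trust relationship described above could look roughly like the following. This is a minimal sketch: `build_trust_policy` is a hypothetical helper (not part of Polaris or any AWS SDK), and the principal ARN and external ID are placeholder values. Only the policy shape, an `sts:AssumeRole` statement with an `sts:ExternalId` condition, follows the standard IAM trust policy format.

```python
import json


def build_trust_policy(polaris_principal_arn: str, external_id: str) -> dict:
    """Build an IAM trust policy that lets one principal assume the role.

    Hypothetical helper for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The AWS identity that the Polaris service runs as (placeholder).
                "Principal": {"AWS": polaris_principal_arn},
                "Action": "sts:AssumeRole",
                # AssumeRole is allowed only when the caller presents this external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values, not real identifiers.
policy = build_trust_policy(
    "arn:aws:iam::111122223333:user/polaris-service",
    "example-external-id",
)
print(json.dumps(policy, indent=2))
```

The resulting JSON is what you would attach as the role's trust policy. The role's permission policy, which grants the actual S3 access, is a separate document.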

GCS: Service Account Delegation

For Google Cloud Storage, Polaris uses service account impersonation. You create a dedicated Google Cloud service account with the minimal permissions needed for your catalog and grant Polaris permission to impersonate it. During catalog creation, you provide the service account email address. When credential vending is needed, Polaris uses the Google Cloud IAM API to generate short-lived OAuth 2.0 access tokens for that service account.

The key difference from S3 is that GCS tokens are OAuth-based rather than session-based, but the principle is the same: Polaris never stores long-lived credentials, and the engine never sees the service account's private key.
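To make the impersonation step concrete, here is a sketch of the request that the Google Cloud IAM Service Account Credentials API expects for `generateAccessToken`. The endpoint, the `scope` and `lifetime` fields, and the read-only storage scope are part of that public API. The helper name, the service account email, and the 900-second lifetime are assumptions for illustration, and whether Polaris issues this exact call internally is also an assumption.

```python
from urllib.parse import quote

IAM_CREDENTIALS_BASE = "https://iamcredentials.googleapis.com/v1"
GCS_READ_ONLY_SCOPE = "https://www.googleapis.com/auth/devstorage.read_only"


def build_access_token_request(
    service_account_email: str, lifetime_seconds: int = 900
) -> tuple[str, dict]:
    """Return the URL and JSON body for a generateAccessToken call.

    Hypothetical helper: it only builds the request and does not send it.
    """
    account = quote(service_account_email, safe="@")
    url = (
        f"{IAM_CREDENTIALS_BASE}/projects/-/serviceAccounts/"
        f"{account}:generateAccessToken"
    )
    body = {
        "scope": [GCS_READ_ONLY_SCOPE],
        # Durations are sent as a string of seconds with an "s" suffix.
        "lifetime": f"{lifetime_seconds}s",
    }
    return url, body


# Placeholder service account email.
url, body = build_access_token_request(
    "catalog-reader@example-project.iam.gserviceaccount.com"
)
print(url)
print(body)
```

The caller of this API must itself hold permission to create tokens for the target service account, which is the "grant Polaris permission to impersonate it" step described above.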

Azure: Tenant ID and Managed Identity

For Azure Blob Storage, Polaris connects via tenant ID and either a managed identity or service principal. You configure the tenant ID, client ID, and client secret (or use a managed identity) during catalog setup. Polaris then requests tokens from Microsoft Entra ID (formerly Azure AD) and uses them to generate SAS (Shared Access Signature) tokens for the engine. These SAS tokens are time-bound and scoped to specific containers or blobs, providing fine-grained access control.

Allowed Locations: The Perimeter Guard

Regardless of the cloud provider, Polaris enforces an "allowed locations" policy on every catalog. When you create a catalog, you specify the base storage locations that the catalog is permitted to access. Polaris validates every storage path against this allowlist before vending credentials. If an engine requests access to a path outside the allowed locations, Polaris blocks the request at the storage integration layer, before any cloud API calls are made.

This is a defense-in-depth measure. Even if an attacker compromises the RBAC layer and somehow gets a credential vending request approved, they still cannot access storage outside the catalog's configured locations. The allowed locations act as a hard perimeter around the catalog's data footprint.
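A prefix check of this kind can be sketched in a few lines. This is an illustration of the idea, not Polaris's actual validation code: the function names are hypothetical, and a real implementation would also need to handle things like path traversal segments and scheme aliases.

```python
def normalize(location: str) -> str:
    """Add a trailing slash so prefix checks respect path boundaries."""
    return location if location.endswith("/") else location + "/"


def is_location_allowed(location: str, allowed_locations: list[str]) -> bool:
    """Return True if location sits under at least one allowed base location."""
    candidate = normalize(location)
    return any(
        candidate.startswith(normalize(allowed)) for allowed in allowed_locations
    )


allowed = ["s3://production-analytics/"]

print(is_location_allowed("s3://production-analytics/fact_orders/", allowed))
print(is_location_allowed("s3://other-bucket/fact_orders/", allowed))
# Without the trailing-slash normalization, this bucket would match as a prefix.
print(is_location_allowed("s3://production-analytics-copy/fact_orders/", allowed))
```

Normalizing both sides to end in a slash is the important design choice here: a naive `startswith` on the raw strings would accept a different bucket whose name merely begins with the allowed bucket's name.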

The Two-Tier RBAC Wall: Identity and Permissions, Separated

Once storage is configured, the next security layer is Polaris's two-tier RBAC model. Most systems conflate "who you are" with "what you can do." Polaris separates them explicitly.

Principal Roles: Who You Are

Principals in Polaris are service accounts or users. Each principal gets assigned one or more principal roles. These roles define the principal's identity within the system: a data scientist, a pipeline service account, a monitoring agent. Principal roles are global and answer the question: "who is making this request?"

Catalog Roles: What You Can Do

Catalog roles define permissions. A catalog role is a collection of privileges like TABLE_READ_DATA, TABLE_WRITE_DATA, CATALOG_MANAGE_ACCESS, or CATALOG_MANAGE_CONTENT. These roles are scoped to specific catalogs and can be granted to principal roles.

The separation works like this: a data scientist principal might have the "Data Scientist" principal role. That principal role is granted the "Read-Only" catalog role on the "Production Analytics" catalog. The same principal role could also be granted the "Full Access" catalog role on the "Development" catalog. The principal's identity is consistent, but their permissions vary by catalog.

This two-tier model has practical security benefits. When someone leaves the team, you revoke their principal role assignments, and every new request they make is denied across all catalogs at once (credentials already issued to them remain valid only until they expire). When a catalog's sensitivity changes, you modify the catalog role's privileges without touching any principal definitions. The separation of concerns makes audits simpler and reduces the blast radius of access changes.
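The grant chain above (principal, principal role, catalog role, privileges) can be modeled as three lookups. The sketch below is a toy in-memory model, not Polaris's data model: the class, the field names, and the role names are all assumptions chosen to mirror the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class AccessModel:
    # principal name -> principal role names
    principal_roles: dict[str, set[str]] = field(default_factory=dict)
    # (principal role, catalog) -> catalog role names granted on that catalog
    catalog_role_grants: dict[tuple[str, str], set[str]] = field(default_factory=dict)
    # (catalog, catalog role) -> privilege names
    catalog_role_privileges: dict[tuple[str, str], set[str]] = field(default_factory=dict)

    def effective_privileges(self, principal: str, catalog: str) -> set[str]:
        """Union of privileges reachable through the principal's roles."""
        privileges: set[str] = set()
        for p_role in self.principal_roles.get(principal, set()):
            for c_role in self.catalog_role_grants.get((p_role, catalog), set()):
                privileges |= self.catalog_role_privileges.get((catalog, c_role), set())
        return privileges


model = AccessModel(
    principal_roles={"alice": {"data_scientist"}},
    catalog_role_grants={
        ("data_scientist", "production_analytics"): {"read_only"},
        ("data_scientist", "development"): {"full_access"},
    },
    catalog_role_privileges={
        ("production_analytics", "read_only"): {"TABLE_READ_DATA", "TABLE_READ_PROPERTIES"},
        ("development", "full_access"): {"TABLE_READ_DATA", "TABLE_WRITE_DATA"},
    },
)

prod_before = model.effective_privileges("alice", "production_analytics")
dev_before = model.effective_privileges("alice", "development")

# Offboarding: dropping the principal role assignment removes access everywhere.
model.principal_roles["alice"].clear()
prod_after = model.effective_privileges("alice", "production_analytics")
dev_after = model.effective_privileges("alice", "development")
print(prod_before, dev_before, prod_after, dev_after)
```

Note that offboarding touches only `principal_roles`. The catalog role definitions stay intact for everyone else who holds the same principal role.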

Privileges: The Permission Matrix

Polaris defines a comprehensive privilege hierarchy. Key data access privileges include:

TABLE_READ_DATA - read table data via SELECT
TABLE_WRITE_DATA - insert, update, delete, or merge table data
TABLE_READ_PROPERTIES - read table metadata and properties
TABLE_WRITE_PROPERTIES - modify table metadata
VIEW_READ_PROPERTIES - read view metadata and definitions
VIEW_WRITE_PROPERTIES - modify view definitions and properties
CATALOG_MANAGE_ACCESS - grant or revoke roles and privileges
CATALOG_MANAGE_CONTENT - create, drop, or alter tables and namespaces

Privileges are enforced by PolarisAuthorizer, which evaluates the principal's catalog roles against the requested action on the target entity. If any catalog role grants the required privilege, the request proceeds. If none do, the request is rejected before any storage operations are considered.
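The "any catalog role grants the privilege" rule can be sketched as follows. This is a simplified illustration, not the PolarisAuthorizer implementation: the operation names, the mapping table, and the exception type are hypothetical, and the sketch ignores any rules by which a broader privilege might satisfy a narrower one.

```python
# Hypothetical mapping from a requested operation to the privilege it requires.
REQUIRED_PRIVILEGE = {
    "load_table_for_read": "TABLE_READ_DATA",
    "load_table_for_write": "TABLE_WRITE_DATA",
    "update_table_properties": "TABLE_WRITE_PROPERTIES",
}


class ForbiddenError(Exception):
    """Raised when no resolved catalog role grants the required privilege."""


def authorize(operation: str, privileges_by_catalog_role: dict[str, set[str]]) -> str:
    """Return the satisfied privilege, or raise ForbiddenError.

    privileges_by_catalog_role maps each catalog role the caller holds on the
    target catalog to the privileges that role grants.
    """
    required = REQUIRED_PRIVILEGE[operation]
    granted = any(
        required in privileges for privileges in privileges_by_catalog_role.values()
    )
    if not granted:
        raise ForbiddenError(f"{operation} requires {required}")
    return required


roles = {"read_only": {"TABLE_READ_DATA", "TABLE_READ_PROPERTIES"}}
print(authorize("load_table_for_read", roles))

try:
    authorize("load_table_for_write", roles)
except ForbiddenError as err:
    print(f"denied: {err}")
```

The check runs before any storage lookup, so a denied request never reaches the credential vending path.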

The Anatomy of a Secure Table Request

Now that we understand the setup, let us trace a single table read request through Polaris's security layers. This is where all the pieces come together.

Step 1: Authentication

The query engine sends an Iceberg REST API request to Polaris, typically with a Bearer token or mutual TLS. Polaris validates the token against its configured identity provider. The result is a Principal object representing the authenticated caller.

Step 2: Principal Role Resolution

Polaris looks up the principal's assigned principal roles. For this example, let us say our principal has the "Data Scientist" principal role.

Step 3: Catalog Role Resolution

The request targets a specific catalog. Polaris looks up which catalog roles the principal's roles have been granted on this catalog. If the "Data Scientist" principal role has been granted the "Read-Only" catalog role on this catalog, that catalog role is collected.

Step 4: Privilege Check

The request asks to read table data, which requires TABLE_READ_DATA. Polaris checks if any of the resolved catalog roles grant this privilege. If not, the request is rejected immediately with a 403. No storage APIs are called, no credentials are vended, and no cloud costs are incurred.

Step 5: Entity Resolution

With the privilege check passed, Polaris resolves the target entity: the catalog, namespace, and table. It fetches the table metadata from the persistence layer, typically via AtomicOperationMetaStoreManager and a JDBC backend. This metadata includes the table's location in cloud storage.

Step 6: Storage Configuration Lookup

Polaris retrieves the catalog's storage configuration: the S3 role ARN, GCS service account, or Azure tenant ID that was configured during setup. It also retrieves the allowed locations list for this catalog.

Step 7: Location Validation

Before any credential vending happens, Polaris validates that the table's storage location falls within the catalog's allowed locations. If the table is stored at s3://production-analytics/fact_orders/ and the allowed location is s3://production-analytics/, validation passes. If the table somehow resolved to s3://other-bucket/, Polaris would reject the request here, even though the RBAC check passed.

Step 8: Credential Vending

Now Polaris calls the cloud provider's credential API. For S3, it calls sts:AssumeRole with the catalog's role ARN. For GCS, it requests an OAuth token via the IAM API. For Azure, it generates a SAS token.

But here is the critical detail: Polaris does not just pass through whatever credentials the cloud provider returns. It restricts them further. The cloud credentials are scoped to the specific table path, not the entire catalog or bucket. And they are time-bound, typically to about 15 minutes (configurable).

For a read request, Polaris ensures the credentials are read-only. For a write request, it ensures they have write access. The principle of least privilege is enforced at the cloud credential level, not just the Polaris privilege level.
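On AWS, one standard way to narrow assumed-role credentials is a session policy passed to AssumeRole: the resulting permissions are the intersection of the role's own policy and the session policy. The sketch below builds such a policy for a single table prefix. It assumes this is roughly how the scoping works; the helper names, session name, role ARN, and exact action list are illustrative rather than taken from Polaris. The parameter names (`RoleArn`, `RoleSessionName`, `ExternalId`, `DurationSeconds`, `Policy`) are the ones the STS AssumeRole API accepts.

```python
import json
from urllib.parse import urlparse


def build_session_policy(table_location: str, allow_write: bool) -> dict:
    """Build an IAM session policy limited to one table's prefix."""
    parsed = urlparse(table_location)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    object_actions = ["s3:GetObject"]
    if allow_write:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access only under the table's prefix.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Listing is a bucket-level action, so restrict it by prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
        ],
    }


def build_assume_role_params(
    role_arn: str,
    external_id: str,
    table_location: str,
    allow_write: bool,
    duration_seconds: int = 900,
) -> dict:
    """Assemble keyword arguments for an STS AssumeRole call (not sent here)."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": "polaris-vended-session",  # hypothetical session name
        "ExternalId": external_id,
        "DurationSeconds": duration_seconds,  # 900 is the minimum STS accepts
        "Policy": json.dumps(build_session_policy(table_location, allow_write)),
    }


# Placeholder role ARN and external ID.
params = build_assume_role_params(
    "arn:aws:iam::111122223333:role/polaris-catalog-role",
    "example-external-id",
    "s3://production-analytics/fact_orders/",
    allow_write=False,
)
print(json.dumps(json.loads(params["Policy"]), indent=2))
```

With boto3, these parameters would be passed as `sts_client.assume_role(**params)`. Because the session policy can only narrow the role's permissions, a read request cannot end up with write access even if the underlying role allows it.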

Step 9: Response

Polaris returns the Iceberg REST API response to the engine, including the short-lived credentials and the table metadata. The engine uses those credentials to read the actual data files from cloud storage. The credentials expire automatically, and the engine must return to Polaris for fresh credentials on subsequent requests.
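From the engine's side, the exchange looks roughly like this. In the Iceberg REST protocol, a client asks for vended credentials with the `X-Iceberg-Access-Delegation: vended-credentials` header, and for S3 the temporary credentials can arrive in the load-table response's `config` map under the `s3.access-key-id`, `s3.secret-access-key`, and `s3.session-token` properties. The helper names and the sample response below are illustrative, with placeholder values throughout.

```python
S3_CREDENTIAL_KEYS = (
    "s3.access-key-id",
    "s3.secret-access-key",
    "s3.session-token",
)


def request_headers(bearer_token: str) -> dict:
    """Headers for a load-table call that asks the catalog to vend credentials."""
    return {
        "Authorization": f"Bearer {bearer_token}",
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }


def extract_s3_credentials(load_table_response: dict) -> dict:
    """Pull the vended S3 credentials out of a load-table response body."""
    config = load_table_response.get("config", {})
    missing = [key for key in S3_CREDENTIAL_KEYS if key not in config]
    if missing:
        raise KeyError(f"response has no vended credentials: missing {missing}")
    return {key: config[key] for key in S3_CREDENTIAL_KEYS}


# Abbreviated sample response with placeholder values.
sample_response = {
    "metadata-location": "s3://production-analytics/fact_orders/metadata/v1.metadata.json",
    "metadata": {"format-version": 2},
    "config": {
        "s3.access-key-id": "EXAMPLE_ACCESS_KEY_ID",
        "s3.secret-access-key": "EXAMPLE_SECRET_KEY",
        "s3.session-token": "EXAMPLE_SESSION_TOKEN",
    },
}

credentials = extract_s3_credentials(sample_response)
print(sorted(credentials))
```

In practice the engine's Iceberg client does this for you. The point of the sketch is that the engine only ever holds the three temporary values, never a long-lived key.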

Step 10: Audit and Revocation

Every step in this flow is auditable. Polaris logs the principal, the request, the privilege check result, the entity accessed, and the credential vending action. Because credentials are short-lived and Polaris is the single point of issuance, revocation is fast and bounded. Once a principal's role assignments are revoked, every new request fails the role resolution and privilege checks immediately. Credentials issued before the revocation remain usable until they expire, which happens within minutes, and they cannot be renewed.

Federated Credentials: Extending Security Beyond Polaris (v1.3.0)

Polaris 1.3.0 introduced a significant security enhancement: federated credential vending. Previously, when Polaris managed an external catalog (like a Hive or Hadoop-backed catalog), the query engine would use the external catalog's credentials directly. Polaris acted as a metadata pass-through, but the actual storage access was governed by the external system's credential model.

With v1.3.0, Polaris can mint credentials for external catalogs just as it does for internal ones. This means:

Unified security model: Whether a table is in an internal Polaris-managed catalog or a federated external catalog, the credential vending flow is the same: RBAC check, location validation, scoped credentials, automatic expiration.
No credential leakage: Engines never see the external catalog's long-lived credentials. They only see the short-lived, scoped credentials that Polaris mints on their behalf.
Centralized audit: All credential access flows through Polaris, even for data that lives in external systems. This gives security teams a single point of observability.
Fast revocation: Revoking a principal's Polaris role cuts off new credential requests for federated data immediately, and any credentials already issued expire within minutes, all without touching the external system's IAM configuration.

This is a subtle but important shift. Polaris is evolving from a metadata catalog into a unified security control plane for all tabular data, regardless of where it lives.

Why This Matters: The Alternative Is Credential Chaos

Without Polaris's model, organizations typically fall into one of two traps:

Trap 1: The Shared Key. Every engine gets the same long-lived IAM key with broad bucket access. Rotation is painful, so it rarely happens. When someone leaves, the key stays the same. When an engine is compromised, the attacker has access to everything.

Trap 2: The IAM Sprawl. Every team maintains their own IAM roles, service accounts, and policies. A data scientist needs access to a table, so they open a ticket. Someone creates a role. The role gets attached to a service account. The service account gets shared. Six months later, nobody knows which roles are still needed, but nobody dares delete them. This is the classic IAM debt spiral.

Polaris avoids both traps by making the catalog the single point of access control. The cloud IAM setup stays small: each catalog needs one role that Polaris is allowed to assume, with storage permissions covering that catalog's locations. Polaris handles the rest: identity, permissions, scoping, expiration, and audit. The IAM surface area shrinks dramatically, and the access model becomes comprehensible.

Performance and Security: Not a Trade-off

A common objection to credential vending is latency. If every table read requires a cloud API call to mint credentials, does that not slow things down? Minting does add latency, but Polaris keeps it off the hot path with caching.

Credential minting takes roughly 100-200 milliseconds per cloud API call. Polaris caches credentials keyed by the principal, catalog, table, and operation type. If the same engine requests the same table again within the cache window, Polaris returns the cached credentials without calling the cloud provider. The cache TTL is shorter than the credential expiration time, so a cached entry is always evicted before the credential it holds expires, and the gap between the two is the minimum lifetime an engine can expect from a cached credential.
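The caching behavior described above can be sketched as a small TTL cache. This is an illustration under stated assumptions, not Polaris's cache: the class names, the 600 and 900 second values, and the `mint` callback are hypothetical. The clock is injectable so the expiry logic is easy to exercise.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CacheKey:
    principal: str
    catalog: str
    table: str
    operation: str  # for example "read" or "write"


class CredentialCache:
    """Cache vended credentials for less time than the credentials live."""

    def __init__(
        self,
        mint: Callable[[CacheKey], dict],
        cache_ttl_seconds: float,
        credential_ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if cache_ttl_seconds >= credential_ttl_seconds:
            raise ValueError("cache TTL must be shorter than the credential lifetime")
        self._mint = mint
        self._cache_ttl = cache_ttl_seconds
        self._clock = clock
        # key -> (time the credentials were minted, credentials)
        self._entries: dict[CacheKey, tuple[float, dict]] = {}

    def get(self, key: CacheKey) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._cache_ttl:
            return entry[1]  # cache hit: no cloud API call
        credentials = self._mint(key)  # cache miss: mint fresh credentials
        self._entries[key] = (now, credentials)
        return credentials


def fake_mint(key: CacheKey) -> dict:
    """Stand-in for the cloud credential call."""
    return {"token": f"temporary-token-for-{key.table}-{key.operation}"}


cache = CredentialCache(fake_mint, cache_ttl_seconds=600, credential_ttl_seconds=900)
key = CacheKey("alice", "production_analytics", "fact_orders", "read")
first = cache.get(key)
second = cache.get(key)
print(first is second)
```

With these example numbers, a credential served from the cache always has at least 300 seconds of validity left. Including the operation type in the key keeps a read request from ever receiving cached write credentials.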

For most query patterns, the cache hit rate is high. Engines tend to read the same tables repeatedly during a query session, and batch reads for the same table are common. The practical overhead is minimal, and the security benefit is substantial.

Operational Security: Deploying Polaris Securely

Security is not just about the architecture; it is also about how you run it. Polaris provides several operational controls to keep the deployment secure:

TLS everywhere: The Iceberg REST API and all internal communications should run over TLS 1.2 or higher. Polaris supports certificate-based authentication for service-to-service communication.
Admin bootstrapping: The polaris-admin tool creates the initial principal and role assignments during first-time setup. This should be run in a secure environment, and the initial credentials should be rotated immediately after setup.
Docker and Helm: Polaris distributes Docker images and Helm charts. Security teams should scan these images, run them with non-root users, and restrict their network access to only the necessary ports and peers.
Persistence encryption: The polaris-relational-jdbc persistence layer should connect to a database with TLS encryption. The database itself should be encrypted at rest and protected by its own access controls.
Health and metrics endpoints: Polaris 1.3.0 standardized /q/health and /q/metrics endpoints. These should be exposed to monitoring systems but protected from public access, as they may reveal operational details.

Conclusion

Polaris approaches cloud storage security with a simple but powerful principle: the catalog should be the gatekeeper, not just the librarian. By combining cloud-native trust relationships, two-tier RBAC, location validation, short-lived credential vending, and comprehensive audit logging, Polaris creates a security model that is both strong and operable.

The next time you are evaluating a catalog for your data lake, ask not just "can it track my tables?" but also "can it keep my cloud credentials under control?" Polaris answers the second question with a resounding yes.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and open source enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.
