DEV Community: Aki

Verifying How IAM and Lake Formation Behave for the Glue REST Catalog and S3 Tables

Aki — Mon, 13 Jul 2026 22:07:59 +0000

Original Japanese article: Glue REST CatalogとS3TablesのIAMとLake Formationの挙動を確かめる

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In my previous article, I hit the Iceberg REST Catalog directly and confirmed the design differences between the Glue endpoint and the S3 Tables endpoint.

In that article I wrote that "accessing through the Glue endpoint requires a Lake Formation grant." This time, I'll deliberately vary the IAM policy and the Lake Formation grant to see exactly how the behavior of each endpoint changes.

Along the way, I'll also check whether GetDataAccess is actually being called via CloudTrail, to peek inside the authorization flow itself.

I previously wrote about the design differences between the Glue and S3 Tables Iceberg REST endpoints in another article, which is worth a read alongside this one: Hitting the Iceberg REST Catalog Directly: Understanding the Differences Between Glue Data Catalog and S3 Tables

What We're Verifying Today

According to AWS documentation, the Glue REST Catalog is authorized through a combination of IAM policy and Lake Formation grants, while the S3 Tables REST endpoint is authorized through IAM alone.

Glue endpoint: IAM authorization → Lake Formation authorization
S3 Tables endpoint: s3tables IAM authorization only

Let's verify how this actually plays out, using the following matrix:

Condition	Glue endpoint	S3 Tables endpoint
IAM ✓ / LF ✓	200 (baseline)	200 (baseline)
IAM ✓ / LF ✗	?	?
IAM ✗ / LF ✓	?	?

The IAM ✓ / LF ✗ cell is especially important:

If the Glue endpoint returns 403 → Glue is genuinely enforcing Lake Formation authorization
If the S3 Tables endpoint still returns 200 under the same conditions → that confirms the difference in authorization flow

After each cell, I'll also check whether lakeformation:GetDataAccess was called via CloudTrail, to visualize the internal authorization flow.

Test Environment Setup

I'm reusing the resources created in the previous article.

Table bucket: penguin-rest-test
Namespace: analytics
Table: daily_sales
Glue integration: enabled
register-resource --with-federation: already configured (vended credentials are issued)

Note

In this test, when using Glue REST Catalog's federation, the Lake Formation permission-evaluation target was the IAM role used to vend credentials. This can be confirmed from the CloudTrail GetDataAccess event, where the session that assumed that IAM role is recorded as the userIdentity.

"IAM ✓" state: the test IAM user has the following policy attached.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:GetCatalog", "glue:GetDatabase", "glue:GetTable"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:GetTableBucket", "s3tables:GetNamespace", "s3tables:GetTable",
        "s3tables:GetTableData", "s3tables:GetTableMetadataLocation"
      ],
      "Resource": [
        "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test",
        "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test/table/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["lakeformation:GetDataAccess"],
      "Resource": "*"
    }
  ]
}

"LF ✓" state: the IAM role used to vend credentials (penguin-irc-test-role) has the following Lake Formation grants.

Database (analytics): DESCRIBE
Table (daily_sales): SELECT, DESCRIBE

IAM ✓ / LF ✓ (Baseline State)

Glue Endpoint

awscurl --service glue --region ap-northeast-1 --profile penguin-irc-test \
  "https://glue.ap-northeast-1.amazonaws.com/iceberg/v1/catalogs/123456789012:s3tablescatalog:penguin-rest-test/namespaces/analytics/tables/daily_sales"

Result (excerpt):

{
  "config": {
    "s3.access-key-id": "ASIAXXXXXXXXXXXXXXXX",
    "s3.secret-access-key": "(masked)",
    "s3.session-token": "(masked)"
  },
  "metadata-location": "s3://...(omitted)"
}

HTTP 200, with a response containing vended credentials.

S3 Tables Endpoint

BUCKET_ARN_PATH="arn:aws:s3tables:ap-northeast-1:123456789012:bucket%2Fpenguin-rest-test"

awscurl --service s3tables --region ap-northeast-1 --profile penguin-irc-test \
  "https://s3tables.ap-northeast-1.amazonaws.com/iceberg/v1/${BUCKET_ARN_PATH}/namespaces/analytics/tables/daily_sales"

Result (excerpt):

{
  "config": {
    "tableBucketId": "3b2a6702-dd99-44f2-bc03-2f9f6b273104",
    "namespaceId": "20576faa-861e-499f-a30f-4d6c7550dd55",
    "tableId": "34c72c19-610d-4e5c-bee4-6ae9f093c583"
  },
  "metadata-location": "s3://...(omitted)"
}

HTTP 200, with a response containing internal IDs (no credentials).

Checking CloudTrail (Glue Endpoint → GetDataAccess)

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
  --region ap-northeast-1 --max-results 3

Result (excerpt):

{
  "EventName": "GetDataAccess",
  "EventTime": "2026-07-10T14:39:35+09:00",
  "EventSource": "lakeformation.amazonaws.com",
  "Username": "penguin-irc-test-session",
  "CloudTrailEvent": "...\"invokedBy\":\"glue.amazonaws.com\"..."
}

GetDataAccess is recorded with invokedBy: glue.amazonaws.com. The event was logged within seconds of the request being sent, which is consistent with Glue querying Lake Formation synchronously.

No GetDataAccess was recorded for the request made against the S3 Tables endpoint.

Checking CloudTrail (the S3 Tables Endpoint's Own Call)

In addition to confirming that GetDataAccess isn't recorded, let's also check how the S3 Tables endpoint's own API call (GetTableMetadataLocation) shows up in CloudTrail.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetTableMetadataLocation \
  --region ap-northeast-1 --max-results 3 --profile toyano

Result (excerpt, masked):

{
  "EventName": "GetTableMetadataLocation",
  "EventTime": "2026-07-10T14:41:05+09:00",
  "EventSource": "s3tables.amazonaws.com",
  "Username": "penguin-irc-test-session",
  "CloudTrailEvent": "{\"userIdentity\":{\"type\":\"AssumedRole\",\"arn\":\"arn:aws:sts::123456789012:assumed-role/penguin-irc-test-role/penguin-irc-test-session\"},\"eventSource\":\"s3tables.amazonaws.com\",\"eventName\":\"GetTableMetadataLocation\",...}"
}

The eventSource is recorded simply as s3tables.amazonaws.com on its own, and there's no field equivalent to the invokedBy we saw in the Glue-side GetDataAccess event anywhere in the CloudTrailEvent. This confirms, from this angle as well, that authorization on the S3 Tables endpoint is a direct IAM evaluation of the caller's own credentials, rather than a delegated call through another service.

Whether Username shows up as penguin-irc-test (the IAM user itself) or penguin-irc-test-session (the session for the role used to vend credentials) lines up exactly with userIdentity.type (IAMUser vs. AssumedRole), giving us a way to tell which principal actually processed each request.
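The two fields discussed above (userIdentity.type and invokedBy) can be pulled out of lookup-events output with a few lines of Python. This is a minimal sketch, not the tooling used in this test: the sample entry is a hypothetical, trimmed record shaped like the excerpt above, and because the excerpts don't show where invokedBy sits inside the record, the helper simply searches the whole structure for it.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def summarize(event):
    """Condense one lookup-events entry into the fields discussed above."""
    # CloudTrailEvent holds the full record as a JSON string.
    record = json.loads(event["CloudTrailEvent"])
    return {
        "event": event["EventName"],
        "principal_type": record.get("userIdentity", {}).get("type"),
        "invoked_by": find_key(record, "invokedBy"),  # None: no invokedBy field
    }


# Hypothetical, trimmed entry shaped like the S3 Tables excerpt above.
sample = {
    "EventName": "GetTableMetadataLocation",
    "CloudTrailEvent": json.dumps({
        "userIdentity": {"type": "AssumedRole"},
        "eventSource": "s3tables.amazonaws.com",
        "eventName": "GetTableMetadataLocation",
    }),
}

print(summarize(sample))
```

Running the same helper over GetDataAccess events should surface glue.amazonaws.com in invoked_by, which makes the delegated call easy to tell apart from a direct one.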

IAM ✓ / LF ✗ (Removing the LF Grant)

Let's remove the Lake Formation grant from the IAM role used to vend credentials (penguin-irc-test-role).

aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/penguin-irc-test-role \
  --permissions "SELECT" "DESCRIBE" \
  --resource '{"Table":{"CatalogId":"123456789012:s3tablescatalog/penguin-rest-test","DatabaseName":"analytics","Name":"daily_sales"}}'

aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/penguin-irc-test-role \
  --permissions "DESCRIBE" \
  --resource '{"Database":{"CatalogId":"123456789012:s3tablescatalog/penguin-rest-test","Name":"analytics"}}'

Glue Endpoint

{
  "error": {
    "code": 403,
    "message": "Insufficient Lake Formation permission(s): Required Describe on daily_sales",
    "type": "AccessDeniedException"
  }
}

HTTP 403, with an error indicating insufficient Lake Formation permissions.

S3 Tables Endpoint

{
  "config": {
    "tableBucketId": "3b2a6702-dd99-44f2-bc03-2f9f6b273104",
    "namespaceId": "20576faa-861e-499f-a30f-4d6c7550dd55",
    "tableId": "34c72c19-610d-4e5c-bee4-6ae9f093c583"
  },
  "metadata-location": "s3://...(omitted)"
}

Still 200, no change. Whether or not the Lake Formation grant exists has no effect on authorization for the S3 Tables endpoint.

Checking CloudTrail (Glue Endpoint → GetDataAccess)

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
  --region ap-northeast-1 --max-results 3

Result (excerpt):

{
  "EventName": "GetDataAccess",
  "EventTime": "2026-07-10T14:51:29+09:00",
  "EventSource": "lakeformation.amazonaws.com",
  "Username": "penguin-irc-test-session",
  "CloudTrailEvent": "...\"invokedBy\":\"glue.amazonaws.com\"..."
}

Even after removing the Lake Formation grant, GetDataAccess is recorded with invokedBy: glue.amazonaws.com, at roughly the same time the request was sent. For the table-retrieval operation tested here on the Glue endpoint, GetDataAccess is recorded even in the 403 case, confirming that the request does reach Lake Formation's permission evaluation.

IAM ✗ / LF ✓ (Removing the s3tables IAM Actions)

After restoring the Lake Formation grant on the role, let's remove the s3tables-related actions from the IAM policy.

aws iam put-user-policy \
  --user-name penguin-irc-test \
  --policy-name GlueIRCMinimal \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["glue:GetCatalog","glue:GetDatabase","glue:GetTable"],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": ["lakeformation:GetDataAccess"],
        "Resource": "*"
      }
    ]
  }'

Glue Endpoint

{
  "error": {
    "code": 403,
    "message": "From federation source: User: arn:aws:iam::123456789012:user/penguin-irc-test is not authorized to perform: s3tables:GetTableBucket on resource: arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test because no identity-based policy allows the s3tables:GetTableBucket action",
    "type": "AccessDeniedException"
  }
}

HTTP 403. In this test, when the calling IAM user lacked s3tables permissions, we got a From federation source error, which suggests that the request is rejected before it ever reaches Lake Formation's evaluation.

S3 Tables Endpoint

{
  "error": {
    "code": 403,
    "message": "User: arn:aws:iam::123456789012:user/penguin-irc-test is not authorized to perform: s3tables:GetTableMetadataLocation on resource: ...",
    "type": "forbidden"
  }
}

Also 403, unsurprising, since the s3tables IAM actions are missing.

Checking CloudTrail (Glue Endpoint → GetDataAccess)

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
  --region ap-northeast-1 --max-results 3

Result (excerpt):

No matching GetDataAccess event was found.

No GetDataAccess was recorded corresponding to this request. This confirms that when the s3tables-related IAM permissions are removed, Lake Formation is never even queried in the first place, and the request is denied at the IAM authorization stage.
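The three 403 responses seen so far differ only in their message text, so a small classifier is enough to tell which stage rejected a request. The sketch below is an illustration under a loud assumption: it matches on the exact message fragments observed in this test, and those strings are not a documented contract, so they may change. The function name and stage labels are mine.

```python
def deny_stage(body):
    """Guess which stage rejected a request, based on messages seen in this test."""
    error = body.get("error")
    if error is None:
        return "allowed"
    message = error.get("message", "")
    if "Insufficient Lake Formation permission" in message:
        return "lake-formation"
    if message.startswith("From federation source"):
        return "iam-via-federation"
    if "is not authorized to perform" in message:
        return "iam"
    return "unknown"


# Messages abbreviated from the responses shown in this article.
lf_denied = {"error": {"code": 403, "message": "Insufficient Lake Formation permission(s): Required Describe on daily_sales"}}
federation_denied = {"error": {"code": 403, "message": "From federation source: User: arn:aws:iam::123456789012:user/penguin-irc-test is not authorized to perform: s3tables:GetTableBucket"}}

print(deny_stage(lf_denied))
print(deny_stage(federation_denied))
```

The order of the checks matters: the federation message also contains "is not authorized to perform", so it has to be tested before the plain IAM case.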

Results Summary

Condition	Glue endpoint	S3 Tables endpoint	GetDataAccess
IAM ✓ / LF ✓	200	200	Present (14:39:35, recorded under the federation role's session)
IAM ✓ / LF ✗	403	200	Present (14:51:29, recorded under the federation role's session)
IAM ✗ / LF ✓	403	403	Absent

Authorization Flow

The results confirm that each endpoint follows a genuinely different flow.

Glue REST Catalog

Calling IAM user
    |
    | Checks s3tables IAM permissions
    | via the Glue API
    ↓
Glue REST Catalog
    |
    | GetDataAccess
    | (evaluated as the federation IAM role)
    ↓
Lake Formation grant evaluation
    |
    ↓
Vended credentials issued

Here's what this test confirmed:

In this test, when the calling IAM user lacked s3tables IAM permissions, the request was denied at the IAM authorization stage without ever reaching Lake Formation's evaluation (this is where the From federation source error appears).
Once the s3tables IAM permissions are satisfied, Lake Formation permission evaluation is performed with the IAM role used to vend credentials as the principal.
GetDataAccess is recorded in CloudTrail as invokedBy: glue.amazonaws.com (it's recorded even in the 403 case when LF permissions are missing). The request time and the event time are nearly identical, showing that the evaluation happens synchronously.

S3 Tables REST Endpoint

Calling IAM user
    |
    | s3tables IAM authorization
    ↓
S3 Tables REST endpoint

This test confirmed that the S3 Tables REST endpoint operates purely on an IAM-based authorization model.

Here's what this test confirmed:

Access is controlled by the calling user's IAM policy for the s3tables API actions.
No Lake Formation-based permission control was observed on this access path.
GetDataAccess is not recorded in CloudTrail.
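The two flows can be condensed into a toy model that reproduces the results table. This is only a sketch of the observed behavior, not of how AWS implements it; the function names are hypothetical, and the untested IAM ✗ / LF ✗ combination is an extrapolation from the other three cells.

```python
def glue_endpoint(iam_ok, lf_ok):
    """Return (status, get_data_access_logged) for the Glue flow as observed."""
    if not iam_ok:
        # Denied at the IAM stage; Lake Formation is never consulted.
        return 403, False
    # Lake Formation is consulted, so GetDataAccess is logged either way.
    return (200 if lf_ok else 403), True


def s3tables_endpoint(iam_ok, lf_ok):
    """Return (status, get_data_access_logged); lf_ok is deliberately ignored."""
    return (200 if iam_ok else 403), False


for iam_ok, lf_ok in [(True, True), (True, False), (False, True)]:
    print(iam_ok, lf_ok, glue_endpoint(iam_ok, lf_ok), s3tables_endpoint(iam_ok, lf_ok))
```

The point of writing it this way is that lf_ok never appears in the body of s3tables_endpoint, which is exactly the difference the matrix was designed to expose.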

Conclusion

In this article, I confirmed the authorization flow for the Glue REST Catalog and the S3 Tables REST endpoint by switching the IAM policy and Lake Formation grants on and off.

To summarize:

For the Glue REST Catalog, Lake Formation's GetDataAccess is executed only after the calling IAM user's s3tables-related IAM permissions are satisfied.
Lake Formation's permission evaluation is performed with the IAM role used to vend credentials as the principal, not the calling IAM user.
Even when a Lake Formation permission shortfall results in a 403, GetDataAccess is still recorded in CloudTrail, confirming that the request does reach Lake Formation's permission evaluation (the request time and event time are nearly identical).
On the other hand, when the calling IAM user's s3tables-related IAM permissions are insufficient, GetDataAccess is not recorded, and Lake Formation's permission evaluation never runs.
For the S3 Tables REST endpoint, within the scope of this test, GetDataAccess was never recorded and no Lake Formation involvement was observed. The S3 Tables endpoint's own CloudTrail events also have no field equivalent to invokedBy, further confirming that authorization is a direct IAM evaluation.

By combining CloudTrail with the actual API responses, I was able to observe the real authorization flow, something that isn't visible from the documentation's description of "Lake Formation is used" alone.

The CloudTrail logs also back up the difference that the Glue REST Catalog goes through both IAM and Lake Formation authorization, while the S3 Tables REST endpoint is authorized purely through IAM.

To be fair, this all matched the documented behavior, but going through it hands-on still surfaced plenty of small insights and things I learned along the way.

I've found that combining CloudTrail with API responses like this is a genuinely useful technique for understanding the internal behavior of other AWS services too. I'd like to keep running experiments like this to uncover things about AWS's internals that documentation alone doesn't show.

I hope this article is useful to anyone trying to understand the authorization architecture behind the Iceberg REST Catalog on AWS.

Hitting the Iceberg REST Catalog Directly: Understanding the Differences Between Glue Data Catalog and S3 Tables

Aki — Fri, 10 Jul 2026 03:32:10 +0000

Original Japanese article: Iceberg REST Catalogを直接叩いて、Glue Data CatalogとS3 Tablesの違いを理解する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

Most of the time, when working with Iceberg tables, we reach for PyIceberg or Spark. I'm no exception, and honestly there were parts of the PyIceberg configuration — rest.sigv4-enabled, rest.signing-name, warehouse — that I understood only vaguely.

Iceberg defines a standard called the Iceberg REST Catalog Open API specification, and AWS implements it through two separate endpoints:

The AWS Glue Iceberg REST endpoint (https://glue.<region>.amazonaws.com/iceberg)
The Amazon S3 Tables Iceberg REST endpoint (https://s3tables.<region>.amazonaws.com/iceberg)

If two implementations follow the same spec, sending the same requests to both and comparing the results should reveal what's actually different between them.

In this article, I'll bypass clients like PyIceberg entirely and hit the REST API directly to explore the differences between the two endpoints.

To state the conclusion up front:

Even though both implement the same Iceberg REST Catalog specification, Glue is designed as an "entry point to multiple catalogs," while S3 Tables is designed as an "entry point to a single table bucket." That difference is visible just by looking at the URL paths.

I previously wrote about the relationship between S3 Tables and Glue Data Catalog in another article — worth a read alongside this one:
Does Amazon S3 Tables Replace AWS Glue Data Catalog? Understanding Their Relationship

What Is the Iceberg REST Catalog?

The Iceberg REST Catalog is a specification that standardizes Iceberg catalog operations as an HTTP API. It's published as an OpenAPI definition (YAML), and any catalog that conforms to it can be accessed the same way from clients such as PyIceberg, Spark, and Trino.

The key points of the spec are:

URL paths follow a pattern like GET /v1/{prefix}/namespaces, where {prefix} is a free-form segment
Clients first call GET /v1/config to retrieve endpoint configuration (the default prefix and other settings)
The table metadata body (schema, snapshots, etc.) is returned as JSON in the LoadTable response

In other words, when you write catalog.load_table("ns.table") in PyIceberg, what's actually happening under the hood is an HTTP request to GET /v1/{prefix}/namespaces/ns/tables/table.
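To make that mapping concrete, here is a minimal sketch of how a "namespace.table" identifier turns into a LoadTable route. The helper name is hypothetical, and it assumes a single-level namespace; it is not how PyIceberg itself builds the URL.

```python
def load_table_path(prefix, identifier):
    """Build the LoadTable route for a single-level 'namespace.table' identifier."""
    namespace, _, table = identifier.rpartition(".")
    if not namespace or "." in namespace:
        raise ValueError("expected a single-level 'namespace.table' identifier")
    return f"/v1/{prefix}/namespaces/{namespace}/tables/{table}"


print(load_table_path("{prefix}", "ns.table"))
```

Everything endpoint-specific ends up inside the prefix argument, which is why the rest of this article spends so much time on what each endpoint puts there.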

Here's a summary of the two AWS implementations before we dive in:

Item	Glue Iceberg REST endpoint	S3 Tables Iceberg REST endpoint
Endpoint	`https://glue.<region>.amazonaws.com/iceberg`	`https://s3tables.<region>.amazonaws.com/iceberg`
Contents of `{prefix}`	`/catalogs/{catalog}` (catalog hierarchy)	URL-encoded table bucket ARN
Value passed as `warehouse`	Glue catalog ID	Table bucket ARN
SigV4 signing service name	`glue`	`s3tables`
Access control	IAM + Lake Formation	s3tables IAM actions only

Here's the overall picture as a diagram:

              Iceberg REST Catalog spec
           GET /v1/{prefix}/namespaces/...
                        │
        ┌───────────────┴───────────────┐
        │                               │
  Glue endpoint                  S3 Tables endpoint
        │                               │
 prefix = /catalogs/{catalog}    prefix = table bucket ARN
 (multi-catalog hierarchy)       (one bucket = one catalog)
        │                               │
  IAM + Lake Formation           s3tables IAM actions
        │                               │
        └───────────────┬───────────────┘
                        │
              Same Iceberg table
          (points to the same metadata.json)

Note: a catalog doesn't manage the metadata.json file itself — it provides a reference to where the latest metadata-location is. The catalog's essential job is knowing where the current metadata.json lives.
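The summary table above translates fairly directly into client-side properties. The sketch below just builds the two property sets as plain dictionaries; the helper name is hypothetical, and treating "uri" as the property that carries the endpoint URL is my assumption about the client (the other keys are the ones named in this article). Nothing here is sent over the network.

```python
def rest_catalog_properties(endpoint, region, warehouse):
    """Build REST catalog properties for 'glue' or 's3tables' (illustrative only)."""
    if endpoint not in ("glue", "s3tables"):
        raise ValueError("endpoint must be 'glue' or 's3tables'")
    return {
        "uri": f"https://{endpoint}.{region}.amazonaws.com/iceberg",  # assumed key name
        "warehouse": warehouse,
        "rest.sigv4-enabled": "true",
        "rest.signing-name": endpoint,  # the SigV4 service name matches the endpoint
        "rest.signing-region": region,
    }


glue_props = rest_catalog_properties("glue", "ap-northeast-1", "123456789012")
s3tables_props = rest_catalog_properties(
    "s3tables",
    "ap-northeast-1",
    "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test",
)

for key in sorted(glue_props):
    print(key, "|", glue_props[key], "|", s3tables_props[key])
```

Only three values differ between the two dictionaries: the host, the warehouse, and the signing name.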

Setting Up the Test Environment

Let's create a test table bucket, namespace, and table via the CLI. The examples use the Tokyo region (ap-northeast-1) and 123456789012 as a placeholder account ID.

# Create a table bucket
aws s3tables create-table-bucket \
  --name penguin-rest-test \
  --region ap-northeast-1

# Create a namespace
aws s3tables create-namespace \
  --table-bucket-arn arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test \
  --namespace analytics

# Create a table (with schema)
aws s3tables create-table \
  --table-bucket-arn arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test \
  --namespace analytics \
  --name daily_sales \
  --format ICEBERG \
  --metadata '{
    "iceberg": {
      "schema": {
        "fields": [
          {"name": "sales_date", "type": "date", "required": false},
          {"name": "amount", "type": "long", "required": false}
        ]
      }
    }
  }'

SigV4 Signing

You can't hit these endpoints with plain curl.

The Iceberg REST Catalog spec defines an OAuth2-based authentication flow, but AWS's implementation uses IAM SigV4 signing instead — a standard-spec API with AWS-flavored authentication.

That rest.sigv4-enabled: true setting in PyIceberg is exactly what enables this signing.

Computing SigV4 signatures by hand is painful, so I used awscurl for this exercise.

pip install awscurl

awscurl picks up credentials from environment variables or a profile and sends SigV4-signed requests for you.

The important part is the --service option, which must specify the correct service name for signing: glue for the Glue endpoint, s3tables for the S3 Tables endpoint.
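Why the service name matters becomes clearer from the SigV4 key derivation itself: the service is one of the inputs to the signing key, so a request signed for glue cannot validate against s3tables. The sketch below shows only that derivation step using the standard library; it is not what awscurl runs internally, and the secret is a hypothetical placeholder.

```python
import hashlib
import hmac


def _hmac(key, message):
    """HMAC-SHA256 of a text message under a bytes key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key; date_stamp is YYYYMMDD."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


# Hypothetical placeholder secret, used only to show that the keys differ.
secret = "EXAMPLE-SECRET-KEY"
glue_key = signing_key(secret, "20260709", "ap-northeast-1", "glue")
s3tables_key = signing_key(secret, "20260709", "ap-northeast-1", "s3tables")

print(glue_key != s3tables_key)
```

A full signature also covers the canonical request and headers, which is the part a tool like awscurl takes care of.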

Hitting the S3 Tables Iceberg REST Endpoint

Let's start with the simpler of the two — the S3 Tables endpoint.

GET /v1/config

getConfig is the first API a REST Catalog client calls. The warehouse query parameter takes the table bucket ARN.

Note

When using awscurl, pass the raw, unencoded ARN as the query parameter value. awscurl automatically URL-encodes query parameters before computing the signature, so if you pre-encode the value yourself, it gets double-encoded and you'll get a SignatureDoesNotMatch error.

BUCKET_ARN="arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test"

awscurl --service s3tables --region ap-northeast-1 \
  "https://s3tables.ap-northeast-1.amazonaws.com/iceberg/v1/config?warehouse=${BUCKET_ARN}"

Result (excerpt):

{
  "defaults": {
    "prefix": "arn%3Aaws%3As3tables%3Aap-northeast-1%3A123456789012%3Abucket%2Fpenguin-rest-test",
    "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "write.object-storage.enabled": "true",
    "write.object-storage.partitioned-paths": "false",
    "s3.delete-enabled": "false",
    "rest-metrics-reporting-enabled": "false"
  },
  "overrides": {}
}

The interesting part is the prefix in the response. For the S3 Tables endpoint, prefix is exactly the URL-encoded table bucket ARN, and it lives under defaults. In other words, this endpoint is designed around a "one table bucket = one catalog" model — if you want to work with multiple table buckets, you register multiple catalogs on the client side.

defaults also included the write-side FileIO implementation (io-impl), object storage layout settings (write.object-storage.*), and a setting that disables S3 delete operations (s3.delete-enabled). It's interesting to see, from actual output, that some of the parameters we're used to configuring individually on the client (PyIceberg) side are actually being pushed down as server-side defaults through this config response.

Incidentally, this getConfig call is authorized under the s3tables:GetTableBucket IAM action. The official documentation includes a mapping table showing which s3tables IAM action corresponds to each REST operation.

Note

The spec's CatalogConfig also defines an endpoints field, which lets the server return a list of supported endpoints (in a format like "GET /v1/{prefix}/namespaces"). When I checked, the S3 Tables endpoint's response did not include an endpoints field.

Listing Namespaces and Tables

Now that we know the prefix, let's list namespaces and tables.

Note

There's a subtlety worth calling out here.

The warehouse query parameter for GET /v1/config needs the raw ARN, since awscurl automatically URL-encodes query parameters before signing — pre-encoding it yourself causes double-encoding and a SignatureDoesNotMatch error.

The {prefix} segment in the path, however, behaves differently. If you put the ARN into the path completely unencoded (colons and slashes as-is), the / characters inside the ARN get interpreted as path separators, and the request no longer matches the modeled URL pattern /v1/{prefix}/namespaces — you get an UnknownOperationException.

The correct approach is to leave the colons raw but percent-encode only the single / inside bucket/<table-bucket-name> as %2F. That makes the request match the correct route (and once it reaches the authorization layer, you'll get a 403 if you lack permissions, rather than a routing error).

Even though it's the same operation — putting an ARN into a URL — the encoding rules differ between query parameters and path segments. That's something you only really notice by actually hitting the endpoint.
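The two encodings are easy to reproduce with Python's urllib.parse.quote, whose safe argument controls which reserved characters are left alone. This is just an illustration of the rule described above, not part of the original test procedure.

```python
from urllib.parse import quote

BUCKET_ARN = "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test"

# Path segment: keep ":" as-is but turn the "/" inside the ARN into %2F.
path_prefix = quote(BUCKET_ARN, safe=":")

# Fully encoded form, matching defaults.prefix in the GET /v1/config response.
config_prefix = quote(BUCKET_ARN, safe="")

print(path_prefix)
print(config_prefix)
```

For the warehouse query parameter, neither form applies when using awscurl: the raw BUCKET_ARN goes in unchanged, as noted earlier.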

BUCKET_ARN_PATH="arn:aws:s3tables:ap-northeast-1:123456789012:bucket%2Fpenguin-rest-test"

# List namespaces
awscurl --service s3tables --region ap-northeast-1 \
  "https://s3tables.ap-northeast-1.amazonaws.com/iceberg/v1/${BUCKET_ARN_PATH}/namespaces"

# List tables
awscurl --service s3tables --region ap-northeast-1 \
  "https://s3tables.ap-northeast-1.amazonaws.com/iceberg/v1/${BUCKET_ARN_PATH}/namespaces/analytics/tables"

Result (excerpt):

{
  "namespaces": [
    ["analytics"]
  ]
}

{
  "identifiers": [
    {
      "name": "daily_sales",
      "namespace": ["analytics"]
    }
  ]
}

There's one difference from the spec worth noting here: the Iceberg REST spec allows multi-level namespaces (like a.b.c), but S3 Tables only supports a single level.

Let's confirm this by trying to create a multi-level namespace:

awscurl --service s3tables --region ap-northeast-1 \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["level1", "level2"]}' \
  "https://s3tables.ap-northeast-1.amazonaws.com/iceberg/v1/${BUCKET_ARN_PATH}/namespaces"

Result (excerpt):

{
  "error": {
    "code": 400,
    "message": "Multipart namespaces are not supported.",
    "type": "bad_request"
  }
}

The error message spells it out directly — "multipart namespaces are not supported" — confirming the single-level restriction empirically.

Retrieving Metadata with LoadTable

This is the API I most wanted to check today.

awscurl --service s3tables --region ap-northeast-1 \
  "https://s3tables.ap-northeast-1.amazonaws.com/iceberg/v1/${BUCKET_ARN_PATH}/namespaces/analytics/tables/daily_sales"

Result (excerpt):

{
  "metadata-location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3/metadata/00000-163a447a-c64d-44f9-8619-5db9561e549b.metadata.json",
  "metadata": {
    "format-version": 2,
    "table-uuid": "9507ea2b-7a71-40a7-b6a9-1d6e633955e1",
    "location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3",
    "current-schema-id": 0,
    "schemas": [
      {
        "type": "struct",
        "schema-id": 0,
        "fields": [
          { "id": 1, "name": "sales_date", "required": false, "type": "date" },
          { "id": 2, "name": "amount", "required": false, "type": "long" }
        ]
      }
    ],
    "default-spec-id": 0,
    "partition-specs": [{ "spec-id": 0, "fields": [] }],
    "properties": {
      "write.parquet.compression-codec": "zstd"
    },
    "current-snapshot-id": -1,
    "snapshots": []
  },
  "config": {
    "tableBucketId": "3b2a6702-dd99-44f2-bc03-2f9f6b273104",
    "namespaceId": "20576faa-861e-499f-a30f-4d6c7550dd55",
    "tableId": "34c72c19-610d-4e5c-bee4-6ae9f093c583"
  }
}

This JSON is exactly what's behind the information we normally see through table.schema() or table.snapshots() in PyIceberg. The response directly confirms that a catalog's essential job is "managing and returning the S3 path of the latest metadata.json" (metadata-location).
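As a rough illustration of what a client does with this response, the sketch below picks the current schema out of a trimmed copy of the JSON above. It only handles the flat, primitive-typed fields of this table and is not PyIceberg's actual parsing code; the helper name is hypothetical.

```python
import json

# Trimmed copy of the LoadTable response shown above.
RESPONSE = json.loads("""
{
  "metadata-location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3/metadata/00000-163a447a-c64d-44f9-8619-5db9561e549b.metadata.json",
  "metadata": {
    "current-schema-id": 0,
    "schemas": [
      {
        "type": "struct",
        "schema-id": 0,
        "fields": [
          {"id": 1, "name": "sales_date", "required": false, "type": "date"},
          {"id": 2, "name": "amount", "required": false, "type": "long"}
        ]
      }
    ],
    "current-snapshot-id": -1,
    "snapshots": []
  }
}
""")


def current_schema(load_table_result):
    """Return (name, type) pairs for the schema marked as current."""
    metadata = load_table_result["metadata"]
    schema_id = metadata["current-schema-id"]
    schema = next(s for s in metadata["schemas"] if s["schema-id"] == schema_id)
    return [(field["name"], field["type"]) for field in schema["fields"]]


print(current_schema(RESPONSE))
print(len(RESPONSE["metadata"]["snapshots"]))
```

The empty snapshots list and current-snapshot-id of -1 simply reflect that nothing has been written to the table yet.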

Incidentally, the config field in the response contained S3 Tables-internal identifiers: tableBucketId, namespaceId, and tableId. The spec lets a server use config to hand temporary credentials (vended credentials) to the client, which is what the Glue endpoint is expected to do, but here on the S3 Tables endpoint there's no credential delegation at all. It just returns internal management IDs.

Note

The S3 Tables endpoint also documents a limit: operations against a table whose metadata.json exceeds 50MB return a 400 error. Worth keeping in mind — if metadata bloats from accumulated snapshots and similar growth, you could hit this API-level ceiling.

Checking What's Happening Under the Hood via CloudTrail

An interesting characteristic of the S3 Tables endpoint is that REST API calls get logged in CloudTrail as their corresponding native S3 Tables actions. Per the official documentation, a single LoadTable call logs both of the following:

GetTableMetadataLocation (a management event)
A data event corresponding to GetTableData

In other words, the audit log itself suggests that this endpoint works as a proxy that translates the Iceberg REST API into native S3 Tables API calls.

Let's check CloudTrail directly:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetTableMetadataLocation \
  --region ap-northeast-1 \
| jq '.Events[0] | {time: .EventTime, user: .Username, event: .EventName}'

Result (excerpt):

{
  "time": "2026-07-09T13:46:44+00:00",
  "user": "penguin-test-user",
  "event": "GetTableMetadataLocation"
}

Even though we only called LoadTable over REST, CloudTrail logs it under the native S3 Tables event name GetTableMetadataLocation, complete with the exact caller (IAM username).

Hitting the Glue Iceberg REST Endpoint

Now let's look at the Glue endpoint, which behaves quite differently.

GET /v1/config

awscurl --service glue --region ap-northeast-1 \
  "https://glue.ap-northeast-1.amazonaws.com/iceberg/v1/config?warehouse=123456789012"

Result (excerpt):

{
  "defaults": {
    "prefix": "123456789012",
    "header.Content-Type": "application/x-amz-json-1.1",
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": "ap-northeast-1",
    "rest-table-scan-enabled": "true",
    "rest-data-commit-enabled": "true",
    "token-refresh-enabled": "false"
  },
  "overrides": {
    "prefix": "catalogs/123456789012"
  }
}

There's a difference from the S3 Tables endpoint worth calling out here: defaults.prefix is "123456789012" (just the account ID), while overrides.prefix is "catalogs/123456789012" (with catalogs/ prepended) — the two values genuinely differ. Per spec, overrides takes precedence over defaults, so the prefix a client should actually use is catalogs/123456789012. It's confirmed, from actual output, that defaults and overrides really can disagree.
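The precedence rule is simple enough to sketch: server defaults are applied first, then whatever the client configured, then server overrides. The function below is a hypothetical illustration of that merge order using the values from the response above, not the code any particular client runs.

```python
def effective_config(defaults, client, overrides):
    """Merge in increasing precedence: server defaults, client settings, overrides."""
    return {**defaults, **client, **overrides}


# Values copied from the Glue getConfig response above (subset).
defaults = {
    "prefix": "123456789012",
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": "ap-northeast-1",
}
overrides = {"prefix": "catalogs/123456789012"}

# Hypothetical client-side setting, only here to show where it sits in the order.
client = {"warehouse": "123456789012"}

config = effective_config(defaults, client, overrides)
print(config["prefix"])
```

With this ordering, the prefix a client ends up using is the one from overrides, while the signing settings from defaults survive untouched.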

The other defaults entries are worth a look too — rest.sigv4-enabled and rest.signing-name map directly onto the PyIceberg configuration items of the same name. rest-table-scan-enabled and rest-data-commit-enabled are presumably feature flags for server-side scan planning and write commits, discussed further below.

Incidentally, this response also didn't include an endpoints field.

What you pass as warehouse for the Glue endpoint is a Glue catalog ID (defaulting to the current account's root catalog if omitted). Where the S3 Tables endpoint's warehouse was "a bucket ARN," here it's "where in the catalog hierarchy to connect."

Prefix Rules: Encoding the Catalog Hierarchy

The Glue endpoint's prefix always takes the form /catalogs/{catalog}. The official documentation defines the following rules for how {catalog} is written:

Target	Prefix notation	Example REST path
Default catalog of the current account	`:`	`GET /v1/catalogs/:/namespaces`
Default catalog of a specific account	Account ID	`GET /v1/catalogs/123456789012/namespaces`
Nested catalog	`catalog1:catalog2`	`GET /v1/catalogs/rmscatalog1:db1/namespaces`
Nested catalog of a specific account	`accountId:catalog1:catalog2`	`GET /v1/catalogs/123456789012:s3tablescatalog:bucket/namespaces`

The : notation for the default catalog is confusing at first glance, but it makes more sense once you think of it as "encoding the catalog hierarchy's separator as : instead of the path separator /." It's essentially using the free-form nature of the Iceberg REST spec's prefix to cram Glue's multi-catalog hierarchy into the URL.
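The rules in the table can be written down as a tiny helper. This is a sketch under an assumption: it takes a Glue catalog ID in which nested levels may be separated by "/" (for example an accountId:catalog/child form) and rewrites those separators as ":". The function name is hypothetical and it does not validate that the catalog exists.

```python
def glue_prefix(catalog_id=""):
    """Map a Glue catalog ID to the {prefix} used in REST paths."""
    if not catalog_id:
        return "catalogs/:"  # default catalog of the calling account
    # Nested levels are joined with ":" because "/" is the URL path separator.
    return "catalogs/" + catalog_id.replace("/", ":")


print(glue_prefix())
print(glue_prefix("123456789012"))
print(glue_prefix("123456789012:s3tablescatalog/penguin-rest-test"))
```

Prepending /v1/ and appending /namespaces to each result gives paths of the same shape as the example REST paths in the table.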

Listing Namespaces (Default Catalog)

awscurl --service glue --region ap-northeast-1 \
  "https://glue.ap-northeast-1.amazonaws.com/iceberg/v1/catalogs/123456789012/namespaces"

Note

The default catalog can be expressed either as a single colon (:) or as the account ID. Here I'm using the catalogs/{accountId} form, which is also what showed up in overrides.prefix in the getConfig response above.

Result (excerpt):

{
  "namespaces": [
    ["hive"],
    ["icebergdb"]
  ]
}

What comes back is the familiar list of Glue databases — the response directly confirms the mapping "Glue database = Iceberg namespace." Notably, S3 Tables-backed databases under s3tablescatalog are not included here. The default catalog only covers Glue databases directly under the account; to see S3 Tables namespaces, you need to explicitly reach them via the nested catalog prefix, described below.

Incidentally, the Glue endpoint has the same single-level namespace restriction as S3 Tables. What looks like a multi-level structure isn't expressed by making namespaces deeper — it's expressed by nesting catalogs. That seems to be a consistent design choice on Glue's part.

Reading an S3 Tables Table Through the Glue Endpoint

This is the heart of today's investigation: reading the same table we read via the S3 Tables endpoint, this time through the Glue endpoint.

S3 Tables with Glue integration enabled is mounted under the s3tablescatalog federated catalog. In PyIceberg, you'd specify warehouse as 123456789012:s3tablescatalog/penguin-rest-test. Applying the earlier prefix-conversion rule, the path becomes:

awscurl --service glue --region ap-northeast-1 \
  "https://glue.ap-northeast-1.amazonaws.com/iceberg/v1/catalogs/123456789012:s3tablescatalog:penguin-rest-test/namespaces/analytics/tables/daily_sales"

Result (excerpt):

{
  "config": {
    "createdBy": "123456789012",
    "s3TableArn": "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test/table/34c72c19-610d-4e5c-bee4-6ae9f093c583",
    "ownerAccountId": "123456789012",
    "metadata_location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3/metadata/00000-163a447a-c64d-44f9-8619-5db9561e549b.metadata.json",
    "format": "ICEBERG",
    "warehouse_location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3",
    "table_type": "ICEBERG"
  },
  "metadata": {
    "format-version": 2,
    "table-uuid": "9507ea2b-7a71-40a7-b6a9-1d6e633955e1",
    "current-schema-id": 0,
    "schemas": [
      {
        "type": "struct",
        "schema-id": 0,
        "fields": [
          { "id": 1, "name": "sales_date", "required": false, "type": "date" },
          { "id": 2, "name": "amount", "required": false, "type": "long" }
        ]
      }
    ],
    "current-snapshot-id": -1
  },
  "metadata-location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3/metadata/00000-163a447a-c64d-44f9-8619-5db9561e549b.metadata.json"
}

Note

The nested-catalog prefix notation used above (accountId:s3tablescatalog:bucketName) was derived from the conversion-rule table in the official documentation. Sending an actual request in this format resulted in correct routing, and both authorization and the underlying data checked out.

The metadata-location and table-uuid (9507ea2b-7a71-40a7-b6a9-1d6e633955e1) here match exactly what we retrieved via the S3 Tables endpoint earlier. Different entry point, same underlying entity.

The config contents are worth noting too. In this request — where we didn't set the x-iceberg-access-delegation header — config didn't contain S3 access credentials, only metadata like s3TableArn (the original S3 Tables ARN) and warehouse_location (the actual S3 path). As I'll cover below, one more configuration step was needed before credentials would actually come back.

For the same table: through the S3 Tables endpoint the path is /v1/{bucketArn}/namespaces/analytics/tables/daily_sales, while through the Glue endpoint it's /v1/catalogs/{accountId}:s3tablescatalog:{bucketName}/namespaces/analytics/tables/daily_sales. Physically they point to the same metadata.json, but the path structures are completely different because the two services have entirely different mental models of "what a catalog is." That, I think, is the essential difference between the two endpoints.
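The two path shapes can be put side by side in a short sketch. The helper names are hypothetical, the bucket ARN is inferred from the s3TableArn shown in the response above, and percent-encoding the whole ARN with urllib.parse.quote is my assumption about how the S3 Tables prefix is written into the URL.

```python
from urllib.parse import quote


def s3tables_table_path(bucket_arn, namespace, table):
    """LoadTable path on the S3 Tables endpoint: the prefix is the bucket ARN."""
    prefix = quote(bucket_arn, safe="")  # encode ":" and "/" inside the ARN
    return f"/v1/{prefix}/namespaces/{namespace}/tables/{table}"


def glue_table_path(account_id, bucket_name, namespace, table):
    """LoadTable path on the Glue endpoint: the prefix encodes the catalog hierarchy."""
    catalog = f"{account_id}:s3tablescatalog:{bucket_name}"
    return f"/v1/catalogs/{catalog}/namespaces/{namespace}/tables/{table}"


bucket_arn = "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test"

print(s3tables_table_path(bucket_arn, "analytics", "daily_sales"))
print(glue_table_path("123456789012", "penguin-rest-test", "analytics", "daily_sales"))
```

Only the part after /v1/ and before /namespaces differs, and that part is exactly where each service expresses its own idea of a catalog.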

Access Control Differences (Lake Formation)

Accessing via the Glue endpoint requires Lake Formation grants in addition to IAM policies (glue:GetCatalog, glue:GetTable, etc.), because S3 Tables tables get registered as Lake Formation resources when Glue integration is enabled.

Further, for an external engine to read the actual data, you need to enable full-table access for external engines in Lake Formation, and allow the IAM role to call lakeformation:GetDataAccess. This mechanism is what issues the temporary credentials known as vended credentials.

In practice, though, this alone wasn't enough. Vended credentials only started being issued after two more steps: explicitly registering the table bucket as a federated Lake Formation resource via aws lakeformation register-resource --with-federation, and setting up a trust policy so the Lake Formation service itself (lakeformation.amazonaws.com) could assume the IAM role used for issuing credentials. If this registration step is missing, requests with the header described below just come back with a credential-less config (only s3TableArn and warehouse_location), even when the IAM policies and Lake Formation grants are otherwise correct. There is no error at all, which makes it an easy thing to miss.

When PyIceberg hits the Glue endpoint, it attaches this header to the request:

x-iceberg-access-delegation: vended-credentials

Per the Iceberg REST spec, this header signals to the server that the client wants credential delegation. However — at least in this environment — as long as the federation registration was complete, calling LoadTable without this header still returned credentials the same way.

In other words, what actually determines whether credentials come back doesn't seem to be the presence of the header, but whether the Lake Formation federation registration is complete. The header is the spec-compliant way to signal intent, but at least in this environment it wasn't the deciding factor in the Glue endpoint's behavior.

Still, to verify things properly per spec, let's call LoadTable with the header attached:

# With the vended-credentials header
awscurl --service glue --region ap-northeast-1 \
  -H "x-iceberg-access-delegation: vended-credentials" \
  "https://glue.ap-northeast-1.amazonaws.com/iceberg/v1/catalogs/123456789012:s3tablescatalog:penguin-rest-test/namespaces/analytics/tables/daily_sales"

Result (excerpt):

{
  "metadata-location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3/metadata/00000-163a447a-c64d-44f9-8619-5db9561e549b.metadata.json",
  "metadata": { "...": "..." },
  "config": {
    "s3.access-key-id": "(masked)",
    "s3.secret-access-key": "(masked)",
    "s3.session-token": "(masked)",
    "s3.session-token-expires-at-ms": "1783651228000",
    "s3TableArn": "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test/table/34c72c19-610d-4e5c-bee4-6ae9f093c583",
    "warehouse_location": "s3://34c72c19-610d-4e5c-d8d1qwx3db1p3tbr8g9jch9n6e14sapn1b--table-s3"
  }
}

s3.access-key-id, s3.secret-access-key, and s3.session-token are actually present here — these are the vended credentials. Along with s3.session-token-expires-at-ms (session expiration in epoch milliseconds), this confirms from the response itself that the catalog is issuing a temporary storage key directly.
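Because a missing registration produces no error, it can help to check the LoadTable response programmatically. The following is a minimal sketch with hypothetical helper names. It relies only on the config keys visible in the response above and treats the expiry value as epoch milliseconds.

```python
from datetime import datetime, timezone

CREDENTIAL_KEYS = ("s3.access-key-id", "s3.secret-access-key", "s3.session-token")


def has_vended_credentials(config):
    """True only if every S3 credential key is present and non-empty."""
    return all(config.get(key) for key in CREDENTIAL_KEYS)


def credentials_expiry(config):
    """Convert s3.session-token-expires-at-ms to an aware UTC datetime, if present."""
    raw = config.get("s3.session-token-expires-at-ms")
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw) / 1000, tz=timezone.utc)


# Shapes taken from the two responses shown in this article.
without_credentials = {"s3TableArn": "(omitted)", "warehouse_location": "(omitted)"}
with_credentials = {
    "s3.access-key-id": "(masked)",
    "s3.secret-access-key": "(masked)",
    "s3.session-token": "(masked)",
    "s3.session-token-expires-at-ms": "1783651228000",
}

print(has_vended_credentials(without_credentials))  # False
print(has_vended_credentials(with_credentials))     # True
print(credentials_expiry(with_credentials).isoformat())  # 2026-07-10T02:40:28+00:00
```

A check like this turns the silent "credential-less config" case into something a script can flag explicitly.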

The S3 Tables endpoint, meanwhile, has no notion of Lake Formation at all — authorization there is handled entirely via s3tables:* IAM actions, and resource-based policies on the table bucket are also available.

Same underlying table, but the authorization model changes depending on which entry point you go through — something worth keeping in mind during access-design work. Incidentally, exactly how far you can deliberately vary Lake Formation permissions and what happens as a result (does removing a grant actually deny access, etc.) feels like enough material for its own article, so I'll leave that for another time.

Summarizing the Differences Between the Two Endpoints

Here's what actually turned up from hitting both endpoints directly.

Basic Structure

Item	Glue endpoint	S3 Tables endpoint
Endpoint	`glue.<region>.amazonaws.com/iceberg`	`s3tables.<region>.amazonaws.com/iceberg`
Signing service name	`glue`	`s3tables`
`warehouse` value	Glue catalog ID	Table bucket ARN
prefix	`/catalogs/{catalog}` (hierarchy encoding)	URL-encoded bucket ARN
Catalog scope	Account-wide (multi-catalog hierarchy)	Single table bucket
Storage targeted	General-purpose S3 Iceberg + S3 Tables	S3 Tables only
Access control	IAM + Lake Formation (hybrid possible)	s3tables IAM actions only
Credential vending	Vended credentials (requires `register-resource --with-federation`)	None (uses the caller's own credentials)
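The first three rows of the table map fairly directly onto client configuration. As a sketch, the helper below (a hypothetical name) builds a PyIceberg-style property dictionary for either endpoint. rest.sigv4-enabled and rest.signing-name are the items mentioned earlier. The remaining keys (type, uri, warehouse, rest.signing-region) are PyIceberg REST catalog properties as I understand them, so treat them as assumptions and check them against your PyIceberg version.

```python
def rest_catalog_properties(endpoint, region, warehouse):
    """Build REST catalog properties for the 'glue' or 's3tables' endpoint.

    The signing service name must match the endpoint, otherwise the
    SigV4 signature is rejected with a 403.
    """
    if endpoint not in ("glue", "s3tables"):
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return {
        "type": "rest",
        "uri": f"https://{endpoint}.{region}.amazonaws.com/iceberg",
        "warehouse": warehouse,
        "rest.sigv4-enabled": "true",
        "rest.signing-name": endpoint,
        "rest.signing-region": region,
    }


glue_props = rest_catalog_properties(
    "glue", "ap-northeast-1", "123456789012:s3tablescatalog/penguin-rest-test"
)
s3tables_props = rest_catalog_properties(
    "s3tables",
    "ap-northeast-1",
    "arn:aws:s3tables:ap-northeast-1:123456789012:bucket/penguin-rest-test",
)
print(glue_props["uri"], glue_props["rest.signing-name"])
print(s3tables_props["uri"], s3tables_props["rest.signing-name"])
```

A dictionary like this could then be passed to PyIceberg's load_catalog as keyword properties. Deriving the URI and the signing name from the same argument keeps the two from drifting apart.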

Where to Use Which, and Gotchas

The official documentation gives guidance on when to use each endpoint:

If you only need basic read/write access to a single table bucket: the S3 Tables endpoint
If you need to integrate multiple catalog sources, or need centralized governance and fine-grained access control via Lake Formation: the Glue endpoint

From actually hitting both directly, here are a few things I'd flag:

Getting the signing service name wrong (glue / s3tables) results in a 403. This is exactly the kind of mistake a wrong rest.signing-name in your client config produces.
Same table, different authorization model depending on which endpoint you go through (whether Lake Formation is involved). Be explicit up front about which path you're connecting through when designing permissions.
CTAS isn't supported on either endpoint. You can work around this by splitting it into CREATE TABLE + INSERT INTO.
dropTable on the S3 Tables endpoint requires purge=true. Depending on your Spark version, DROP TABLE PURGE can end up sending purge=false anyway — in that case, you'll need to delete via the native DeleteTable API instead.

Looking Ahead (Some Personal Thoughts)

Hitting both endpoints directly left me with the impression that each service's design philosophy shows through in how it uses the "free-form" parts of the Iceberg REST Catalog spec.

The S3 Tables endpoint — with its prefix being the bucket ARN, authorization handled through s3tables IAM actions, and CloudTrail logging it as a native API — looks, from this angle, like a thin translation layer over the storage API.

The Glue endpoint, on the other hand, encodes a catalog hierarchy into the prefix, has Lake Formation stepping into authorization, and (when configured correctly) issues temporary storage keys via vended credentials. Both general-purpose S3-based Iceberg and S3 Tables are visible through the same entry point — which lines up with something I speculated in an earlier article, that Glue Data Catalog may be evolving into a metadata plane for AWS as a whole. That same idea seems to show up directly in how the REST API's paths are designed.

The fact that authentication is SigV4 rather than the OAuth the spec assumes is another data point suggesting that how a vendor uses the "free" parts of a standard spec reveals something about its design thinking — probably not unique to the Iceberg REST Catalog.

One more thing I noticed while reading the spec: the latest version defines server-side scan planning endpoints (planTableScan and friends), where the server does the work of building a scan plan. Glue already has a separate extension endpoint (https://glue.<region>.amazonaws.com/extensions) that independently offers server-side scan planning for Redshift Managed Storage — so it looks like a capability that existed as a proprietary extension is now being absorbed into the standard spec. That's an interesting trend — a proprietary extension leading the way before the standard catches up — and I'd like to compare Glue's extension API against the standard plan-related endpoints in a future article.

Snowflake's Catalog-Linked Database (CATALOG_API_TYPE = AWS_GLUE) connects to exactly this Glue endpoint, so seeing how this API gets called from the Snowflake side is another topic I want to dig into going forward.

Conclusion

In this article, I hit the Iceberg REST Catalog directly to sort out the differences between the Glue Data Catalog and S3 Tables endpoints. It took some effort, but I came away with a clearer understanding and a few new insights.

To summarize:

Design philosophy: even though both implement the same Iceberg REST Catalog spec, Glue is an "entry point to multiple catalogs" while S3 Tables is an "entry point to a single table bucket" — a difference visible just from the URL paths.
Prefix design: the spec's free-form {prefix} is used by Glue to encode a catalog hierarchy (/catalogs/{catalog}), and by S3 Tables to encode the table bucket ARN.
Authentication: both use IAM SigV4. Getting the signing service name wrong (glue / s3tables) results in a 403.
Authorization model: for the same table, the Glue endpoint goes through IAM + Lake Formation, while the S3 Tables endpoint uses only s3tables IAM actions — the model changes depending on the path.
Vended credentials: temporary credentials aren't issued just by attaching the x-iceberg-access-delegation: vended-credentials header — you also need to complete resource registration via register-resource --with-federation.
Debugging tips: inspecting the actual CanonicalRequest and headers with awscurl -v, using CloudTrail's errorCode to triage error types, and pre-checking URL assembly with echo were all useful, unglamorous techniques.

Day to day, relying on PyIceberg's or Spark's abstractions is more than enough. But looking at the raw HTTP requests once gives you a clearer mapping from each client configuration item to the part of the request it controls, which makes connection errors much easier to triage.

I hope this article is useful to anyone trying to understand how the Iceberg REST Catalog actually works.

Does Amazon S3 Tables Replace AWS Glue Data Catalog? Understanding Their Relationship

Aki — Wed, 01 Jul 2026 14:16:20 +0000

Original Japanese article: S3 TablesはGlue Data Catalogを置き換えるのか考えてみた

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

When I first started exploring Amazon S3 Tables, one question immediately came to mind:

"Does this service eventually replace AWS Glue Data Catalog?"

Perhaps not everyone has the same impression. However, because S3 Tables provides its own Iceberg REST Catalog endpoint and can create and manage namespaces and tables without integrating with Glue Data Catalog, I couldn't help but wonder about it.

The official documentation uses the term "integration", which suggests that the relationship is more nuanced than a simple replacement. Still, it can be difficult to understand how these two services actually fit together.

In this article, I'd like to organize the relationship between Amazon S3 Tables and AWS Glue Data Catalog based on the official AWS documentation.

To summarize upfront:

Amazon S3 Tables does not replace AWS Glue Data Catalog. Instead, it should be viewed as a new managed table storage service that provides its own Iceberg REST Catalog while also integrating with Glue Data Catalog.

Understanding the Roles of S3 Tables and Glue Data Catalog

To begin with, Amazon S3 Tables is not a replacement for Glue Data Catalog.

At the same time, it is not entirely accurate to think of S3 Tables as "just storage."

S3 Tables provides its own Iceberg REST Catalog endpoint:

https://s3tables.<region>.amazonaws.com/iceberg

Even without integrating with Glue Data Catalog, you can create, list, delete, and read/write namespaces and tables directly through this endpoint.

The official documentation describes this standalone endpoint as being suitable for scenarios where you only need basic read/write access to a single table bucket. For other scenarios, AWS recommends using the Glue Iceberg REST endpoint, which provides:

Integrated table management
Centralized governance
Fine-grained access control

The roles of the two services can be summarized as follows:

Service	Role
Amazon S3 Tables	A storage layer that stores Iceberg table data and metadata while also providing an Iceberg REST Catalog for a single table bucket
AWS Glue Data Catalog	A centralized metadata catalog for AWS tables and databases, including S3 Tables, providing unified access and governance across analytics services such as Athena and Redshift

In other words, the central catalog for governance across AWS analytics services is still Glue Data Catalog.

S3 Tables is simply one of the catalog sources that can be integrated into Glue Data Catalog as a federated catalog.

Considering recent features such as S3 Tables, Catalog Federation, S3 Metadata, and S3 Annotations, all of which bring various metadata sources under Glue Data Catalog, the role of Glue Data Catalog may become even more important in the future.

If you only need to manage a single table bucket, the standalone REST Catalog provided by S3 Tables may be sufficient.

However, if you need to work across multiple analytics services or multiple catalog sources, using Glue Data Catalog as the entry point is likely to be much easier operationally.

Understanding Federated Catalogs

By integrating S3 Tables with Glue Data Catalog, you can use a single catalog to discover and query data in Amazon S3 data lakes and even join that data with S3 Tables.

A federated catalog allows users to access metadata through Glue Data Catalog without needing to know where that metadata is actually stored.

From a user's perspective, S3 Tables integration works in a similar way—you can access it as another catalog within Glue Data Catalog without needing to know where the underlying metadata physically resides.

The integration maps S3 Tables resources into Glue Catalog objects as follows:

An S3 table bucket becomes a Data Catalog catalog.
An S3 namespace becomes an AWS Glue database.
An S3 table becomes an AWS Glue table.

When integration is enabled through the console, AWS automatically creates another layer on top called:

s3tablescatalog

When you integrate S3 Tables with Data Catalog through the Amazon S3 console, AWS creates a federated catalog named s3tablescatalog, which acts as the parent catalog for all existing and future S3 table buckets in that account and Region.

From the perspective of query engines, the architecture looks like this:

Athena / Redshift / Glue ETL
              │
              ▼
      Glue Data Catalog
              │
              ├── Traditional Glue Catalog Tables
              ├── S3 Tables (via s3tablescatalog)
              └── External Catalogs (via Catalog Federation)

Regardless of where the metadata actually resides, all of these sources appear as tables under Glue Data Catalog.

The hierarchy inside S3 Tables looks like this:

s3tablescatalog (federated catalog)
        └── analytics-bucket (child catalog = S3 table bucket)
                └── sales (database = S3 namespace)
                        └── transactions (table = S3 table)

For example, if you have a table bucket named analytics-bucket containing a namespace called sales and a table called transactions, it can conceptually be represented in Glue Data Catalog as:

s3tablescatalog/analytics-bucket/sales/transactions

In Athena SQL, the same table is referenced as:

"s3tablescatalog/analytics-bucket"."sales"."transactions"

The important thing to remember is that the parent catalog layer, s3tablescatalog, sits in front of the table bucket.
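If you generate Athena queries from code, the quoting is easy to get wrong, so here is a small sketch of the reference format shown above. The function name is hypothetical, and it assumes the names contain no double quotes that would need escaping.

```python
def athena_table_reference(bucket, namespace, table, parent="s3tablescatalog"):
    """Build the three-part Athena reference for an S3 Tables table.

    The parent catalog and the table bucket together form the catalog
    name ("parent/bucket"), followed by the database and the table.
    """
    return f'"{parent}/{bucket}"."{namespace}"."{table}"'


ref = athena_table_reference("analytics-bucket", "sales", "transactions")
print(f"SELECT * FROM {ref} LIMIT 10")
```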

Note:
The four-level hierarchy described above applies to same-account scenarios.
In cross-account scenarios, individual S3 table buckets must be mounted manually into Data Catalog, resulting in a three-level hierarchy.

Trying It Out

Let's verify how an S3 table bucket becomes visible from Glue Data Catalog and how it can be queried from Athena.

Creating a Table Bucket

Create a table bucket in the console and enable the "Enable integration" checkbox.

If you're using the CLI:

aws s3tables create-table-bucket \
  --name <table-bucket-name> \
  --region ap-northeast-1

At this point, integration with Glue Data Catalog is automatically configured.

Querying from Athena

To query the table from Athena, specify:

s3tablescatalog/<table-bucket-name>

as the catalog.

Viewing It from Glue Data Catalog

When you open the Glue Data Catalog console, you'll see the table bucket under s3tablescatalog, followed by its namespaces and tables.

You can also view the schema information directly.

One particularly nice aspect of this integration is that, without any special configuration, you can access S3 Tables data directly from the familiar Athena query editor and Glue Data Catalog console.

Existing Access Control Mechanisms Continue to Work

From an access management perspective, the existing Glue Data Catalog and Lake Formation mechanisms continue to work.

Data Catalog supports two access control modes for S3 Tables integration.

Mode	Description
IAM Access Control	Controls access to both S3 Tables and Data Catalog through IAM policies
AWS Lake Formation Access Control	Uses Lake Formation permissions in addition to Glue IAM permissions and supports database-, table-, column-, and row-level security

One important detail concerns credentials when using Lake Formation.

If a registered role is configured and credential vending is enabled, principals do not need direct S3 Tables IAM permissions.

This is because Lake Formation issues credentials on behalf of the principal using the registered role.

I have another article covering AWS Lake Formation in detail if you're interested.
Organizing How to Use AWS Lake Formation

Because you can migrate between access control modes as requirements evolve, a practical approach might be to start with IAM-only permissions and later move to Lake Formation for finer-grained control.

If Nothing Is Being Replaced, What Actually Changed?

The position of Glue Data Catalog itself has not changed.

What has changed is the operational layer around data management and table maintenance.

Traditionally, Iceberg on S3 consisted of:

A general-purpose S3 bucket
Glue Data Catalog
Table maintenance mechanisms

Even in the traditional architecture, Glue Table Optimizer can already manage:

Compaction
Snapshot retention
Orphan file deletion

Therefore, the major difference between traditional Iceberg on S3 and S3 Tables is not necessarily the existence of these features, but rather that they are enabled by default.

Item	Traditional Iceberg on S3	S3 Tables
Storage	General-purpose S3 bucket	Dedicated table bucket
Catalog	Glue Data Catalog	Native Iceberg REST Catalog or Glue Data Catalog
Compaction	Glue Table Optimizer (manual enablement)	Managed and enabled by default
Snapshot retention	Glue Table Optimizer (manual enablement)	Managed and enabled by default
Orphan file deletion	Glue Table Optimizer (manual enablement)	Managed and enabled by default

Of course, S3 Tables also provides dedicated resource types, optimized pricing models, and specialized APIs.

However, one of the biggest differences is the reduction in onboarding effort thanks to these capabilities being enabled from the start.

From the perspective of Glue Data Catalog, S3 Tables is simply another catalog source integrated as a federated catalog.

It does not require replacing existing crawler-based Glue Data Catalog environments. Both approaches can coexist.

When Should You Use It?

S3 Tables integration seems particularly well suited for:

Building new Iceberg tables with maintenance enabled by default.
Querying across multiple table buckets from Athena or Redshift.
Leveraging Lake Formation's fine-grained access controls.

Things to keep in mind:

Cross-account scenarios require manual mounting.
Query engines require the s3tablescatalog parent catalog path.
Existing crawler-based Glue Catalog environments do not need to be migrated to S3 Tables.

My Thoughts on the Future

As more specialized storage services like S3 Tables emerge, I believe Glue Data Catalog may evolve beyond being merely a Hive Metastore-compatible catalog and become more of an AWS-wide metadata hub.

Catalog Federation already allows external catalogs such as Snowflake Horizon Catalog and Databricks Unity Catalog to connect under Glue Data Catalog.

This suggests a future where:

The data can reside anywhere, but the catalog entry point is centralized in Glue Data Catalog.

The recently announced Amazon S3 Annotations feature also seems to support this direction.

S3 Annotations allows rich, mutable, and queryable metadata to be attached directly to S3 objects.

When annotation tables are enabled, S3 automatically indexes those annotations into fully managed Apache Iceberg tables that can be queried using Athena.

Interestingly, the official examples reference:

"s3tablescatalog/aws-s3"."b_my_media_bucket"."annotation"

which means that the s3tablescatalog hierarchy is now appearing outside the context of S3 Tables itself.

While S3 Tables turns data into Iceberg tables and integrates them into Glue Data Catalog, S3 Annotations appears to do something similar for object metadata.

AWS has not explicitly stated this direction.

However, when looking at S3 Tables, S3 Metadata, S3 Annotations, and Catalog Federation together, Glue Data Catalog increasingly looks like an AWS-wide metadata plane rather than simply a Hive Metastore-compatible service.

It feels as though AWS is moving toward a future where both data and metadata can be accessed through a common Iceberg-based access model.

If this trend continues, Glue Data Catalog may become even more important as the metadata plane for AWS data services.

The arrival of S3 Tables does not diminish the importance of Glue Data Catalog.

If anything, it clarifies its role as the hub that integrates multiple data sources and metadata sources.

Conclusion

To summarize:

Amazon S3 Tables does not replace AWS Glue Data Catalog.
S3 Tables integrates into Glue Data Catalog as a federated catalog.
Glue Data Catalog may become even more important as a hub that integrates multiple metadata sources.
The catalog hierarchy consists of:

Federated Catalog (s3tablescatalog)
        ↓
Child Catalog (table bucket)
        ↓
Database (namespace)
        ↓
Table

Access control can be implemented using either IAM or Lake Formation and migrated later if necessary.
The major change introduced by S3 Tables is that storage, metadata management, and table maintenance are now provided in a more integrated and managed manner through table buckets.

For building new AWS-native data lakes or lakehouses, S3 Tables is becoming a compelling option.

However, this is not because it replaces Glue Data Catalog.

Rather, it provides a more managed way to operate Iceberg tables on top of a metadata foundation centered around Glue Data Catalog.

I hope this article helps clarify the relationship between Amazon S3 Tables and AWS Glue Data Catalog.

Track Apache Iceberg Schema Changes in AWS Glue Data Catalog with aws glue get-table-versions

Aki — Mon, 15 Jun 2026 00:08:31 +0000

Original Japanese article: Iceberg × Glue Data Catalogのスキーマ変更履歴をaws glue get-table-versionsで確認する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

As Apache Iceberg adoption continues to grow in AWS-based lakehouse architectures, schema evolution has become one of its most valuable features.

At the same time, questions like the following inevitably arise:

When did the schema change?
Which columns were added or removed?
Who made the change?

Although you can view historical schema versions from the AWS Glue console, investigating these details can be cumbersome.

This is where aws glue get-table-versions becomes useful.

When your Apache Iceberg tables are managed through AWS Glue Data Catalog, this command allows you to retrieve schema change history over time.

In this article, I'll walk through the basics of get-table-versions, show how to extract column-level differences with jq, and explain how to identify the person who made a change by combining the results with CloudTrail.

What is `get-table-versions`?

aws glue get-table-versions is an AWS CLI command that retrieves historical versions of a table registered in AWS Glue Data Catalog.

Every time a Glue table definition is updated, a new VersionId is created. Each version stores the schema definition, partition information, table parameters, and other metadata.

In addition to schema tracking, Iceberg-specific parameters such as metadata_location are also recorded, making this command useful for Iceberg operational management.

Isn't the Console Enough?

You can actually view previous schema versions from the AWS Glue console by selecting an older version from the version dropdown in the upper-right corner of the table page.

However, the console only allows you to inspect one version at a time.

It does not show differences between versions, making it difficult to answer questions such as:

Which columns were added or removed?
In which version did the change occur?
When exactly was the schema updated?

If you need to compare multiple versions or quickly investigate schema-related incidents, using the CLI is far more efficient.

Basic Command Syntax

aws glue get-table-versions \
  --database-name <database-name> \
  --table-name <table-name> \
  --no-paginate \
  --region ap-northeast-1

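Note that --no-paginate disables the AWS CLI's automatic pagination: the CLI makes a single request and returns only the first page of results. If the table has more versions than fit in one page, drop the flag so the CLI follows NextToken and retrieves all of them.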

The response contains a TableVersions array. Each element includes:

VersionId
Table (full schema definition)
UpdateTime (timestamp when the Glue table definition was updated)

Example Output

{
  "Table": {
    "Name": "flights_1m",
    "DatabaseName": "icebergdb",
    "UpdateTime": "2026-06-12T05:21:18+00:00",
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "fl_date",
          "Type": "date",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "1",
            "iceberg.field.optional": "true"
          }
        }
      ]
    },
    "Parameters": {
      "metadata_location": "s3://<your-bucket>/warehouse/flights_1m/metadata/00002-....metadata.json",
      "previous_metadata_location": "s3://<your-bucket>/warehouse/flights_1m/metadata/00001-....metadata.json",
      "table_type": "ICEBERG"
    },
    "VersionId": "5"
  }
}

Characteristics of Iceberg Tables

Compared to standard Glue tables, Iceberg tables have several notable characteristics.

First, you may notice that VersionId values become surprisingly large.

Glue VersionId values are not the same as Iceberg Snapshot IDs or commit counts. They increment whenever the Glue table definition stored in the catalog is updated.

Because Iceberg frequently updates metadata_location, Glue table definitions are also updated regularly, causing VersionId to increase much more rapidly than expected.

In one of my test environments, a table had already reached VersionId 318.

However, most of those versions were created by metadata updates associated with data writes rather than actual schema changes.

Another notable characteristic is the presence of iceberg.field.id within each column's Parameters.

This field represents Iceberg's internal column ID, which enables schema evolution features such as column renames without breaking data mapping.
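Since the field ID stays the same when a column is renamed, comparing two Glue table versions by iceberg.field.id can distinguish a rename from a drop followed by an add. The sketch below uses hypothetical helper names and made-up sample columns (the field IDs are invented for illustration). It also assumes that columns marked iceberg.field.current = "false" should be ignored.

```python
def columns_by_field_id(table):
    """Map iceberg.field.id -> column name for the current columns of a Glue table."""
    result = {}
    for column in table["StorageDescriptor"]["Columns"]:
        params = column.get("Parameters", {})
        if "iceberg.field.id" not in params:
            continue
        if params.get("iceberg.field.current", "true") != "true":
            continue  # assumption: skip columns no longer in the current schema
        result[params["iceberg.field.id"]] = column["Name"]
    return result


def detect_renames(old_table, new_table):
    """Return {field_id: (old_name, new_name)} for IDs whose name changed."""
    old = columns_by_field_id(old_table)
    new = columns_by_field_id(new_table)
    return {fid: (old[fid], new[fid]) for fid in old.keys() & new.keys() if old[fid] != new[fid]}


def col(name, field_id):
    """Build a Glue column entry shaped like the example output (illustrative data)."""
    return {"Name": name, "Parameters": {"iceberg.field.id": field_id, "iceberg.field.current": "true"}}


old_table = {"StorageDescriptor": {"Columns": [col("fl_date", "1"), col("delay", "2")]}}
new_table = {"StorageDescriptor": {"Columns": [col("fl_date", "1"), col("dep_delay", "2")]}}

print(detect_renames(old_table, new_table))  # {'2': ('delay', 'dep_delay')}
```

A name-only comparison would report this as one column removed and one added. Matching on the field ID shows it is the same column.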

The table-level Parameters section also contains:

metadata_location
previous_metadata_location

These point to Iceberg metadata files stored in Amazon S3.

Because each Glue table version records the metadata_location that was current at that point, you can trace these files for deeper historical analysis when necessary.

Warning

AWS Glue Data Catalog has service quotas for the number of stored table versions.

In Iceberg environments, it's possible to hit these limits and encounter ResourceNumberLimitExceededException.

Consider periodically removing old versions or using the SkipArchive option of UpdateTable to reduce version growth.
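As a rough illustration of the cleanup idea, here is a minimal Python sketch that decides which VersionId values to remove while keeping the newest ones. The function name plan_version_cleanup, the keep count of 10, and the batch size of 100 are my own assumptions for illustration; check the BatchDeleteTableVersion API reference for the actual per-call limit before relying on it. The sketch only builds the plan and does not call AWS.

```python
from typing import List


def plan_version_cleanup(
    version_ids: List[str], keep: int = 10, batch_size: int = 100
) -> List[List[str]]:
    """Return batches of old VersionIds to delete, keeping the newest `keep`."""
    if keep < 0 or batch_size < 1:
        raise ValueError("keep must be >= 0 and batch_size must be >= 1")
    # VersionId arrives as a string, so sort numerically rather than lexically.
    ordered = sorted(version_ids, key=int)
    # Everything except the newest `keep` versions is a deletion candidate.
    stale = ordered[:-keep] if keep else ordered
    return [stale[i:i + batch_size] for i in range(0, len(stale), batch_size)]


# Hypothetical table that has reached VersionId 318 (assumes no gaps in the IDs).
all_versions = [str(n) for n in range(1, 319)]
batches = plan_version_cleanup(all_versions, keep=10)
print(len(batches), [len(b) for b in batches])
```

Each batch could then be passed to aws glue batch-delete-table-version via --version-ids. I would dry-run the plan first, since deleted table versions cannot be restored.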

Viewing Schema Change History

List Columns by Version

Let's start by displaying the update timestamp and column list for each version.

Using jq, we can extract VersionId, UpdateTime, and the column names:

aws glue get-table-versions \
  --database-name icebergdb \
  --table-name flights_1m \
  --no-paginate \
  --region ap-northeast-1 \
| jq -r '
    .TableVersions[]
    | {
        version: .VersionId,
        updated: .Table.UpdateTime,
        columns: [.Table.StorageDescriptor.Columns[].Name]
      }
    | "\(.version)\t\(.updated)\t\(.columns | join(", "))"
  '

Example output:

5    2026-06-12T05:21:18+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time, arr_time
4    2026-06-12T05:20:48+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time, double
3    2026-06-12T05:20:11+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time
2    2025-09-01T21:27:05+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time, arr_time

Note that VersionId values are not necessarily consecutive.

In this example they happen to be 2 → 3 → 4 → 5, but in production environments they may reach hundreds or even thousands.

Since Glue VersionId values do not directly correspond to Iceberg commits, you should not use them alone to estimate the number of schema changes.

Compare Differences Between Versions

To identify added and removed columns between adjacent versions, you can compare column arrays using the jq array difference (-) operator.

aws glue get-table-versions \
  --database-name icebergdb \
  --table-name flights_1m \
  --no-paginate \
  --region ap-northeast-1 \
| jq -c '
    [ .TableVersions[] | {
        v: .VersionId,
        cols: [.Table.StorageDescriptor.Columns[].Name]
      }
    ]
    | sort_by(.v | tonumber)
    | . as $sorted
    | range(1; length)
    | {
        # from = older version, to = newer version of each adjacent pair
        from: $sorted[. - 1].v,
        to:   $sorted[.].v,
        added:   ($sorted[.].cols - $sorted[. - 1].cols),
        removed: ($sorted[. - 1].cols - $sorted[.].cols)
      }
  '

Example output:

{"from":"2","to":"3","added":[],"removed":["arr_time"]}
{"from":"3","to":"4","added":["double"],"removed":[]}
{"from":"4","to":"5","added":["arr_time"],"removed":["double"]}

This shows that:

arr_time was removed between v2 and v3.
double was added between v3 and v4.
double was removed and arr_time was restored between v4 and v5.

For full transparency, double was not generated automatically by Glue.

It was actually a mistake I made during testing. I intended to add the arr_time column but accidentally entered the data type name double as the column name.

The issue was corrected in v5, but it serves as a useful demonstration that both mistakes and subsequent fixes are preserved in the version history.
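If you prefer to post-process the versions in a script rather than in jq, the same adjacent-version comparison can be sketched in Python. This is only an illustration: the diff_versions helper is hypothetical, and the sample dictionary is typed in by hand from the example output above rather than fetched from Glue.

```python
from typing import Dict, List


def diff_versions(versions: Dict[str, List[str]]) -> List[dict]:
    """Compare each version's column list with the previous version's."""
    # VersionId is a string, so order the keys numerically.
    ordered = sorted(versions, key=int)
    changes = []
    for prev, curr in zip(ordered, ordered[1:]):
        before, after = versions[prev], versions[curr]
        changes.append({
            "from": prev,
            "to": curr,
            "added": [c for c in after if c not in before],
            "removed": [c for c in before if c not in after],
        })
    return changes


base = ["fl_date", "dep_delay", "arr_delay", "air_time", "distance", "dep_time"]
sample = {
    "2": base + ["arr_time"],
    "3": base,
    "4": base + ["double"],
    "5": base + ["arr_time"],
}
changes = diff_versions(sample)
for change in changes:
    print(change)
```

Recording both ends of each comparison makes it unambiguous which pair of versions a change belongs to, which matters once VersionId values are no longer consecutive.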

Find Versions Containing a Specific Column

If you need to answer a question such as:

"Which versions contained this column?"

you can combine select with an exact-match test in jq. (contains looks tempting here, but it matches strings by substring, so searching for arr would also match arr_time.)

TARGET_COLUMN="double"  # column name to search for

aws glue get-table-versions \
  --database-name icebergdb \
  --table-name flights_1m \
  --no-paginate \
  --region ap-northeast-1 \
| jq --arg col "$TARGET_COLUMN" '
    .TableVersions[]
    | select(
        .Table.StorageDescriptor.Columns
        | any(.Name == $col)
      )
    | {version: .VersionId, updated: .Table.UpdateTime}
  '

Example output:

{
  "version": "4",
  "updated": "2026-06-12T05:20:48+00:00"
}

This confirms that the column double existed only in version 4.

Identify Who Changed the Schema with CloudTrail

While get-table-versions tells you when a schema changed, it does not tell you who made the change.

To identify the responsible user or role, you can correlate the schema update time with CloudTrail UpdateTable events.

Once you've identified the relevant timestamp, search CloudTrail around that period:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=UpdateTable \
  --start-time "2026-06-12T05:20:18+00:00" \
  --end-time "2026-06-12T05:21:18+00:00" \
  --region ap-northeast-1 \
| jq '.Events[] | {time: .EventTime, user: .Username, detail: .CloudTrailEvent | fromjson | .requestParameters}'

Example output:

{
  "time": "2026-06-12T05:21:18+00:00",
  "user": "XXXXX",
  "detail": {
    "catalogId": "123456789012",
    "databaseName": "icebergdb",
    "tableInput": {
      "name": "flights_1m"
    }
  }
}

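The user field identifies the IAM user responsible for the update. For an assumed role, it typically shows the role session name; the full ARN is available under userIdentity in the CloudTrailEvent payload.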

Additionally, detail.tableInput contains the updated table definition, allowing you to inspect the actual schema change directly from CloudTrail.

In many cases, reviewing UpdateTable events is sufficient.

However, depending on the tool or workflow being used, changes may also appear as:

CreateTable
BatchCreatePartition
Lake Formation-related events

If you cannot find the expected event, try expanding your search criteria.
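One simple way to expand the search is to widen the time window around the UpdateTime reported by get-table-versions. The following Python sketch computes such a window; the lookup_window name and the 60-second default margin are assumptions of mine, not something prescribed by CloudTrail.

```python
from datetime import datetime, timedelta
from typing import Tuple


def lookup_window(update_time: str, margin_seconds: int = 60) -> Tuple[str, str]:
    """Return (start, end) ISO 8601 timestamps surrounding a Glue UpdateTime."""
    ts = datetime.fromisoformat(update_time)
    margin = timedelta(seconds=margin_seconds)
    # Search on both sides of the timestamp so that small differences between
    # the Glue UpdateTime and the CloudTrail EventTime do not hide the event.
    return (ts - margin).isoformat(), (ts + margin).isoformat()


start, end = lookup_window("2026-06-12T05:21:18+00:00")
print(start, end)
```

The two values can be passed to --start-time and --end-time of aws cloudtrail lookup-events. If nothing turns up, increasing margin_seconds is usually the first thing I would try.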

Conclusion

In this article, we explored how to use aws glue get-table-versions to track schema changes in AWS Glue Data Catalog.

With this approach, you can:

Review schema history chronologically
Compare column additions and removals between versions
Identify which versions contained specific columns
Determine who made a change by correlating with CloudTrail

Glue Data Catalog is often viewed simply as a metadata catalog for current table definitions.

However, by leveraging Table Versions, it can also serve as a lightweight audit mechanism.

Because schema evolution is a fundamental feature of Apache Iceberg, understanding how to answer questions such as:

When did the schema change?
What changed?
Who changed it?

can be extremely valuable during troubleshooting and day-to-day operations.

I hope this article helps anyone managing Apache Iceberg tables with AWS Glue Data Catalog.

Organizing How to Use AWS Lake Formation

Aki — Mon, 08 Jun 2026 02:24:47 +0000

Original Japanese article: AWS Lake Formationの使い方について整理してみる

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

Previously, I wrote an article titled Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies.
In that article, I mentioned that "AWS Lake Formation is necessary to complement data governance" but did not go into detail because it was outside the scope of the article.

This time, I'd like to organize my thoughts on Lake Formation, covering everything from the fundamentals to practical usage patterns.

Lake Formation is often perceived as a service that is "somewhat difficult" or "unnecessary because IAM is enough."
However, once you start implementing proper access control for a data lake, the necessity of Lake Formation becomes much clearer.

I hope this article helps you evaluate whether Lake Formation is worth adopting in your environment.

What Is Lake Formation?

AWS Lake Formation is a service that provides access management and governance for data lakes.

It allows you to centrally manage who can access which data and at what level.
One of its key strengths is the ability to manage access controls consistently across multiple AWS services such as Athena, Glue, and Redshift Spectrum.

Although they are often confused, Lake Formation and Glue Data Catalog serve different purposes.

Service	Role
Glue Data Catalog	A technical catalog that manages metadata such as schemas and partitions
Lake Formation	A governance layer that manages access permissions for data registered in the Glue Data Catalog

Amazon S3 (Actual Data)
        ↓
Glue Data Catalog (Metadata Management)
        ↓
Lake Formation (Access Control)
        ↓
Athena / Glue Job / Redshift Spectrum

In other words, data resides in S3, Glue Data Catalog manages metadata, and Lake Formation provides access control on top of that metadata layer.

How Lake Formation Differs from IAM

Isn't IAM Enough?

When managing a data lake on S3 using IAM alone, several challenges emerge:

Granularity limitations: IAM primarily operates at the bucket or prefix level, making table-, column-, and row-level access control difficult.
Operational complexity: As users and roles increase, S3 bucket policies and IAM policies become increasingly difficult to manage.
Cross-account sharing: Implementing data sharing across AWS accounts using only IAM can lead to complicated designs.
Limited visibility for auditing: It is hard to see at a glance who can access which tables.

Typical examples include:

More than ten Athena users need different levels of access, making permission management increasingly complicated.
Different departments should see different subsets of data. For example, the sales department should only see Eastern Japan sales, while executives can see all data.
Personally identifiable information (PII) such as email addresses and credit card numbers should be hidden from analysts.
Data needs to be shared with another AWS account.

Lake Formation addresses these challenges.

What Lake Formation Solves

With Lake Formation, you can implement:

Fine-grained table-, column-, and row-level access control
Permission management at the Glue Data Catalog database and table level
Tag-based access control (LF-TBAC) for large-scale environments
Cross-account data sharing through AWS RAM
Centralized auditing through CloudTrail integration

The Relationship Between IAM and Lake Formation

Lake Formation does not replace IAM; it works as an additional layer on top of IAM.

When a query is executed (for example, through Athena), access is granted only if both conditions are satisfied:

IAM Permission
        AND
Lake Formation Permission
        ↓
Access Allowed

Even if permissions are granted in Lake Formation, access is denied if IAM blocks it.

Likewise, even if IAM allows access, the request is denied if the corresponding Lake Formation permissions are missing.

Understanding this "AND" relationship is the foundation of permission design.
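The AND relationship can be written down as a tiny truth table. The Python sketch below is only a mental model of the rule described above, not how AWS evaluates requests internally; is_access_allowed is a hypothetical name.

```python
from itertools import product


def is_access_allowed(iam_allows: bool, lf_allows: bool) -> bool:
    """Access requires both the IAM check and the Lake Formation check to pass."""
    return iam_allows and lf_allows


# Enumerate every combination of the two checks.
matrix = {
    (iam, lf): is_access_allowed(iam, lf)
    for iam, lf in product([True, False], repeat=2)
}
for (iam, lf), allowed in matrix.items():
    print(f"IAM={iam!s:5} LF={lf!s:5} -> {'allowed' if allowed else 'denied'}")
```

Only one of the four combinations results in access, which is why a missing grant on either side shows up as the same access-denied symptom.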

Lake Formation Permission Model

Lake Formation permissions are managed across multiple levels.

Level	Target	Example Permissions
Data Lake Administrator	Entire Lake Formation environment	Full permissions
Database Level	Glue Data Catalog database	CREATE_TABLE, DROP
Table Level	Individual table	SELECT, INSERT, ALTER
Column Level	Specific columns within a table	SELECT on selected columns
Row Level	Rows matching specific conditions	SELECT on filtered rows

Permissions can be granted or revoked through the console, CLI, or SDK.

# Example: Grant SELECT permission on a table
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "Table": {
      "DatabaseName": "mydb",
      "Name": "sales_table"
    }
  }'

Column-Level and Row-Level Access Control

One of Lake Formation's strongest capabilities is fine-grained access control beyond the table level.

Both column-level and row-level security are implemented using a mechanism called Data Filters.
You create Data Filters in the console, or through the CLI or API, and reference them when granting permissions.

Column-Level Security

Access can be restricted to specific columns.

Suppose the customer table contains the following columns:

customer_id	name	email	credit_card	purchase_amount

You could allow analysts to access only customer_id, name, and purchase_amount, while hiding email and credit_card.

This can be achieved simply by specifying included or excluded columns in a Data Filter.
Excluded columns will not appear in Athena query results.

Row Filters

Row-level filters allow access only to rows matching specific conditions.

Filter expressions are written using PartiQL WHERE-clause syntax.

For example, if the sales table contains a region column and the Eastern Japan team should only see rows where region = 'east', you can create the following Data Filter:

aws lakeformation create-data-cells-filter \
  --table-data '{
    "TableCatalogId": "123456789012",
    "DatabaseName": "mydb",
    "TableName": "sales",
    "Name": "east-region-filter",
    "RowFilter": {
      "FilterExpression": "region = '\''east'\''"
    },
    "ColumnWildcard": {}
  }'

Combining column filters and row filters enables cell-level security, where users can access only specific columns within specific rows.
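To make the idea of cell-level security concrete, here is a toy Python model that applies a row condition and a column list to in-memory rows. It only imitates the visible effect of a Data Filter; the real filtering is performed by Lake Formation and the query engine, and apply_data_filter plus the sample rows are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def apply_data_filter(
    rows: List[Row],
    row_predicate: Callable[[Row], bool],
    included_columns: List[str],
) -> List[Row]:
    """Keep rows that satisfy the predicate, then keep only the allowed columns."""
    return [
        {col: row[col] for col in included_columns}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical sales rows, including a PII column that analysts must not see.
sales = [
    {"customer_id": 1, "email": "a@example.com", "region": "east", "purchase_amount": 120},
    {"customer_id": 2, "email": "b@example.com", "region": "west", "purchase_amount": 80},
    {"customer_id": 3, "email": "c@example.com", "region": "east", "purchase_amount": 45},
]

visible = apply_data_filter(
    sales,
    row_predicate=lambda r: r["region"] == "east",  # mirrors region = 'east'
    included_columns=["customer_id", "purchase_amount"],
)
print(visible)
```

The caller sees neither the west rows nor the email column, which is the combined row-and-column restriction that cell-level security describes.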

Data Filter Limitations

According to the official documentation:

Up to 100 filters per principal
array and map types are not supported in filter expressions (struct types can be used in row filters)
Cell-level security does not support nested columns, views, or resource links
Cell-level security is available in all regions when using Athena Engine Version 3 or Redshift Spectrum

Common Use Cases

Protecting PII such as email addresses and credit card numbers
Restricting business data by department or geographic region
Compliance requirements for regulated data

Tag-Based Access Control (LF-TBAC)

As the number of databases and tables grows, managing permissions table by table becomes increasingly difficult.

LF-TBAC (Lake Formation Tag-Based Access Control) addresses this problem.

What Are LF-Tags?

LF-Tags are key-value tags unique to Lake Formation.

They are separate from both S3 resource tags and IAM tags and are managed independently within Lake Formation.

aws lakeformation create-lf-tag \
  --tag-key "sensitivity" \
  --tag-values '["public", "internal", "confidential"]'

Tagging Resources and Mapping Permissions

LF-Tags can be assigned to databases, tables, and columns.

aws lakeformation add-lf-tags-to-resource \
  --resource '{"Table": {"DatabaseName": "mydb", "Name": "sales"}}' \
  --lf-tags '[{"TagKey": "sensitivity", "TagValues": ["internal"]}]'

Permissions are then granted based on tags rather than table names.

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "LFTagPolicy": {
      "ResourceType": "TABLE",
      "Expression": [{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}]
    }
  }'

This grants SELECT access to all tables tagged with either sensitivity=public or sensitivity=internal.

When new tables are created, simply assigning the appropriate LF-Tag automatically applies the correct permissions.
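The matching rule behind a tag expression can be sketched in a few lines of Python. This is a simplified model under my own assumptions (one value per tag key on each table, hypothetical table names), meant only to show why a newly tagged table is picked up without a new grant.

```python
from typing import Dict, List


def matches_expression(resource_tags: Dict[str, str], expression: List[dict]) -> bool:
    """A resource matches when, for every tag key in the expression,
    its value for that key is one of the listed values."""
    return all(
        resource_tags.get(cond["TagKey"]) in cond["TagValues"]
        for cond in expression
    )


# Same expression as in the grant above.
expression = [{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}]

# Hypothetical tables and their LF-Tags.
tables = {
    "sales": {"sensitivity": "internal"},
    "products": {"sensitivity": "public"},
    "customers_pii": {"sensitivity": "confidential"},
    "scratch": {},  # no LF-Tag assigned yet
}

accessible = [name for name, tags in tables.items() if matches_expression(tags, expression)]
print(accessible)
```

Tagging scratch with sensitivity=internal would make it match immediately, without touching the grant itself.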

Benefits in Large-Scale Environments

In environments with dozens or hundreds of tables, table-by-table permission management becomes unrealistic.

LF-TBAC enables a simpler model:

Roles can access data with specific tags.

However, tag design should be carefully planned from the beginning.
Defining categories such as sensitivity, domain, and owner early on can save significant effort later.

Integration with Glue Data Catalog

Lake Formation works closely with Glue Data Catalog.

Glue Data Catalog manages metadata, while Lake Formation governs access to that metadata.
Together they enable secure sharing and consumption of data stored in S3.

How Lake Formation Works with Data Catalog

When Lake Formation is enabled, access to Glue Data Catalog is routed through Lake Formation authorization checks.

This means that access to metadata itself—such as table definitions—can also be controlled.

Granting Lake Formation Permissions to Glue Jobs

When a Glue Job accesses data governed by Lake Formation, permissions must be granted not only through IAM but also through Lake Formation.

This is a common pitfall.

A typical issue is:

IAM permissions look correct, but the Glue Job still cannot read data.

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/glue-job-role \
  --permissions SELECT \
  --resource '{
    "Table": {"DatabaseName": "mydb", "Name": "source_table"}
  }'

Cross-Account Sharing

Lake Formation supports cross-account data sharing through AWS RAM (Resource Access Manager).

Users in the target account can query shared tables directly from their own Athena environment.

Because Lake Formation permissions—including column and row filters—remain enforced, scenarios such as sharing data while excluding sensitive columns are supported.

To use cross-account sharing, the Data Catalog Cross Account Version setting must be configured to Version 3 or later.

Version 3 enables direct sharing with IAM principals in other accounts.
Version 4 adds support for hybrid access mode in cross-account scenarios.

Integration with Athena and Redshift Spectrum

Authorization Flow During Query Execution

When Athena accesses a Lake Formation-managed table:

A user executes a query in Athena.
Athena requests table metadata from Glue Data Catalog.
Lake Formation validates permissions.
If authorized, access to data in S3 is allowed.
Column and row filters are applied before results are returned.

This enables fine-grained access control without modifying S3 bucket policies.

Redshift Spectrum Integration

Since Redshift Spectrum also relies on Glue Data Catalog, Lake Formation permissions are enforced there as well.

This makes it easier to maintain consistent access control across Athena and Redshift Spectrum.

Adoption Challenges and Realistic Operations

Existing Environments: IAMAllowedPrincipals and Hybrid Access Mode

To preserve backward compatibility, Lake Formation grants the IAMAllowedPrincipals group Super permissions on existing Data Catalog resources by default.

In this state, access is effectively controlled by IAM alone, and Lake Formation's fine-grained controls are not enforced.

To fully leverage Lake Formation, these permissions must eventually be removed and replaced with explicit Lake Formation permissions.

However, switching everything at once can break existing workloads.

This is where Hybrid Access Mode becomes useful.

When registering S3 locations, Hybrid Access Mode allows selected principals to opt into Lake Formation authorization while other principals continue using IAM-only access.

This approach minimizes risk and enables gradual migration.

Personally, I believe this is the most practical approach for existing environments.

Common Pitfalls

Forgetting Lake Formation Permissions for Glue Jobs

As mentioned earlier, forgetting to grant Lake Formation permissions to Glue Job roles prevents ETL jobs from reading or writing data.

Many "it should work but doesn't" permission issues ultimately trace back to this.

I've forgotten it myself a few times and ended up scrambling to find the root cause.

Interaction with S3 Bucket Policies

Lake Formation does not override S3 bucket policies.

Even if access is granted in Lake Formation, requests are denied if the bucket policy blocks them.

When adopting Lake Formation, bucket policies must be designed to allow access through Lake Formation-authorized service roles.

Maintaining consistency among IAM, Lake Formation, and S3 bucket policies is critical.

Changing the design later can become painful, so it's worth thinking through carefully from the beginning.

Configuring Data Lake Administrators

When enabling Lake Formation for the first time, at least one Data Lake Administrator must be configured.

Relying on a single administrator can become an operational bottleneck, so I recommend assigning multiple administrators.

Athena Workgroups

When Athena Workgroups are used together with Lake Formation, behavior may vary depending on Workgroup configuration.

In particular, don't forget to grant permissions to the S3 bucket used for query results.

This is another thing I occasionally forget myself.

Incremental Adoption Strategy

For new environments, enabling Lake Formation from day one is usually the best option and saves trouble later.

For existing environments, a phased approach tends to work better.

I've migrated an existing environment this way before, and while it's certainly possible, it's somewhat tedious.

Step 1: Gradual Opt-In with Hybrid Access Mode

Register S3 locations using Hybrid Access Mode
Opt in selected principals
Keep IAM-only access for others
Monitor access through CloudTrail

Step 2: Use Lake Formation for New Tables

Manage permissions for newly created tables through Lake Formation
Leave existing tables under IAMAllowedPrincipals

Step 3: Migrate Existing Tables

Gradually revoke IAMAllowedPrincipals permissions
Replace them with Lake Formation permissions
Validate behavior after each migration step

Where Lake Formation Excels—and Where It Doesn't

Lake Formation is particularly valuable for:

Fine-grained table-, column-, and row-level access control
Consistent authorization across Athena, Glue, and Redshift Spectrum
Scalable permission management using LF-TBAC
Cross-account data sharing

However, some areas remain outside its scope:

Direct access control to raw files in S3
Business metadata management
Data quality management

Relationship with Amazon DataZone

As discussed in my previous article, Lake Formation and DataZone have complementary responsibilities.

Service	Role
Lake Formation	Technical governance (who can access what)
Amazon DataZone	Business governance (discovering, understanding, and requesting data)

A useful way to think about them is:

Lake Formation = Technical foundation for governance
DataZone = Business foundation for governance

Combined with Glue Data Catalog, these services form a comprehensive data catalog and governance solution on AWS.

Conclusion

In this article, I reviewed AWS Lake Formation from its fundamentals through practical implementation patterns.

While there is a learning curve, it is an extremely important service for implementing proper data governance.

Key takeaways:

Lake Formation complements IAM rather than replacing it, adding fine-grained table-, column-, and row-level controls.
Column, row, and cell-level security are implemented through Data Filters.
LF-TBAC reduces operational overhead as the number of tables grows.
Lake Formation integrates tightly with Glue Data Catalog by adding a governance layer on top of metadata management.
Understanding IAMAllowedPrincipals and using Hybrid Access Mode for gradual adoption is essential in existing environments.

Lake Formation certainly introduces some complexity, but when implementing proper access control in a data lake, the limitations of IAM alone eventually become apparent.

In environments where data is consumed by multiple teams and a wide variety of users, Lake Formation is well worth considering.

That said, successful adoption depends on maintaining consistency across IAM, Lake Formation, and S3 bucket policies, so careful planning is essential.

I hope this article helps anyone considering the adoption of Lake Formation.

Rethinking Lakehouse Architecture Through Data Ownership: AWS vs. Snowflake

Aki — Mon, 01 Jun 2026 13:31:04 +0000

Original Japanese article: データの主導権から考えるAWSとSnowflakeのレイクハウスアーキテクチャ

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).
When designing a data platform, discussions about whether to lean toward AWS or Snowflake are still very common.

However, with the rise of Apache Iceberg, data and platforms can now be decoupled. Because of this shift, I believe we need to reconsider the question itself.

Rather than asking:

Should we build around AWS or Snowflake?

A more fundamental question is:

Who owns the data?

In this article, I'd like to define what I mean by data ownership and explore the architectural trade-offs of AWS-centric and Snowflake-centric lakehouse designs.

Why Data Ownership Matters

Apache Iceberg has made it possible to separate data from the platform that accesses it.

Today, an Iceberg table stored on Amazon S3 can be accessed from Athena, Snowflake, Spark, and many other engines. As a result, choosing a product is becoming less important than deciding who is responsible for managing the data.

Before diving into architectural patterns, let's first examine why this shift matters.

Defining Ownership Across Three Layers

In this article, I define data ownership through the following three layers:

Layer	Question	Example
Catalog Ownership	Who owns the metadata?	Glue Data Catalog / Snowflake Open Catalog
Write Ownership	Who can update or delete data?	Glue ETL / Snowflake DML
Governance Ownership	Who controls access policies?	Lake Formation / Snowflake Horizon

Only when these three layers are consistently controlled by the same authority can we truly say that ownership exists.

Conversely, when ownership is distributed or unclear, complexity tends to emerge in architecture, operations, and security.
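One way to make this check explicit during design reviews is to write the three layers down as data. The Python sketch below is just a thinking aid; the OwnershipModel class and the example assignments are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class OwnershipModel:
    """Who controls each of the three ownership layers."""
    catalog: str
    write: str
    governance: str

    def owners(self) -> Set[str]:
        return {self.catalog, self.write, self.governance}

    def is_consistent(self) -> bool:
        # Ownership is clear only when a single authority controls all layers.
        return len(self.owners()) == 1


aws_centric = OwnershipModel(catalog="AWS", write="AWS", governance="AWS")
mixed = OwnershipModel(catalog="AWS", write="Snowflake", governance="AWS")

print(aws_centric.is_consistent(), mixed.is_consistent())
print(sorted(mixed.owners()))
```

A mixed result is not necessarily wrong, but it flags a boundary where responsibilities need to be agreed on explicitly.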

The Reality of Vendor Lock-In

Even in the Iceberg era, platform dependencies have not disappeared—they have simply changed form.

Catalog dependency: Tables managed by Snowflake Open Catalog still rely operationally on a Snowflake-managed service, although external engines can access them through the REST Catalog API.
Write-engine dependency: Snowflake-managed Iceberg tables are primarily updated through Snowflake, though Horizon Catalog now supports external writes from engines such as Spark. The choice of write engine remains closely tied to catalog design.
Governance dependency: Lake Formation's fine-grained permissions are fundamentally tied to the AWS ecosystem.

Therefore, saying that "Iceberg eliminates vendor lock-in" is only partially true.

What Iceberg removes is storage-format lock-in.

Dependencies around catalog management, governance, and operational processes still remain. In practice, migrating a data platform involves challenges such as governance policies, access control, metadata management, and platform-specific features.

Extensibility and Strategic Flexibility

Data platforms are never finished.

The rapid evolution of AI technologies and the continuous changes in the Modern Data Stack mean that architectures must adapt over time.

Common examples include:

Adding or changing analytics tools: Athena may be sufficient initially, but business users may later request Snowflake access.
Introducing AI workloads: Integration with SageMaker or Snowflake Cortex AI may become necessary.
Cost optimization initiatives: As query volumes grow, Snowflake compute costs may become significant, leading teams to move batch processing to EMR.
Stronger governance requirements: Column masking or row-level security may need to be introduced later.

When ownership across the three layers is clearly defined from the beginning, these changes become easier to evaluate and implement.

Without that clarity, every change raises new questions about where responsibilities and controls should reside.

What Changed After Iceberg?

Historically, data and platforms were tightly coupled.

	Snowflake-Centric	AWS-Centric
Data Location	Inside Snowflake	Inside S3
Management Ownership	Snowflake owns everything	AWS owns everything
Access from Other Engines	Not possible	Snowflake could not access directly

Iceberg fundamentally changed this model.

Iceberg Tables on S3
        ↓
Shared by Multiple Engines

Athena / Glue / Snowflake / Spark / Redshift ...

Iceberg adds a metadata layer on top of Parquet files stored in object storage, enabling ACID transactions and schema evolution independent of any specific compute engine.

A catalog tracks metadata such as schemas and active data files, allowing multiple engines to safely access the same table.

Data files are now shareable.

However, ownership of the catalog, write operations, and governance still depends on architectural decisions.

In other words, deciding who manages the catalog effectively determines who owns the data.

Major Iceberg Catalog Options

Catalog	Type	Characteristics
AWS Glue Data Catalog	AWS-managed	Supports REST Catalog API and integrates with Lake Formation governance
Snowflake Open Catalog	Snowflake-managed (based on Apache Polaris)	REST Catalog compliant and accessible from Spark, Trino, and others
Snowflake Horizon Catalog	Snowflake service	Exposes Snowflake-managed Iceberg tables through APIs; differs from Open Catalog because it is not a standalone metadata store

Snowflake-Centric Architecture

Characteristics

In this approach, Snowflake becomes the center of catalog management, governance, and analytics, while data files remain in external object storage such as S3.

This model prioritizes simplicity and a streamlined analytics experience.

Ownership Model

Layer	Owner
Catalog Ownership	Snowflake Open Catalog or Horizon Catalog
Write Ownership	Primarily Snowflake DML
Governance Ownership	Snowflake Horizon

Although data files remain on S3, external engines can access Snowflake-managed Iceberg tables through two mechanisms:

Via Open Catalog: Snowflake-managed Iceberg tables are synced to Open Catalog and exposed through the REST Catalog API. In this sync scenario, external engines have read-only access. (Note: when Open Catalog itself is used as an internal catalog, read/write access is supported.)
Via Horizon Catalog: Tables are exposed directly through the Horizon Iceberg REST Catalog API without syncing to Open Catalog. External engines can both read and write, and existing Snowflake users and roles can be used for access control.

Benefits

Governance policies such as column masking and row-level security can be applied to Iceberg tables in the same way as native Snowflake tables. When external engines access tables through Horizon Catalog, the same policies are enforced at read time. Note, however, that writing to tables with masking policies or tags applied is not supported from external engines — this is an important constraint to be aware of.
Rich ecosystem support for BI tools such as Power BI makes Snowflake a convenient analytics front end.
External engines can access Iceberg tables through Open Catalog or Horizon Catalog while reusing Snowflake users and roles as the unit of access control.

Drawbacks

Snowflake warehouse compute costs can be significant for write-heavy workloads. When external engines such as Spark write through Horizon Catalog, Snowflake warehouses are not used — but Horizon Catalog API calls are billed at 0.5 credits per million requests, so cost planning is still required.
Coordination is needed when AWS services such as Glue ETL also write to the same datasets. Clearly defining who holds catalog ownership is essential.

Even with Iceberg, many enterprises ultimately converge on a Snowflake-centric operating model because governance, metadata, and write operations all remain concentrated within Snowflake.

In such cases, Iceberg provides openness in theory, but ownership remains firmly within the Snowflake ecosystem.

AWS-Centric Architecture

Characteristics

This architecture uses S3 for storage, Glue Data Catalog for metadata, and AWS-native services for ETL, analytics, and governance.

Its primary advantages are flexibility and service interoperability.

Ownership Model

Layer	Owner
Catalog Ownership	AWS Glue Data Catalog
Write Ownership	Glue ETL / EMR
Governance Ownership	Lake Formation

Because Glue Data Catalog supports the Iceberg REST Catalog API, external engines such as Snowflake and Databricks can access the same tables.

This enables AWS to retain ownership while allowing Snowflake to serve as an analytics front end.

Benefits

Tight integration across Athena, Glue, EMR, and Redshift with a shared catalog.
Fine-grained column- and row-level governance through Lake Formation, applicable to Iceberg tables.
Ability to optimize compute engines for different workloads — EMR for large-scale batch, Athena for interactive queries.

Drawbacks

Increased architectural and operational complexity due to the number of AWS services involved.
Additional design considerations for multi-cloud environments, as the catalog remains AWS-dependent.

Lake Formation is powerful, but troubleshooting permission issues can become challenging. Identifying why a specific user cannot access a specific table or row often takes considerable time, requiring mature operational practices and careful permission design.

Combining AWS and Snowflake

A realistic approach is not choosing one platform over the other, but assigning clear responsibilities to each.

The key is defining ownership boundaries upfront.

AWS Owns the Data, Snowflake Powers Analytics

This is one of the most common patterns.

The goal is to maintain data ownership within AWS while leveraging Snowflake's analytics capabilities and its rich ecosystem of BI connectors.

┌──────────────────────────────────────────────────┐
│                       AWS                        │
│  S3 (Iceberg data files)                         │
│  Glue Data Catalog (Catalog Ownership)           │
│  Lake Formation (Governance Ownership)           │
│  Glue / EMR (Write Ownership)                    │
└──────────────────────┬───────────────────────────┘
                       │ Iceberg REST Catalog API
        ┌──────────────┼───────────────────┐
        ▼              ▼                   ▼
     Athena          Glue              Snowflake
  (Interactive)     (ETL)            (Analytics)

In this model:

Layer	Owner
Catalog Ownership	AWS
Write Ownership	AWS
Governance Ownership	AWS

Snowflake acts primarily as an analytical interface.

Two variations exist:

1. Glue Catalog Integration (Read-Only)

Snowflake accesses AWS-managed Iceberg tables through External Iceberg Tables. Write ownership and governance remain entirely with AWS. Lake Formation can be used as the single source of truth for access control.

2. Catalog-Linked Database (Read/Write)

Snowflake can update Iceberg tables through the Iceberg REST Catalog API while the data remains stored on S3. This approach is attractive when analysts and AI workloads primarily operate in Snowflake.

However, governance responsibilities become shared between AWS and Snowflake. Both Lake Formation and Snowflake-side access controls must be configured carefully — a misconfiguration in either can become a security gap. If the read-only pattern (option 1) is sufficient, consolidating governance in Lake Formation is simpler.

For step-by-step implementation details of these patterns — including how to set up External Volumes, Catalog Integrations, and Catalog-Linked Databases — see this companion article:

AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns

Comparison: Three Architectural Patterns

Dimension	Snowflake-Centric	AWS-Centric	Hybrid
Catalog Ownership	Snowflake	AWS	AWS
Write Ownership	Snowflake	AWS	AWS
Governance Ownership	Snowflake Horizon	Lake Formation	AWS primary (①) / AWS+SF (②)
Compute Cost	Tends to be higher	Optimizable by workload	Optimizable by workload
Operational Complexity	Low to medium	Medium to high	High
Multi-Engine Flexibility	Medium (via REST API)	High	High

Choosing the Right Pattern

Based on the patterns above, here is a simplified decision guide:

Snowflake-centric tends to fit when:

Analytics is BI-driven or led by non-engineers
Development speed and analytics experience take priority over data volume
Centralized governance through Snowflake Horizon is preferred

AWS-centric tends to fit when:

Data volumes are large and ETL is the dominant workload
A dedicated data engineering team is already working within the AWS ecosystem
Fine-grained access control through Lake Formation is a requirement

Hybrid tends to fit when:

Different teams use different tools (e.g., engineers on AWS, analysts on Snowflake)
Future extensibility for AI, ML, or multi-engine workloads is a priority
AWS retains data ownership while Snowflake's query performance is still needed

What Happens When Ownership Is Unclear

A common anti-pattern is building a platform that "works" without explicitly defining ownership.

Typical symptoms include:

Nobody knows who is responsible for schema changes. When both Glue and Snowflake have schema owners, it becomes unclear which definition is authoritative.
Data written from Snowflake is not visible in Athena. When two catalogs attempt to manage the same table, one may lose track of the latest snapshot, causing metadata inconsistencies.
Governance rules drift between Lake Formation and Snowflake Horizon. Maintaining access policies in two places creates risk — a gap in either becomes a security vulnerability.
Incident response slows down. When multiple engines can write, identifying what happened and where becomes difficult, delaying recovery.

These issues often evolve from technical challenges into organizational problems:

Teams blame each other over unclear responsibilities.
Audits become difficult because nobody can fully explain who has access to what.
Incident recovery is delayed due to unclear decision-making authority.

A running system is not necessarily a well-designed system.

Ownership becomes increasingly difficult to fix after the platform has already grown.

"AWS or Snowflake?" Is a Secondary Question

In practice, organizations often begin by debating whether to standardize on AWS or Snowflake.

In the Iceberg era, I believe that is the wrong starting point.

The first questions should be:

Who owns the catalog?
Who owns writes?
Who owns governance?

Once these three ownership layers are defined, the platform choice naturally follows.

Want all three owned by Snowflake? → Snowflake-centric architecture.
Want all three owned by AWS? → AWS-centric architecture.
Want AWS to own data while Snowflake provides analytics? → Hybrid architecture.
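
As a small illustration of "ownership first, platform second," the mapping above can be written as a function. This is only a sketch of the reasoning in this article. The `Ownership` class, its field values, and the `snowflake_analytics` flag are names I made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ownership:
    catalog: str       # "aws" or "snowflake"
    writes: str        # "aws" or "snowflake"
    governance: str    # "aws", "snowflake", or "shared"
    snowflake_analytics: bool = False  # Snowflake used as an analytics front end


def suggest_pattern(o: Ownership) -> str:
    """Map the three ownership decisions to one of the patterns in this article."""
    if (o.catalog, o.writes, o.governance) == ("snowflake", "snowflake", "snowflake"):
        return "Snowflake-centric"
    if o.catalog == "aws" and o.writes == "aws":
        if o.governance == "aws" and not o.snowflake_analytics:
            return "AWS-centric"
        if o.governance in ("aws", "shared") and o.snowflake_analytics:
            return "Hybrid"
    # Anything else means ownership is split in a way no pattern covers.
    return "Unclear: revisit ownership boundaries"


print(suggest_pattern(Ownership("aws", "aws", "aws", snowflake_analytics=True)))  # Hybrid
```

The useful part is the last branch: if a combination does not map to a pattern, that is a signal to settle ownership before choosing products.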

Iceberg has dramatically increased flexibility around where data lives.

As flexibility increases, architects must become more deliberate about defining responsibility.

Starting with product selection often leads to contradictions later. A configuration where Snowflake is used as the query interface, Glue handles writes, and Lake Formation controls governance — without intentional design — is a classic symptom of ownership being distributed and unclear from the start.

The hardest challenge is no longer connectivity.

It is ownership.

Conclusion

Apache Iceberg has significantly reduced storage-level vendor lock-in.

However, catalog ownership, write ownership, and governance ownership still require deliberate architectural decisions.

A useful decision-making sequence is:

Decide who owns the catalog. (Glue / Snowflake Open Catalog / Snowflake Horizon)
Decide who owns writes. (AWS-native services / Snowflake)
Decide who owns governance. (Lake Formation / Snowflake Horizon / both)

Once those three decisions are made, choosing between AWS and Snowflake becomes much easier. From there, you can design the architecture that best fits your requirements.

Ultimately, the hardest part of a modern lakehouse architecture is often not the technology itself. It is agreeing on ownership boundaries — deciding which team manages the catalog, who is responsible for data updates, and where governance policies are enforced.

Technology evolves. The challenge of people and processes remains.

I hope this article helps anyone evaluating lakehouse architectures built on AWS, Snowflake, and Apache Iceberg.

Exploring Snowpark While Comparing It with Apache Spark

Aki — Mon, 01 Jun 2026 04:00:00 +0000

Original Japanese article: Snowparkを動かしながらSparkとの違いを整理してみる

Introduction

Recently, I've had more opportunities to work with Snowflake when building data platforms.

When working with modern data platforms, Apache Spark is often used for distributed data processing. Snowflake also provides its own data processing framework called Snowpark.

If you're already familiar with Spark or AWS Glue, you may find yourself wondering:

"Wait... how is Snowpark actually different from Spark?"

In this article, I'd like to organize my own understanding while exploring Snowpark's behavior and comparing it with Spark.

For this experiment, everything was done entirely within Snowflake Notebooks in Snowsight.

One of the biggest advantages is that no local environment setup or connection configuration is required—you can start experimenting immediately.

What is Snowpark?

Snowpark is a data processing framework provided by Snowflake.

Its biggest feature is the ability to write code in Python, Java, or Scala and execute it directly inside Snowflake.

Traditionally, Snowflake workloads were primarily implemented using SQL. With Snowpark, however, you can use a DataFrame API similar to Spark or Pandas while keeping all processing within Snowflake.

In other words, you no longer need to pull data into a local environment or AWS Lambda for processing.

Some key characteristics include:

Managed execution environment – Processing runs on Snowflake warehouses with no infrastructure management required.
DataFrame API – Similar developer experience to Spark and Pandas.
Pushdown execution – Code is executed within Snowflake, eliminating data transfer overhead.
UDF and UDTF support – Custom functions can be defined and executed inside Snowflake.
Snowflake Notebook integration – Interactive development is supported directly in Snowsight.

Personally, I still like Scala, but these days I write most data processing code in Python.

While Scala often offers better performance, Python's simplicity and extensive ecosystem make it the more practical choice in many situations.

Differences Between Spark and Snowpark

Many people immediately think of Spark when they hear the name Snowpark.

The names are similar, and the DataFrame APIs feel very familiar.

However, there are several important differences.

Aspect	Spark	Snowpark
Execution Environment	Distributed cluster	Snowflake warehouse
Data Source	HDFS, S3, and others	Primarily Snowflake tables
Scaling	Cluster size managed by user	Warehouse size adjustment
Languages	Scala, Java, Python, R, etc.	Python, Java, Scala
External Data Support	Broad ecosystem support	Primarily Snowflake-centric
Infrastructure Management	Cluster management required	Fully managed

Spark requires awareness of distributed clusters and execution mechanics.

Snowpark, on the other hand, is fundamentally a processing framework that operates on top of the Snowflake platform.

DataFrame operations are internally converted into SQL execution plans and executed by Snowflake's SQL engine.

Unlike Spark, user code is not distributed across worker nodes.

Scaling is handled by Snowflake warehouses.

UDFs are an exception. UDF code is pushed into Snowflake and executed in parallel by Snowflake's infrastructure.

A useful mental model is:

DataFrame operations → SQL generation

UDFs → Server-side parallel execution

In either case, users do not need to manage clusters or DAG execution as they would in Spark.

If your data is already centralized in Snowflake, Snowpark provides a convenient way to write Spark-like code without worrying about infrastructure management.

Of course, AWS Glue also provides a largely serverless experience, making it another convenient option in the AWS ecosystem.

Getting Started

All examples in this article are executed within Snowflake Notebooks in Snowsight.

No local Python environment or connection configuration is required—the entire workflow runs directly in the browser.

Prerequisites

From the Snowsight menu:

Create → Notebooks

Create a new notebook.

Snowpark for Python is already installed, so there is no need to run pip install.

You can start coding immediately.

Obtaining a Session

In local environments, Snowpark sessions are typically created using:

Session.builder.configs(...).create()

In Snowflake Notebooks, an active session already exists.

You can simply retrieve it:

from snowflake.snowpark.context import get_active_session

session = get_active_session()
print("Session acquired successfully!")

One of the major advantages of Snowflake Notebooks is that connection details never need to be written manually.

Basic DataFrame Operations

Let's create and manipulate a DataFrame from a table.

df = session.table("MY_DB.MY_SCHEMA.SALES_DATA")

result_df = df.select("ORDER_ID", "AMOUNT", "REGION") \
              .filter(df["REGION"] == "Asia") \
              .sort(df["AMOUNT"].desc())

result_df.show()

One important point is that no SQL is actually executed until show() is called.

We'll discuss this in more detail when covering lazy evaluation.

Aggregations

GroupBy operations feel almost identical to Spark.

from snowflake.snowpark import functions as F

summary_df = df.group_by("REGION") \
               .agg(
                   F.sum("AMOUNT").alias("TOTAL_AMOUNT"),
                   F.count("ORDER_ID").alias("ORDER_COUNT"),
                   F.avg("AMOUNT").alias("AVG_AMOUNT")
               )

summary_df.show()

When executed, these DataFrame operations are translated into SQL and run within Snowflake.

You can inspect the generated SQL using:

print(summary_df.queries)

Writing Results Back to a Table

To save results into a Snowflake table:

summary_df.write.mode("overwrite").save_as_table(
    "MY_DB.MY_SCHEMA.SALES_SUMMARY"
)

Since save_as_table() does not return a result, it's often useful to reload the table to verify the output.

session.table("MY_DB.MY_SCHEMA.SALES_SUMMARY").show()

The overwrite mode replaces the existing table.

Use append if you want to add rows instead.

How Does Lazy Evaluation Work?

Anyone familiar with Spark has likely encountered lazy evaluation.

Sometimes it can even lead to unexpected behavior during debugging.

Snowpark adopts the same fundamental concept.

Understanding Lazy Evaluation

DataFrame transformations such as:

select
filter
group_by

are not executed immediately.

These operations merely build an execution plan.

Actual execution occurs only when an action is triggered.

Common action operations include:

show()
collect()
count()
to_pandas()
write.save_as_table()
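
To make the transformation/action split concrete, here is a toy model in plain Python. It is not the Snowpark API or its implementation. `LazyFrame` and its `log` list are hypothetical names, used only to show that transformations build a plan while an action is what "sends" the SQL.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not the Snowpark API)."""

    def __init__(self, table, columns=("*",), predicates=(), log=None):
        self.table = table
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)
        self.log = log if log is not None else []  # records "executed" SQL

    # Transformations: return a new plan, execute nothing.
    def select(self, *columns):
        return LazyFrame(self.table, columns, self.predicates, self.log)

    def filter(self, predicate):
        return LazyFrame(self.table, self.columns,
                         self.predicates + (predicate,), self.log)

    @property
    def sql(self):
        """Render the plan as SQL text without running it."""
        query = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            query += " WHERE " + " AND ".join(self.predicates)
        return query

    # Action: the only place where the query is "sent".
    def collect(self):
        self.log.append(self.sql)
        return []  # a real engine would return rows here


executed = []
df = LazyFrame("LARGE_TABLE", log=executed)
plan = df.filter("STATUS = 'ACTIVE'").select("ID", "NAME")

print(plan.sql)       # the plan can be inspected...
print(len(executed))  # ...but nothing has run yet: 0
plan.collect()
print(len(executed))  # 1
```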

Verifying Lazy Evaluation

A convenient way to inspect behavior is through df.queries.

from snowflake.snowpark import functions as F

df_filtered = session.table("MY_DB.MY_SCHEMA.LARGE_TABLE") \
                     .filter(F.col("STATUS") == "ACTIVE") \
                     .select("ID", "NAME", "STATUS", "CREATED_AT")

print(df_filtered.queries)

result = df_filtered.collect()
print(f"{len(result)} rows retrieved")

When print(df_filtered.queries) runs, the generated SQL can already be inspected, but the SELECT has not been sent to Snowflake yet. It is sent only when collect() is called.

To verify execution timing precisely, we can use Query History in Snowsight.

Open:

Monitoring → Query History

Then perform the following steps:

Define the DataFrame.
Check Query History.
Confirm that no SELECT statement has been executed.
Execute collect().
Refresh Query History.
Observe that the SELECT statement now appears.

Define the DataFrame

No corresponding query appears yet.

Execute collect()

After execution, the query becomes visible.

This confirms that DataFrame definitions alone do not trigger execution.

The SQL is executed only when collect() is called.

Differences from Spark's Lazy Evaluation

In Spark, lazy evaluation constructs a DAG that is optimized and executed across a cluster.

In Snowpark, lazy evaluation ultimately produces SQL, which is then optimized and executed by Snowflake's query optimizer.

The concept is similar, but the execution engine is fundamentally different.

One particularly useful feature is that generated SQL can be inspected via df.queries, making it easier to validate execution plans.

Can We Use Caching?

If you're coming from Spark, your first instinct may be to use cache().

Snowpark provides a similar capability through cache_result().

Differences from Spark cache()

Aspect	Spark `cache()`	Snowpark `cache_result()`
Storage	Memory (and disk)	Temporary table
Lifetime	Until application ends	Until session ends
Cost	No additional write	INSERT into temporary table

Internally, cache_result() materializes results into a temporary table.

Subsequent operations reuse that table rather than re-running expensive transformations.

Example

df_heavy = session.table("MY_DB.MY_SCHEMA.LARGE_TABLE") \
                  .filter(F.col("STATUS") == "ACTIVE") \
                  .join(
                      session.table("MY_DB.MY_SCHEMA.MASTER"),
                      "ID"
                  )

cached_df = df_heavy.cache_result()

result1 = cached_df.filter(
    F.col("REGION") == "Asia"
).collect()

result2 = cached_df.group_by("REGION") \
                   .agg(F.count("*")) \
                   .collect()

cached_df.drop_table()

Using a with block is often more convenient because the temporary table is automatically dropped when the block exits.

with df_heavy.cache_result() as cached_df:
    result1 = cached_df.filter(
        F.col("REGION") == "Asia"
    ).collect()

    result2 = cached_df.group_by("REGION") \
                       .agg(F.count("*")) \
                       .collect()

Since cache_result() performs an INSERT into a temporary table, it can actually make things slower when the DataFrame is only used once.

It's most effective when the same expensive transformation is reused multiple times.
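
The "materialize once, reuse many times" idea can be tried without a Snowflake account. The sketch below is only an analogy using Python's built-in sqlite3 module, not Snowpark. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE large_table (id INTEGER, status TEXT, region TEXT);
    INSERT INTO large_table VALUES
        (1, 'ACTIVE', 'Asia'), (2, 'ACTIVE', 'Europe'),
        (3, 'INACTIVE', 'Asia'), (4, 'ACTIVE', 'Asia');
""")

# Materialize the "expensive" transformation once into a temporary table.
conn.execute("""
    CREATE TEMP TABLE cached AS
    SELECT id, region FROM large_table WHERE status = 'ACTIVE'
""")

# Both follow-up queries read the temporary table, not large_table.
asia_rows = conn.execute(
    "SELECT id FROM cached WHERE region = 'Asia' ORDER BY id"
).fetchall()
counts = conn.execute(
    "SELECT region, COUNT(*) FROM cached GROUP BY region ORDER BY region"
).fetchall()

print(asia_rows)  # [(1,), (4,)]
print(counts)     # [('Asia', 2), ('Europe', 1)]

conn.execute("DROP TABLE cached")  # comparable to dropping the cached table
conn.close()
```

The extra write happens once up front, which is why the approach only pays off when the cached result is read more than once.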

You can also observe the cache_result() behavior in Snowsight. Three queries appear in order: the temporary table creation, a SELECT from the temporary table, and another query reusing the same temporary table.

Use Cases

Let's consider some practical scenarios where Snowpark can be useful.

ETL Pipelines

Traditionally, pipelines often look like:

S3 → Glue → Redshift

With Snowpark, many transformations can be performed entirely within Snowflake.

This reduces data movement and simplifies overall architecture.

Example:

raw_df = session.table("MY_DB.MY_SCHEMA.RAW_EVENTS")

cleaned_df = raw_df \
    .filter(raw_df["EVENT_TYPE"] != "test") \
    .with_column(
        "EVENT_DATE",
        F.to_date(raw_df["EVENT_TIMESTAMP"])
    ) \
    .drop_duplicates([
        "USER_ID",
        "EVENT_DATE",
        "EVENT_TYPE"
    ])

aggregated_df = cleaned_df \
    .group_by("EVENT_DATE", "EVENT_TYPE") \
    .agg(
        F.count("USER_ID")
         .alias("USER_COUNT")
    )

aggregated_df.write.mode("overwrite") \
             .save_as_table(
                 "MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY"
             )

session.table(
    "MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY"
).show()

Custom Transformations Using UDFs

Snowpark UDFs allow complex logic that would be cumbersome in SQL to be implemented in Python.

You can register UDFs using either the @udf decorator or session.udf.register().

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType

@udf(return_type=StringType(),
     input_types=[StringType()])
def normalize_region(region: str) -> str:
    region_map = {
        "US": "North America",
        "JP": "Asia",
        "DE": "Europe",
    }
    return region_map.get(region, "Other")

df = session.table("MY_DB.MY_SCHEMA.RAW_EVENTS")

df_with_region = df.with_column(
    "NORMALIZED_REGION",
    normalize_region(df["REGION_CODE"])
)

If type hints are available, explicit type definitions can be omitted:

from snowflake.snowpark.functions import udf

@udf
def normalize_region(region: str) -> str:
    region_map = {
        "US": "North America",
        "JP": "Asia",
        "DE": "Europe",
    }
    return region_map.get(region, "Other")

That said, simple transformations are often faster when implemented using built-in SQL functions.

As always, benchmark before deciding.

Data Quality Validation

Snowpark can also be used for data quality checks before processing continues.

total_count = df.count()
null_count = df.filter(
    F.col("AMOUNT").is_null()
).count()

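# Guard against an empty table to avoid ZeroDivisionError.
if total_count == 0:
    raise ValueError("No rows found; cannot compute the NULL rate")

null_rate = null_count / total_count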

if null_rate > 0.10:
    raise ValueError(
        f"NULL rate exceeds the threshold: {null_rate:.1%}"
    )

print(
    f"Data quality check passed "
    f"(NULL rate: {null_rate:.1%})"
)

Conclusion

In this article, we explored Snowpark's fundamentals, compared it with Spark, and examined its lazy evaluation behavior.

For engineers already familiar with Spark, Snowpark should feel quite approachable.

However, it's important to remember that execution occurs on Snowflake warehouses rather than a Spark cluster.

Reviewing generated SQL and understanding how Snowflake executes queries can help avoid unexpected full-table scans and other performance issues.

If your data is already centralized in Snowflake, keeping processing inside Snowflake rather than moving data to Lambda or Glue Python Shell can be a significant advantage.

Reducing infrastructure management overhead and consolidating ETL processing within Snowflake can also improve maintainability.

One final note: throughout this experiment, I frequently relied on Cortex Code whenever I encountered errors.

The workflow of iteratively fixing notebook errors through Cortex Code was surprisingly convenient.

That said, just like any AI-assisted coding workflow, it's still important to carefully validate the generated code rather than accepting it blindly.

I hope this article helps anyone considering Snowpark for data processing within Snowflake.

Organizing How to Use AWS Glue Workflow

Aki — Fri, 22 May 2026 13:38:32 +0000

Original Japanese article: AWS Glue Workflowの使い方について整理してみる

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

Previously, I wrote an article comparing when to use AWS Step Functions versus Glue Workflow.
Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs

As I mentioned there, I personally like Glue Workflow and consider it an excellent service that balances simplicity and low cost.

However, in recent years, Step Functions has become increasingly mainstream, and I get the impression that opportunities to work with Glue Workflow have decreased.
Because of that, I think many people are unsure about how they should actually use it in practice.

So in this article, I’d like to organize everything from the basics of Glue Workflow to practical usage patterns.
I hope this helps more people become interested in Glue Workflow.

What Is Glue Workflow?

AWS Glue Workflow is Glue’s native workflow orchestration feature.
It defines ETL pipelines by combining the following three elements:

Component	Role
Glue Job	The actual ETL processing using Spark or Python Shell
Glue Crawler	Scans data sources such as S3 and registers table definitions in the Data Catalog
Glue Trigger	Defines execution conditions for Jobs and Crawlers (schedule, event, conditional, etc.)

By connecting these components as a DAG (Directed Acyclic Graph), you can build ETL pipelines.

Another characteristic is that workflows can be visually configured through the Glue console GUI.

A major advantage is that workflows themselves incur no additional cost—you only pay for Job and Crawler execution.

Details of the Core Components

The core of Glue Workflow is the Trigger mechanism.
There are four trigger types, each with different roles.

Trigger Type	Execution Condition	Typical Use Case
SCHEDULED	Scheduled execution using cron expressions (UTC, minimum 5-minute interval)	Periodic execution such as daily ETL
ON_DEMAND	Manual execution or via API/SDK	Arbitrary execution timing
CONDITIONAL	Triggered based on the status of preceding Jobs/Crawlers	Chained execution between Jobs
EVENT	Triggered by EventBridge events	Event-driven pipelines

The typical pattern is:

Use SCHEDULED, ON_DEMAND, or EVENT as the workflow’s starting trigger
Use CONDITIONAL to connect downstream Jobs

Note that AWS recommends keeping the total number of elements in a workflow (Jobs + Crawlers + Triggers) to 100 or fewer.
Exceeding this recommendation can cause errors when resuming or stopping Workflow Runs.

CONDITIONAL Trigger Usage Patterns

CONDITIONAL Triggers are one of the key features that provide flexibility in Glue Workflow.
Here are several representative patterns.

1. Simple Sequential Pattern

This is the most basic usage pattern: “Run JobB after JobA succeeds.”

JobA (SUCCEEDED) → JobB

You can implement this simply by configuring a CONDITIONAL Trigger with the condition:
“Start when JobA reaches the SUCCEEDED state.”

2. Waiting for Multiple Jobs to Complete (AND Condition)

This pattern runs JobC only after both JobA and JobB succeed.

JobA (SUCCEEDED) ┐
                 ├→ JobC
JobB (SUCCEEDED) ┘

This can be implemented by setting Logical: AND in the CONDITIONAL Trigger predicate and listing multiple conditions.

This is useful in scenarios such as:
“Run an aggregation Job only after multiple data sources have finished loading.”

3. Triggering When Any Job Completes (ANY Condition)

This pattern runs JobC when either JobA or JobB succeeds.

The predicate of a CONDITIONAL Trigger has a Logical field where you can specify either AND or ANY.
Using ANY causes the trigger to fire as soon as any one of the specified conditions is satisfied.

Note that although this behavior is logically equivalent to “OR,” the actual Glue configuration value is ANY.
This is important when defining workflows using IaC or CLI because specifying OR will result in an error.
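
Patterns 1 to 3 differ only in the predicate, so they can be sketched with one small helper that builds the keyword arguments for boto3's `create_trigger`. This is a minimal sketch. The helper names and the trigger, workflow, and job names are placeholders of my own, and only job conditions (not crawler conditions) are covered.

```python
def job_conditions(job_names, state="SUCCEEDED"):
    """One EQUALS condition per upstream job."""
    return [
        {"LogicalOperator": "EQUALS", "JobName": name, "State": state}
        for name in job_names
    ]


def conditional_trigger(name, workflow, upstream_jobs, target_job, logical="AND"):
    """Build keyword arguments for glue.create_trigger (CONDITIONAL type)."""
    if logical not in ("AND", "ANY"):
        # "OR" is not a valid value; Glue expects ANY.
        raise ValueError(f"Logical must be AND or ANY, got {logical!r}")
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": logical,
            "Conditions": job_conditions(upstream_jobs),
        },
        "Actions": [{"JobName": target_job}],
        "StartOnCreation": True,
    }


# Pattern 2: run JobC only after both JobA and JobB succeed.
wait_for_both = conditional_trigger(
    "wait-for-a-and-b", "my-workflow", ["JobA", "JobB"], "JobC", logical="AND"
)
print(wait_for_both["Predicate"]["Logical"])  # AND
```

With real credentials, the result would be passed as `boto3.client("glue").create_trigger(**wait_for_both)`. Switching `logical` to "ANY" gives pattern 3.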

4. Failure Branching (Catching FAILED States)

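CONDITIONAL Triggers can react not only to SUCCEEDED, but also to states such as:

FAILED
STOPPED
TIMEOUT

(The supported states differ between Jobs and Crawlers. Job conditions can watch SUCCEEDED, STOPPED, FAILED, and TIMEOUT, while Crawler conditions can watch SUCCEEDED, FAILED, and CANCELLED.)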

Using this feature, you can create patterns such as launching a notification Job (for example, a Python Shell Job that publishes to SNS) when a Job fails.

JobA (SUCCEEDED) → Downstream Processing
JobA (FAILED)    → Notification Job

For relatively simple error handling, this approach allows you to avoid introducing Step Functions.
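
The success/failure branch above can be sketched as two CONDITIONAL triggers on the same upstream Job. The function and the trigger naming scheme are my own placeholders. The dictionaries follow the shape of boto3's `create_trigger` arguments.

```python
FAILURE_STATES = ("FAILED", "STOPPED", "TIMEOUT")


def branch_triggers(workflow, upstream_job, success_job, notify_job):
    """Return create_trigger kwargs for a success branch and a failure branch."""

    def trigger(name, states, target, logical):
        return {
            "Name": name,
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {
                "Logical": logical,
                "Conditions": [
                    {"LogicalOperator": "EQUALS",
                     "JobName": upstream_job, "State": state}
                    for state in states
                ],
            },
            "Actions": [{"JobName": target}],
            "StartOnCreation": True,
        }

    return [
        trigger(f"{upstream_job}-on-success", ("SUCCEEDED",), success_job, "AND"),
        # ANY: fire the notification if the job ends in any non-success state.
        trigger(f"{upstream_job}-on-failure", FAILURE_STATES, notify_job, "ANY"),
    ]


for kwargs in branch_triggers("my-workflow", "JobA", "JobB", "NotifyJob"):
    print(kwargs["Name"], "->", kwargs["Actions"][0]["JobName"])
```

Each dictionary would be passed to `glue.create_trigger(**kwargs)`. Using ANY on the failure branch avoids needing one trigger per failure state.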

Managing Parameters with `default_run_properties`

Glue Workflow provides a property called default_run_properties, which acts like globally shared variables across the workflow.

How It Works

default_run_properties stores key-value pairs that can be referenced by all Jobs within the workflow.

It functions as the default set of parameters passed during Workflow execution and serves as the foundation for sharing information between Jobs.

One important note:
Run Property values may appear in logs, so you should avoid storing secrets directly in them.

Instead, retrieve secrets through services such as:

AWS Secrets Manager
Glue Connections

How to Configure It

There are three main configuration methods.

Configure from the Console

In the Glue console:
Workflow → Edit Properties

You can add key-value pairs there.

Configure via boto3

import boto3

glue = boto3.client('glue')

glue.create_workflow(
    Name='my-workflow',
    DefaultRunProperties={
        'env': 'production',
        'target_date': '2026-05-21'
    }
)

Configure via IaC (CloudFormation, etc.)

You can specify DefaultRunProperties in the AWS::Glue::Workflow resource.
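
As a rough sketch of the IaC route, the snippet below generates a minimal CloudFormation template in JSON from Python. The logical ID `EtlWorkflow` and the property values are placeholders I chose for the example. Only `Name` and `DefaultRunProperties` are set.

```python
import json


def workflow_template(name, default_run_properties):
    """Build a minimal CloudFormation template containing one Glue Workflow."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "EtlWorkflow": {  # logical ID is arbitrary
                "Type": "AWS::Glue::Workflow",
                "Properties": {
                    "Name": name,
                    "DefaultRunProperties": default_run_properties,
                },
            }
        },
    }


template = workflow_template("my-workflow", {"env": "production"})
print(json.dumps(template, indent=2))
```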

Static Parameters vs Dynamic Parameters

Typical usage patterns include:

Static parameters
- Environment names (dev / prod)
- S3 bucket names
- Values that rarely change
Dynamic parameters
- Processing dates
- Execution IDs
- Values that change per execution

Dynamic parameters can be updated either by:

Passing RunProperties to start_workflow_run
Dynamically updating them later using put_workflow_run_properties
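
For the first option, a small wrapper can convert per-run values into the string-to-string map that run properties expect. This is a sketch. `to_run_properties` and `start_run` are hypothetical helper names, and `glue` is expected to be a boto3 Glue client (or any object with the same `start_workflow_run` method).

```python
from datetime import date


def to_run_properties(values):
    """Run properties are string-to-string maps, so stringify every value."""
    return {
        str(key): value.isoformat() if isinstance(value, date) else str(value)
        for key, value in values.items()
    }


def start_run(glue, workflow_name, **dynamic_values):
    """Start a workflow run and return its RunId."""
    response = glue.start_workflow_run(
        Name=workflow_name,
        RunProperties=to_run_properties(dynamic_values),
    )
    return response["RunId"]


print(to_run_properties({"target_date": date(2026, 5, 21), "retries": 3}))
# {'target_date': '2026-05-21', 'retries': '3'}
```

In a real script this would be called as `start_run(boto3.client("glue"), "my-workflow", target_date=date.today())`.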

Passing Data Between Jobs

Using default_run_properties as the foundation, let’s look at how to exchange data between Jobs.

Dynamically Updating Run Properties

During Job execution, you can dynamically update Workflow Run properties by calling the put_workflow_run_properties API.

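import sys
import boto3
from awsglue.utils import getResolvedOptions

# WORKFLOW_NAME and WORKFLOW_RUN_ID are passed automatically
# when the Job is launched through a Workflow.
args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])

glue = boto3.client('glue')

glue.put_workflow_run_properties(
    Name=args['WORKFLOW_NAME'],
    RunId=args['WORKFLOW_RUN_ID'],
    RunProperties={
        'processed_records': '12345',
        'output_path': 's3://mybucket/output/2026-05-21/'
    }
)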

This allows downstream Jobs to reference values calculated by upstream Jobs.

Retrieving Properties from a PySpark Job

Inside a PySpark Job, you first retrieve the Workflow Run ID and then call get_workflow_run_properties.

import sys
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(
    sys.argv,
    ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID']
)

glue = boto3.client('glue')

response = glue.get_workflow_run_properties(
    Name=args['WORKFLOW_NAME'],
    RunId=args['WORKFLOW_RUN_ID']
)

properties = response['RunProperties']

target_date = properties.get('target_date')

WORKFLOW_NAME and WORKFLOW_RUN_ID are special arguments automatically passed when a Job is launched through a Workflow.

Retrieving Properties from a Python Shell Job

The basic approach is the same for Python Shell Jobs:

Retrieve arguments using getResolvedOptions
Access properties through boto3

Since Python Shell Jobs do not require SparkContext initialization, the code can be kept more lightweight.

They are also cheaper than Spark-based Glue Jobs, so depending on the requirements, they can be a good option.

Personally, I also like Python Shell Jobs, although I feel opportunities to use them in real-world projects have decreased, which is a bit unfortunate.

I’ve also written articles about Python Shell Jobs, so feel free to check them out.

S3 Triggers: How to Launch Glue Python Shell via AWS Lambda

Anti-Patterns for Data Passing

default_run_properties should only be used for metadata-like information exchange.

The following usage patterns should generally be avoided:

Passing large datasets directly
- Store the data itself in S3 and pass only the path
Storing secrets
- Use Secrets Manager or Glue Connections instead
Frequently rewriting properties
- This increases API calls and introduces race condition risks

Integrating with EventBridge and Other Services

Glue Workflow becomes much more flexible when combined with EventBridge.

EventBridge-Based Startup (EVENT Trigger)

Glue Workflow can be started directly by EventBridge events.

This is achieved by setting the Trigger Type to EVENT.

aws glue create-trigger \
  --workflow-name my-workflow \
  --type EVENT \
  --name s3-arrival-trigger \
  --actions JobName=my-job

By configuring an EventBridge rule with Glue Workflow as the target, the workflow starts automatically when an event occurs.

However, appropriate IAM permissions such as glue:notifyEvent are required.

Batch Event Startup

EVENT Triggers also support event batching.

Using EventBatchingCondition, you can configure the workflow to start when either:

N events arrive
M seconds pass since the first event arrived

aws glue create-trigger \
  --workflow-name my-workflow \
  --type EVENT \
  --name batch-trigger \
  --event-batching-condition BatchSize=10,BatchWindow=300 \
  --actions JobName=my-job

This enables patterns such as:
“Run ETL once 10 files have arrived, or 5 minutes after the first file, whichever comes first.”

The maximum batch window is 900 seconds (15 minutes).
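
To get a feel for how the two conditions interact, here is a simplified simulation in plain Python. It is my own model of the behavior described above, not Glue's actual implementation. A batch opens at its first event and fires when either `batch_size` events have arrived or `batch_window` seconds have passed.

```python
def batch_fire_times(arrivals, batch_size, batch_window):
    """Return the times (in seconds) at which a batched trigger would fire."""
    fire_times = []
    window_start = None  # arrival time of the first event in the open batch
    count = 0
    for t in sorted(arrivals):
        # Close a batch whose window expired before this event arrived.
        if window_start is not None and t - window_start >= batch_window:
            fire_times.append(window_start + batch_window)
            window_start, count = None, 0
        if window_start is None:
            window_start = t
        count += 1
        if count == batch_size:
            fire_times.append(t)
            window_start, count = None, 0
    if window_start is not None:
        # Leftover events fire when their window ends.
        fire_times.append(window_start + batch_window)
    return fire_times


# Three files arriving at 0s, 10s, and 20s with BatchSize=2, BatchWindow=300.
print(batch_fire_times([0, 10, 20], batch_size=2, batch_window=300))  # [10, 320]
```

The second run in this example starts 300 seconds after the third file arrived, which is worth keeping in mind when a late straggler file needs to be processed quickly.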

Starting from S3 Events (The Parameter Passing Limitation)

A common use case is:
“Start a workflow when a file is uploaded to S3.”

When starting Glue Workflow through EventBridge, the event IDs are automatically stored in a Run Property called aws:eventIds.

event_ids = glue_client.get_workflow_run_properties(
    Name=workflow_name,
    RunId=workflow_run_id
)['RunProperties']['aws:eventIds']

The returned value looks like:

'["abc-123", "def-456"]'
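
Because the value is a string rather than a list, it needs to be decoded before use. The sketch below assumes the value is a JSON-formatted array, as in the sample above. `parse_event_ids` is a helper name I made up.

```python
import json


def parse_event_ids(run_properties):
    """Decode the aws:eventIds run property into a Python list."""
    raw = run_properties.get("aws:eventIds", "[]")
    return json.loads(raw)


props = {"aws:eventIds": '["abc-123", "def-456"]'}
print(parse_event_ids(props))  # ['abc-123', 'def-456']
```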

However, this is where one of Glue Workflow’s limitations becomes apparent.

The EventBridge event payload itself (such as the S3 object key or bucket name) is not automatically passed as Run Properties.

Only the event IDs are provided.

If you need the actual object details, your Job must retrieve the corresponding event contents from CloudTrail, which becomes somewhat cumbersome.

Because of this, many cases are easier to manage by placing Lambda in the middle and explicitly calling start_workflow_run with structured Run Properties.

S3 PUT → EventBridge → Lambda → start_workflow_run (pass parameters via RunProperties)

Example Lambda code:

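import boto3

# Create the client once so warm invocations can reuse it
glue = boto3.client('glue')

def lambda_handler(event, context):
    # S3 "Object Created" events delivered through EventBridge carry the bucket and key under detail
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    # Hand the object location to the workflow as Run Properties
    glue.start_workflow_run(
        Name='my-workflow',
        RunProperties={
            'source_bucket': bucket,
            'source_key': key
        }
    )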
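On the Job side, the values passed this way can be read back with get_workflow_run_properties. The sketch below takes the client as an argument so it can be exercised with a stub. StubGlue and the sample values are purely illustrative. In a real Job you would pass boto3.client('glue') together with the WORKFLOW_NAME and WORKFLOW_RUN_ID job arguments.

```python
def read_source_location(glue_client, workflow_name, run_id):
    """Return (bucket, key) stored in the workflow's Run Properties."""
    props = glue_client.get_workflow_run_properties(
        Name=workflow_name, RunId=run_id
    )["RunProperties"]
    return props["source_bucket"], props["source_key"]


class StubGlue:
    """Illustrative stand-in for boto3.client('glue')."""

    def get_workflow_run_properties(self, Name, RunId):
        return {
            "RunProperties": {
                "source_bucket": "my-bucket",
                "source_key": "incoming/data.csv",
            }
        }


bucket, key = read_source_location(StubGlue(), "my-workflow", "wr_xxxx")
print(f"s3://{bucket}/{key}")  # s3://my-bucket/incoming/data.csv
```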

With Step Functions, the EventBridge payload can be received directly using paths such as $.detail, so there is no need for an intermediate Lambda function.

This is one of the areas where Glue Workflow limitations become noticeable compared to Step Functions.
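For comparison, here is roughly what the Step Functions side could look like, written as a Python dict and serialized to ASL. The job name and argument names are placeholders. The point is only that the event fields are referenced directly through JSONPath.

```python
import json

# Minimal state machine: run one Glue Job and pass S3 event fields as job arguments
state_machine = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "my-job",  # placeholder job name
                "Arguments": {
                    # ".$" keys are resolved from the incoming EventBridge event
                    "--source_bucket.$": "$.detail.bucket.name",
                    "--source_key.$": "$.detail.object.key",
                },
            },
            "End": True,
        }
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```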

Detecting Workflow Completion via EventBridge

Glue Workflow status changes are emitted as EventBridge events.

You can use this to:

Send SNS notifications
Trigger downstream systems
Launch post-processing workflows

Glue Workflow (COMPLETED/FAILED)
    → EventBridge
        → SNS / Lambda

This is especially useful when you want separate processing for success and failure cases.

Calling Glue Workflow from Step Functions

It is also possible to launch Glue Workflow from Step Functions.

However, since there is no .sync integration pattern available, you must implement your own polling logic to detect completion.

I covered this in detail in the previous article, so feel free to refer to it if interested.

Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs

Operational Tips

Here are several practical points worth knowing when operating Glue Workflow in production.

Resuming Failed Workflows (ResumeWorkflowRun)

Glue Workflow provides a feature called ResumeWorkflowRun, which allows resuming execution from failed nodes.

In the console, you can:

Open the failed Workflow Run detail page
Select the nodes to resume
Enable the “Resume” checkbox

It is also available through CLI/API.

aws glue resume-workflow-run \
  --name my-workflow \
  --run-id wr_xxxx \
  --node-ids node_yyyy node_zzzz

When resumed:

The specified nodes
And all downstream nodes

are re-executed.

The resumed workflow is tracked using a new Run ID.

Compared with Step Functions Redrive, however, there are several limitations:

You must explicitly specify failed nodes
Retrieving node IDs requires calling: get-workflow-run --include-graph
Additional IAM permissions (glue:ResumeWorkflowRun) are required
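To give a sense of the extra work involved, the node lookup and resume call could be scripted as follows. This is a sketch that only considers Job nodes and treats FAILED, ERROR, and TIMEOUT as failures. The sample graph is made up for illustration.

```python
FAILED_STATES = {"FAILED", "ERROR", "TIMEOUT"}


def failed_node_ids(run):
    """Return UniqueIds of graph nodes that have a failed Job run."""
    ids = []
    for node in run.get("Graph", {}).get("Nodes", []):
        job_runs = node.get("JobDetails", {}).get("JobRuns", [])
        if any(r.get("JobRunState") in FAILED_STATES for r in job_runs):
            ids.append(node["UniqueId"])
    return ids


def resume_failed(glue_client, workflow_name, run_id):
    """Resume a workflow run from its failed Job nodes; returns the new Run ID or None."""
    run = glue_client.get_workflow_run(
        Name=workflow_name, RunId=run_id, IncludeGraph=True
    )["Run"]
    node_ids = failed_node_ids(run)
    if not node_ids:
        return None
    response = glue_client.resume_workflow_run(
        Name=workflow_name, RunId=run_id, NodeIds=node_ids
    )
    return response["RunId"]


# Made-up graph for illustration
sample_run = {"Graph": {"Nodes": [
    {"Type": "JOB", "Name": "extract", "UniqueId": "node_aaaa",
     "JobDetails": {"JobRuns": [{"JobRunState": "SUCCEEDED"}]}},
    {"Type": "JOB", "Name": "transform", "UniqueId": "node_yyyy",
     "JobDetails": {"JobRuns": [{"JobRunState": "FAILED"}]}},
    {"Type": "TRIGGER", "Name": "start", "UniqueId": "node_tttt"},
]}}
print(failed_node_ids(sample_run))  # ['node_yyyy']
```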

For simple retry scenarios, restarting the entire workflow via ON_DEMAND execution is often easier.

For pipelines with more sophisticated recovery requirements, Step Functions tends to provide a smoother operational experience.

Monitoring with CloudWatch

Glue Workflow execution states can be viewed in the Glue console, but detailed Job and Crawler logs are output to CloudWatch Logs.

Typical monitoring targets include:

Workflow Run list
- Glue Console
Job execution logs
- /aws-glue/jobs/output
- /aws-glue/jobs/error
Metrics
- CloudWatch Metrics
- Execution duration
- DPU usage

In practice, monitoring and alerting strategies often combine:

Workflow-level statuses
Job-level metrics

depending on the situation.

Handling the 100-Object Recommendation Limit

As mentioned earlier, AWS recommends keeping the total number of objects in a workflow (Jobs + Crawlers + Triggers) below 100.

If a large pipeline approaches this limit, consider:

Splitting workflows
- Trigger downstream workflows after upstream completion
Consolidating common processing inside Jobs
- Merge smaller processing units into larger Jobs

In my experience, workflows approaching 100 objects are often already too complex from a design perspective.
At that point, it may be worth reconsidering the architecture itself or migrating to Step Functions.
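To see how close a workflow is to that number, the graph returned by get_workflow can be counted by node type. A small sketch, with a made-up graph standing in for the real response:

```python
from collections import Counter


def count_workflow_objects(workflow):
    """Count graph nodes (JOB / CRAWLER / TRIGGER) in a get_workflow response body."""
    nodes = workflow.get("Graph", {}).get("Nodes", [])
    return Counter(node["Type"] for node in nodes)


# In practice:
#   workflow = boto3.client("glue").get_workflow(
#       Name="my-workflow", IncludeGraph=True)["Workflow"]
sample_workflow = {"Graph": {"Nodes": [
    {"Type": "TRIGGER", "Name": "start"},
    {"Type": "JOB", "Name": "extract"},
    {"Type": "TRIGGER", "Name": "after-extract"},
    {"Type": "CRAWLER", "Name": "catalog-update"},
]}}

counts = count_workflow_objects(sample_workflow)
total = sum(counts.values())
print(dict(counts), total)
if total >= 80:  # arbitrary warning threshold chosen for this sketch
    print("Consider splitting this workflow")
```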

Where Glue Workflow Fits — and Its Limitations

Glue Workflow shines in scenarios such as:

Simple ETL pipelines completed entirely within Glue Jobs and Crawlers
Periodic ingestion pipelines for data lakes
Lightweight small-to-medium scale pipelines that need to be launched quickly

On the other hand, Step Functions is generally better suited for:

Integrations involving Lambda, ECS, and other AWS services
Complex branching and dynamic parameter control
Large-scale development involving multiple teams

I discussed these decision criteria in more detail in the previous article, so feel free to refer to it.

Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs

Conclusion

In this article, I organized AWS Glue Workflow from the basics through practical usage patterns.

To summarize:

Four trigger types: SCHEDULED / ON_DEMAND / CONDITIONAL / EVENT
CONDITIONAL Triggers: Flexible flow control using AND / ANY conditions and failure branching
default_run_properties: Shared workflow-wide parameter management
Data passing between Jobs: Dynamic value propagation using put_workflow_run_properties
EventBridge integration: Event-driven execution via EVENT Trigger (although Lambda-based parameter passing is often easier)
ResumeWorkflowRun: Partial restart functionality from failed nodes

After writing all of this, I still have to admit:
Step Functions is generally easier to use.

That said, Glue Workflow still provides meaningful value today because it allows you to build Glue-centric ETL pipelines in a very simple and cost-efficient way.

Rather than defaulting to Step Functions automatically, understanding Glue Workflow properly and knowing when to use it can broaden your architectural options.

I hope this article helps both people who are starting to use Glue Workflow and those already working with it.

Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs

Aki — Wed, 20 May 2026 12:09:21 +0000

Original Japanese article: Glue JobのETL処理におけるAWS Step FunctionsとGlue Workflowの使い分けを整理する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

When building data pipelines on AWS, deciding which workflow orchestration tool to use is an important architectural decision.
Especially when designing ETL pipelines centered around Glue Jobs, the following two services are commonly compared:

AWS Step Functions
AWS Glue Workflow

Both services are workflow orchestration tools that manage job dependencies and execute processes sequentially.
However, since they are designed with different strengths and philosophies, choosing the right one based on your requirements is critical.

In this article, I’ll organize the characteristics, advantages, disadvantages, and use cases of these two services, and discuss how to decide which one to choose.

Personally, I really like Glue Workflow, but I’ve had fewer opportunities to use it recently, which is honestly a bit disappointing.
I also like Glue Python Shell, but compared to standard Glue Jobs, I rarely get to use it these days either...

Major Workflow Orchestration Tools

AWS provides multiple workflow orchestration services, but the two most commonly compared for Glue Job–based data pipelines are the following:

Service	Overview
AWS Step Functions	AWS managed workflow orchestration service. Can flexibly orchestrate a wide range of AWS services such as Lambda, Glue, ECS, and more
AWS Glue Workflow	Glue-native workflow feature. Defines pipelines using Glue Jobs, Crawlers, and Triggers

AWS also offers Amazon Managed Workflows for Apache Airflow (MWAA).
MWAA becomes a strong option when you need highly complex dependency management or cross-cloud orchestration. However, in this article, I’ll focus specifically on comparing Step Functions and Glue Workflow.

Recently, many teams have also adopted workflow tools such as Airflow, Dagster, and Prefect. As always, selecting the right tool depends heavily on your requirements and goals.

AWS Step Functions

Overview

(Figure above: Visual workflow definition using Workflow Studio)

AWS Step Functions is a workflow orchestration service that defines workflows using Amazon States Language (ASL), a JSON-based language (tooling such as CloudFormation also lets you write definitions in YAML).

It integrates with over 200 AWS services including Lambda, Glue, ECS, SNS, and DynamoDB, and supports flexible workflow controls such as conditional branching, parallel execution, and retry handling.

Features

Step Functions provides two workflow types:

Type	Characteristics
Standard Workflow	Long-running execution, audit logs, exactly-once execution guarantee
Express Workflow	High throughput, low cost, optimized for short-lived processing (at-least-once)

In the context of data pipelines, Standard Workflow is generally the more common choice.

Step Functions also provides AWS Step Functions Workflow Studio, a visual editor that allows workflows to be built through a GUI.

Advantages

Multi-service integration: Easily integrates Glue with other AWS services such as Lambda, ECS, SNS, and more
Flexible control flow: Supports conditional branching (Choice), parallel execution (Parallel/Map), and advanced error handling (Catch/Retry)
High observability: Execution history and per-step states can be visually inspected in the console
Event-driven integration: Easily triggered through EventBridge using S3 uploads or schedules
Parallel execution support (Map): Well suited for large-scale processing such as file-level or partition-level parallel execution. Distributed Map is especially useful for high-scale parallel workloads
Infrastructure as Code support: Can be fully managed through CloudFormation, CDK, or Terraform

Disadvantages

Learning curve: Requires understanding ASL, and complex workflows can become verbose
Cost: Standard Workflow pricing is based on state transitions, so costs can increase with larger workflows
Glue integration setup: IAM roles, parameter passing, and Glue Job integration must be configured manually

Use Cases

Step Functions is a strong fit for scenarios such as:

Hybrid pipelines combining Glue with Lambda, ECS, or other AWS services
Complex workflows requiring branching, parallel processing, or dynamic parameter passing
Pipelines that require SNS notifications or compensating actions on failures
Teams managing infrastructure strictly through IaC
Large-scale data platforms involving multiple teams

AWS Glue Workflow

Overview

(Figure above: Visual workflow definition using Glue Workflow)

AWS Glue Workflow is Glue’s native workflow orchestration feature.

It defines pipelines using three primary components:

Glue Jobs
Glue Crawlers
Glue Triggers

Pipelines can be configured directly from the Glue console GUI, making it easy to build Glue-centric ETL workflows quickly.

Features

The major components of Glue Workflow are as follows:

Component	Role
Glue Job	Actual ETL processing using Spark or Python Shell
Glue Crawler	Scans data sources such as S3 and registers table metadata into the Data Catalog
Glue Trigger	Defines execution conditions such as schedules, events, or conditional dependencies

Glue Triggers support four types:

SCHEDULED
ON_DEMAND
CONDITIONAL
EVENT

This allows chained execution patterns, such as triggering downstream jobs based on upstream job success or failure.
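As an illustration of such chaining, a CONDITIONAL trigger that starts a downstream job after upstream jobs succeed could be defined like this. The sketch only builds the arguments for boto3's create_trigger. All names are placeholders.

```python
def conditional_trigger_kwargs(name, workflow, upstream_jobs, downstream_job,
                               logical="AND", state="SUCCEEDED"):
    """Build create_trigger arguments that start downstream_job once upstream jobs reach state."""
    conditions = [
        {"LogicalOperator": "EQUALS", "JobName": job, "State": state}
        for job in upstream_jobs
    ]
    predicate = {"Conditions": conditions}
    if len(conditions) > 1:
        predicate["Logical"] = logical  # AND: all conditions, ANY: at least one
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": predicate,
        "Actions": [{"JobName": downstream_job}],
    }


kwargs = conditional_trigger_kwargs(
    "after-extract", "my-workflow", ["extract-a", "extract-b"], "transform"
)
# With credentials configured:
#   boto3.client("glue").create_trigger(**kwargs)
print(kwargs["Predicate"])
```

Passing state="FAILED" instead gives the failure branch, for example to start a cleanup or notification job.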

Advantages

Native Glue integration: Seamlessly integrates with Glue Jobs and Crawlers with minimal additional setup
Simple configuration: DAG-style workflows can be built intuitively through the console GUI
Low cost: No additional charge for the workflow itself beyond Glue Job execution costs
Easy crawler integration: Natural fit for workflows that update the Data Catalog after ETL execution
Glue Data Catalog integration: Job execution metadata and lineage can be managed centrally within Glue

Disadvantages

Glue-only orchestration: Cannot directly orchestrate Lambda, ECS, or other non-Glue services
Limited event-driven capabilities: Primarily Trigger- and schedule-based, making advanced event integration less flexible than Step Functions
Limited control flow: Weak support for advanced branching and dynamic parameter handling
Observability limitations: Detailed execution logs often require separate CloudWatch investigation
More difficult IaC management: CloudFormation and Terraform management can become cumbersome compared to Step Functions
Limited parallel execution control: Not ideal for fine-grained parallelization or Map-style orchestration
Weaker retry/re-execution control: Re-running only failed portions of a workflow is less flexible than in Step Functions

Use Cases

Glue Workflow is well suited for:

Simple ETL pipelines composed entirely of Glue Jobs and Crawlers
Periodic ingestion pipelines into S3-based data lakes with automatic catalog updates
Small teams or early-stage projects that need fast implementation
Organizations that prefer operating primarily through the Glue console

Which One Should You Choose?

Based on the characteristics of both services, the selection criteria can generally be summarized as follows.

When to Choose Glue Workflow

Your ETL pipeline consists only of Glue Jobs and Crawlers
Simple sequential execution and conditional triggers are sufficient
You want to build quickly (prototypes or small-scale projects)
Your operations are centered around the Glue console
You want to minimize costs

When to Choose Step Functions

You need integration with Lambda, ECS, or other AWS services
You require branching, parallel processing, or advanced error handling
You are adopting an EventBridge-centric event-driven architecture
You want strict Infrastructure as Code management
Multiple teams are involved in operating the data platform
Observability and audit logging are important

Summary of Decision Criteria

Perspective	Glue Workflow	Step Functions
Supported Services	Glue only	Broad AWS integration
Control Flow	Simple	Flexible and advanced
Observability	△	◎
Ease of Configuration	◎	△ (learning curve)
Cost	Low	Depends on state transitions
IaC Management	△	◎
Crawler Integration	◎	△ (manual setup required)

Glue Workflow is fundamentally Trigger-oriented and is not designed for advanced event orchestration like Step Functions.

Unless Glue Workflow specifically satisfies your requirements better, choosing Step Functions is generally the safer long-term option.
Personally, when designing architectures, I often start by considering Step Functions first.

That said, Glue Workflow remains a strong choice when the requirement is simply:

“I want to quickly build a Glue-centric ETL pipeline.”

Combining Both Services

Instead of choosing one or the other exclusively, it is also possible to invoke Glue Workflow from Step Functions.

For example, Step Functions can handle preprocessing and postprocessing with Lambda, while delegating the ETL core to Glue Workflow.

However, this introduces additional complexity because workflow state coordination between the two services must be managed carefully.
If simplicity is important, standardizing on one orchestration tool is generally easier operationally.

Caveats When Invoking Glue Workflow from Step Functions

1. Completion Detection Requires Polling

Step Functions provides a convenient .sync integration pattern for Glue StartJobRun (glue:startJobRun.sync), which waits for job completion automatically.

However, Glue StartWorkflowRun does not support the .sync integration pattern.

While you can invoke Glue Workflow through SDK integration, Step Functions will immediately proceed to the next state without waiting for completion.
As a result, you must implement custom polling logic to repeatedly check the WorkflowRun status.

2. Polling Logic Becomes Complex

You typically need to implement a loop like:

Wait
GetWorkflowRun
Choice (RUNNING / COMPLETED / FAILED)

This increases both the number of states and the verbosity of the ASL definition.
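The same loop, expressed in Python rather than ASL, looks roughly like the sketch below. As far as I know, the WorkflowRun status values are RUNNING, COMPLETED, STOPPING, STOPPED, and ERROR, and a run can end as COMPLETED even when individual jobs failed, so checking Statistics.FailedActions as well is worth considering. StubGlue is an illustrative stand-in for the real client.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def wait_for_workflow_run(glue_client, name, run_id,
                          interval=30, timeout=3600, sleep=time.sleep):
    """Poll get_workflow_run until the run reaches a terminal status."""
    waited = 0
    while True:
        run = glue_client.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATUSES:
            return run
        if waited >= timeout:
            raise TimeoutError(f"{name}/{run_id} still {run['Status']} after {waited}s")
        sleep(interval)
        waited += interval


class StubGlue:
    """Stand-in for boto3.client('glue'): RUNNING twice, then COMPLETED."""

    def __init__(self):
        self.calls = 0

    def get_workflow_run(self, Name, RunId):
        self.calls += 1
        status = "COMPLETED" if self.calls >= 3 else "RUNNING"
        return {"Run": {"Status": status, "Statistics": {"FailedActions": 0}}}


stub = StubGlue()
result = wait_for_workflow_run(stub, "my-workflow", "wr_xxxx", sleep=lambda s: None)
print(result["Status"], stub.calls)  # COMPLETED 3
```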

3. Error Handling Becomes More Complicated

Glue Workflow status is returned at the WorkflowRun level rather than the individual Job level.

As a result, identifying which specific Glue Job failed requires additional parsing logic against the GetWorkflowRun response.
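A sketch of that parsing logic is shown below, assuming the run was fetched with IncludeGraph=True. The sample response is made up. The field names follow the boto3 get_workflow_run response shape as I understand it.

```python
FAILED_STATES = {"FAILED", "ERROR", "TIMEOUT"}


def failed_jobs(run):
    """Return (job name, job run ID, error message) for each failed Job run in the graph."""
    failures = []
    for node in run.get("Graph", {}).get("Nodes", []):
        if node.get("Type") != "JOB":
            continue
        for job_run in node.get("JobDetails", {}).get("JobRuns", []):
            if job_run.get("JobRunState") in FAILED_STATES:
                failures.append(
                    (node["Name"], job_run.get("Id"), job_run.get("ErrorMessage", ""))
                )
    return failures


# Made-up response body for illustration
sample_run = {"Status": "COMPLETED", "Graph": {"Nodes": [
    {"Type": "JOB", "Name": "extract",
     "JobDetails": {"JobRuns": [{"Id": "jr_1", "JobRunState": "SUCCEEDED"}]}},
    {"Type": "JOB", "Name": "transform",
     "JobDetails": {"JobRuns": [{"Id": "jr_2", "JobRunState": "FAILED",
                                 "ErrorMessage": "example error"}]}},
    {"Type": "CRAWLER", "Name": "catalog-update"},
]}}

for name, run_id, message in failed_jobs(sample_run):
    print(f"{name} ({run_id}): {message}")
```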

Because of these complications, although combining both services is technically possible, I generally recommend avoiding Step Functions → Glue Workflow orchestration unless there is a compelling reason.

One possible use case is when you need to extend an existing Glue Workflow–based system incrementally.
Even then, rebuilding the orchestration directly in Step Functions using existing Glue Jobs often feels cleaner.

Migration Considerations

Some teams initially adopt Glue Workflow during the early stages of a project and later migrate to Step Functions as the data platform grows.

When migrating, the following considerations become important:

Workflow definitions must be rewritten: Glue Trigger definitions need to be converted into ASL
IAM roles must be redesigned: Step Functions requires permissions to invoke Glue Jobs
Extensive testing is necessary: Existing jobs must be validated carefully after migration
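For the simplest case, a linear chain of jobs connected by "on success" triggers, the rewrite can be sketched mechanically. The helper below is hypothetical and ignores crawlers, branching, and failure paths. It only shows the general shape of the resulting ASL.

```python
import json


def chain_to_asl(job_names):
    """Convert an ordered list of Glue Job names into a sequential state machine."""
    if not job_names:
        raise ValueError("at least one job is required")
    states = {}
    for index, job in enumerate(job_names):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job},
        }
        if index + 1 < len(job_names):
            state["Next"] = job_names[index + 1]  # next job runs after this one succeeds
        else:
            state["End"] = True
        states[job] = state
    return {"StartAt": job_names[0], "States": states}


asl = chain_to_asl(["extract", "transform", "load"])
print(json.dumps(asl, indent=2))
```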

Migration is certainly possible, but it is not trivial.
This is why it is important to consider future extensibility and maintainability from the beginning when selecting your orchestration tool.

Conclusion

In this article, I organized the differences between AWS Step Functions and Glue Workflow for ETL orchestration.

To summarize:

Glue Workflow: Glue-native, simple, low-cost, and ideal for rapidly building straightforward ETL pipelines
Step Functions: Better suited for multi-service orchestration, advanced workflow control, observability, and large-scale pipelines

Neither service is universally “better.”
The right choice depends on your use cases, organizational structure, operational requirements, and future scalability needs.

For small-scale ETL pipelines, Glue Workflow is often sufficient. However, as data platforms evolve, requirements such as exception handling, notifications, conditional branching, and integrations with other services tend to grow over time. In many cases, architectures gradually move toward Step Functions as complexity increases.

A practical strategy can be to start simple with Glue Workflow during the early stages, and later migrate to Step Functions when requirements become more sophisticated.

That said, considering future migration costs, building on Step Functions from the beginning can also be a very reasonable approach.

I hope this article helps anyone currently evaluating workflow orchestration options on AWS.

Differences Between Snowflake Editions and Secure Connectivity with AWS

Aki — Fri, 15 May 2026 12:55:34 +0000

Original Japanese article: Snowflakeのエディションごとの違いとAWSとのセキュアな接続方法

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

More and more organizations are adopting Snowflake as their data platform.
However, once you actually start planning an implementation, one question comes up surprisingly often:

“Which Snowflake edition should we choose?”

In particular, many teams struggle to decide between Enterprise Edition and Business Critical Edition.

This becomes even more important when Snowflake is used together with AWS and there is a requirement for secure network connectivity.

Snowflake offers four editions, and higher editions provide stronger capabilities around security, compliance, and availability.
In addition, private connectivity using AWS PrivateLink requires Business Critical Edition or higher.

This means that if you initially start with Enterprise Edition and later realize that you need PrivateLink, migrating afterward can introduce additional operational and architectural effort.

In this article, I would like to briefly organize the differences between Snowflake editions and then explore how to securely connect Snowflake with AWS using Business Critical Edition, especially through AWS PrivateLink.

Differences Between Snowflake Editions (Quick Overview)

Snowflake provides the following four editions.
Each higher edition includes all functionality from the lower editions, while credit pricing increases accordingly.

Edition	Positioning	Major Additional Features
Standard	Entry-level	Core features, Time Travel (1 day), SSO, Network Policies
Enterprise	Large-scale / Production workloads	Multi-cluster warehouses, Materialized Views, Time Travel (up to 90 days), Column- and Row-level security
Business Critical	Regulated industries	Tri-Secret Secure, AWS PrivateLink, Failover, HIPAA / PCI DSS
Virtual Private Snowflake (VPS)	Highest isolation level	Fully isolated Snowflake environment with dedicated infrastructure

Standard Edition

This is the entry-level edition and provides the core Snowflake functionality.

It includes capabilities such as:

SQL processing
Semi-structured data support
Data sharing
1-day Time Travel

For organizations simply looking to use Snowflake as a data warehouse, Standard Edition is often sufficient.

It is generally a good fit for startups or smaller analytics teams that want to begin using Snowflake quickly.

Enterprise Edition

Enterprise Edition adds several features that are effectively essential for production workloads.

Key additions include:

Multi-cluster warehouses (for scaling concurrent users)
Materialized Views
Extended Time Travel (up to 90 days)
Column-level and row-level security
Dynamic Data Masking
Query Acceleration Service
Search Optimization Service

In practice, Enterprise Edition often feels like the real starting point for production Snowflake environments.

As data volume and concurrency increase, multi-cluster warehouses become especially important.

Business Critical Edition

Business Critical Edition is designed for organizations handling regulated or highly sensitive data.

In addition to all Enterprise features, it includes:

AWS PrivateLink / Azure Private Link / Google Private Service Connect
Tri-Secret Secure (dual encryption using customer-managed keys)
Account failover / failback
Client redirect
Compliance support for HIPAA, HITRUST CSF, PCI DSS, FedRAMP

If you are handling sensitive information such as:

PHI (medical data)
PCI-related cardholder data
Personally identifiable information

then Business Critical Edition becomes necessary.

It is also mandatory if your networking requirements specify that connectivity must avoid the public internet and use PrivateLink instead.

Data is one of the most important enterprise assets.
For that reason alone, I often see organizations adopting Business Critical Edition.

Virtual Private Snowflake (VPS)

This is Snowflake’s highest edition and provides a dedicated Snowflake environment.

Infrastructure is completely isolated and hardware resources are not shared with other customers.

It is intended for organizations with extremely strict security requirements, such as:

Financial institutions
Government agencies

I personally have not worked with VPS directly, and pricing/details require contacting Snowflake, so I will omit deeper discussion here.

Which Edition Should You Choose?

From a practical perspective, the decision criteria often look something like this:

Standard: Evaluation, PoC, small analytics teams
Enterprise: General production workloads (this is where many companies start)
Business Critical: Regulated industries, sensitive data, or mandatory PrivateLink requirements
VPS: Financial/government environments requiring complete isolation

One important point is that if you simply decide to “start with Enterprise,” you cannot use PrivateLink.

Because of this, it is safest to validate networking requirements at the beginning of the project.

Upgrading later is possible, but it may require revisiting architecture and pricing assumptions.

Considering Connectivity with AWS

When using Snowflake on AWS, network design—specifically how clients and applications connect—is tightly related to edition selection.

Here, I would like to organize connectivity approaches from both the Enterprise and Business Critical perspectives.

Connectivity with Enterprise Edition

Enterprise Edition does not support AWS PrivateLink, so connectivity to Snowflake is fundamentally internet-based.

Client / AWS Services
        ↓
    Internet
        ↓
    Snowflake

Hearing “internet-based connectivity” may sound concerning at first.
However, even with Enterprise Edition, it is possible to achieve a practical security level by combining multiple controls.

1. Restricting Source IPs with Network Policies

Snowflake network policies can restrict allowed source IP addresses.

By limiting access to known egress IPs such as:

Corporate network addresses
Elastic IPs attached to NAT Gateways

you can significantly reduce the risk of unauthorized access.
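As a small illustration, the statement for such a policy could be generated from a list of known egress ranges, validating each CIDR before it goes into SQL. The policy name and addresses are placeholders taken from documentation ranges.

```python
import ipaddress


def network_policy_sql(policy_name, cidrs):
    """Build a CREATE NETWORK POLICY statement from validated IPv4 CIDR ranges."""
    if not policy_name.isidentifier():
        raise ValueError(f"unexpected policy name: {policy_name!r}")
    # IPv4Network raises ValueError for malformed ranges or ranges with host bits set
    normalized = [str(ipaddress.IPv4Network(cidr)) for cidr in cidrs]
    ip_list = ", ".join(f"'{cidr}'" for cidr in normalized)
    return f"CREATE NETWORK POLICY {policy_name} ALLOWED_IP_LIST = ({ip_list});"


print(network_policy_sql("corp_only", ["203.0.113.10/32", "198.51.100.0/24"]))
# CREATE NETWORK POLICY corp_only ALLOWED_IP_LIST = ('203.0.113.10/32', '198.51.100.0/24');
```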

2. Key Pair Authentication + AWS Secrets Manager

For credential management, password authentication should generally be avoided in favor of key pair authentication.

As of 2026, Snowflake is actively moving away from single-factor password authentication.
For system integrations, Key Pair authentication or OAuth is now the recommended approach.

When connecting from services such as AWS Lambda, storing private keys in AWS Secrets Manager is a practical approach.

Event Source
    ↓
AWS Lambda
    ↓
Secrets Manager
    ↓
Retrieve Private Key
    ↓
Snowflake (TLS over Internet)

I covered this approach in a previous article as well:

Securely Implementing Snowflake AWS Lambda Integration with Key Pair Authentication + Secrets Manager

3. TLS Encryption

All communication with Snowflake is encrypted via TLS.

This means the communication channel itself remains confidential.

In other words, even with Enterprise Edition, combining:

IP restrictions
Key Pair authentication
TLS encryption

can provide a practical level of security.

However, traffic still traverses the public internet.

Limitations of Enterprise Edition

Enterprise Edition cannot satisfy requirements such as:

Audit/compliance mandates requiring “no internet-based connectivity”
Internal security policies mandating PrivateLink
Handling regulated data under HIPAA or PCI DSS
Encrypting data using customer-managed KMS keys (Tri-Secret Secure)

Once these requirements appear, Business Critical Edition or higher becomes necessary.

Connectivity with Business Critical Edition

Business Critical Edition supports AWS PrivateLink, enabling private connectivity between AWS VPCs and Snowflake.

In this architecture, traffic remains entirely within the AWS backbone network and never traverses the public internet.

    AWS VPC
        ↓
VPC Interface Endpoint
        ↓
 AWS PrivateLink
        ↓
   Snowflake VPC

High-Level Setup Procedure

Following the official documentation, the configuration can be summarized in several major steps.

1. Enable PrivateLink on the Snowflake Side

Using the ACCOUNTADMIN role, authorize AWS PrivateLink for the Snowflake account.

First, obtain a federation token using AWS STS.

aws sts get-federation-token --name your-user-name

Then execute authorization from the Snowflake side.

USE ROLE ACCOUNTADMIN;

SELECT SYSTEM$AUTHORIZE_PRIVATELINK(
  '<aws_account_id>',
  '<federated_token>'
);

You can verify authorization with the following function.

SELECT SYSTEM$GET_PRIVATELINK('<aws_account_id>', '<federated_token>');

If the response returns:

Account is authorized for PrivateLink.

then authorization succeeded.

2. Retrieve the VPC Endpoint ID

Retrieve the information required for AWS VPC Endpoint creation.

SELECT SYSTEM$GET_PRIVATELINK_CONFIG();

Take note of the privatelink-vpce-id value returned in the JSON response.

This ID becomes the “service name” when creating the VPC Endpoint on AWS.

3. Create the VPC Endpoint on AWS

Create an AWS VPC Interface Endpoint using the following configuration.

Service Name: the privatelink-vpce-id
VPC: the source VPC
Subnets: multi-AZ deployment is recommended
Security Groups: allow ports 443 and 80 from Lambda/EC2

Port 80 is required for OCSP (certificate revocation checking), so do not forget to allow it.
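Steps 2 and 3 can be tied together in a short script: take the JSON returned by SYSTEM$GET_PRIVATELINK_CONFIG() and build the arguments for EC2's create_vpc_endpoint. The sample JSON, VPC, subnet, and security group IDs below are placeholders. Only the privatelink-vpce-id key is taken from the text above.

```python
import json


def vpc_endpoint_kwargs(privatelink_config_json, vpc_id, subnet_ids, security_group_ids):
    """Build ec2.create_vpc_endpoint arguments from the Snowflake PrivateLink config."""
    config = json.loads(privatelink_config_json)
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": config["privatelink-vpce-id"],
        "SubnetIds": list(subnet_ids),  # ideally one subnet per AZ
        "SecurityGroupIds": list(security_group_ids),  # must allow 443 and 80
    }


# Placeholder values for illustration
sample_config = json.dumps(
    {"privatelink-vpce-id": "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"}
)
kwargs = vpc_endpoint_kwargs(
    sample_config,
    "vpc-0aaa1111bbb22222c",
    ["subnet-0aaa1111", "subnet-0bbb2222"],
    ["sg-0ccc3333"],
)
# With credentials configured:
#   boto3.client("ec2").create_vpc_endpoint(**kwargs)
print(kwargs["ServiceName"])
```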

4. Configure DNS

When using PrivateLink, the Snowflake account URL changes to the following format:

<account_identifier>.privatelink.snowflakecomputing.com

You must create a CNAME record mapping the endpoint returned by SYSTEM$GET_PRIVATELINK_CONFIG() to the DNS name of the AWS VPC Endpoint.

Using a Route 53 Private Hosted Zone is the most common approach.
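For reference, the record could be created with Route 53's change_resource_record_sets. The sketch below builds only the ChangeBatch. The hostnames and hosted zone ID are placeholders, and the private hosted zone is assumed to exist already.

```python
def cname_change_batch(record_name, vpce_dns_name, ttl=60):
    """Build a Route 53 ChangeBatch that points record_name at the VPC Endpoint DNS name."""
    return {
        "Comment": "Snowflake PrivateLink CNAME",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": vpce_dns_name}],
                },
            }
        ],
    }


batch = cname_change_batch(
    "myaccount.privatelink.snowflakecomputing.com",  # placeholder account URL
    "vpce-0123456789abcdef0-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com",
)
# With credentials configured:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z0000000000000", ChangeBatch=batch)
print(batch["Changes"][0]["ResourceRecordSet"]["Name"])
```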

5. Verify Connectivity

Finally, verify connectivity from the client side.

For diagnostics, the SnowCD (Snowflake Connectivity Diagnostic Tool) is useful for validating PrivateLink connectivity.

snowcd <hostfile>

Configuring VPC Endpoints for S3 Access

This is an easy detail to overlook.

Snowflake drivers such as:

JDBC
ODBC
Python Connector

internally access Amazon S3 during data load/unload operations against stages.

Even if Snowflake connectivity itself is private via PrivateLink, S3 traffic may still traverse the public internet unless additional configuration is performed.

Available approaches include:

Creating AWS VPC Interface Endpoints for Snowflake internal stages (recommended)
Creating an S3 Gateway Endpoint to privatize S3 bucket access
Allowing internet-based S3 access (strongly discouraged)

If you want a fully private architecture, Snowflake officially recommends creating VPC Interface Endpoints for internal stages.

Blocking Public Access

After establishing PrivateLink connectivity, you can also block public access from the Snowflake side.

This allows only PrivateLink-based connectivity.

CREATE NETWORK POLICY privatelink_only
  ALLOWED_IP_LIST = ('10.0.0.0/8');

ALTER ACCOUNT SET NETWORK_POLICY = privatelink_only;

Snowflake also provides:

SYSTEM$ENFORCE_PRIVATELINK_ACCESS_ONLY
“Enforce privatelink-only access”

which are also valid approaches.

Combining VPN-based corporate IP ranges with PrivateLink-only access can create an even more secure architecture.

Leveraging Tri-Secret Secure

Business Critical Edition also supports Tri-Secret Secure using AWS KMS customer-managed keys (CMKs).

This mechanism requires both:

Snowflake-managed keys
Customer-managed keys

as an AND condition for decryption.

Even if Snowflake itself were compromised, data could not be decrypted without the customer-managed key.

Combining:

PrivateLink
Tri-Secret Secure

creates a very strong architecture for regulatory compliance.

I have not personally implemented this feature, so I will omit further details here.

Cross-Region Connectivity

AWS PrivateLink is fundamentally designed for same-region connectivity.

However, Business Critical Edition and above also support cross-region connectivity.

For example:

Snowflake account in US-EAST
AWS VPC in AP-NORTHEAST-1 (Tokyo)

can still communicate privately via PrivateLink.

That said, there are several caveats:

PaaS services such as S3 and KMS do not support cross-region PrivateLink
Government and China regions are not supported
“Enable Cross Region Endpoint” must be enabled in the VPC console

In practice, aligning the Snowflake region with the application region generally results in a simpler and easier-to-operate architecture.

Still, for globally distributed data platforms, these considerations become important.

Balancing Edition Selection and Cost

Business Critical Edition provides major security advantages, but the credit cost is roughly 1.3x that of Enterprise Edition.

As rough on-demand reference pricing for US East in 2026:

Standard: approximately $2/credit
Enterprise: approximately $3/credit
Business Critical: approximately $4/credit
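To make the difference concrete, here is the arithmetic using the approximate prices above. The 1,000 credits per month figure is an arbitrary example, and actual prices vary by region and contract.

```python
# Approximate on-demand prices per credit quoted in this article (USD)
PRICE_PER_CREDIT = {
    "Standard": 2.0,
    "Enterprise": 3.0,
    "Business Critical": 4.0,
}


def monthly_cost(edition, credits_per_month):
    """Estimated monthly compute cost for a given credit consumption."""
    return PRICE_PER_CREDIT[edition] * credits_per_month


ratio = PRICE_PER_CREDIT["Business Critical"] / PRICE_PER_CREDIT["Enterprise"]
print(round(ratio, 2))  # 1.33

for edition in PRICE_PER_CREDIT:
    print(edition, monthly_cost(edition, 1000))
```

At 1,000 credits per month, moving from Enterprise to Business Critical adds about $1,000 per month under these assumptions.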

If you have a strict requirement that traffic must never traverse the public internet, Business Critical is effectively the only option.

However, from a practical standpoint, balancing data sensitivity and cost often leads to architectures such as:

Production data warehouse (including sensitive data): Business Critical
Development / testing environments: Enterprise
Dedicated data sharing accounts: Enterprise

Using multiple editions strategically within the same organization can also be a reasonable approach.

Conclusion

In this article, including some reflections from my own experience, I introduced the differences between Snowflake editions and explored secure AWS connectivity using Business Critical Edition.

The key points are:

Snowflake provides four editions (Standard / Enterprise / Business Critical / VPS), with higher editions adding stronger security and compliance capabilities
AWS PrivateLink requires Business Critical Edition or higher, so networking/security requirements should be validated early
Even Enterprise Edition can achieve reasonable security through network policies, Key Pair authentication, and TLS
Business Critical enables private connectivity between AWS VPCs and Snowflake through PrivateLink, fully isolating traffic from the public internet
S3 access must also be privatized, so VPC Endpoints for internal stages should be configured as well
Combining Tri-Secret Secure with PrivateLink enables architectures well suited for regulatory compliance

I think many teams struggle specifically with deciding between Enterprise Edition and Business Critical Edition.

Although edition upgrades are possible later, they can significantly impact both architecture and cost.
For that reason, it is best to organize these requirements carefully during the early stages of requirements definition and architecture design.

I hope this article helps anyone looking to use Snowflake securely on AWS.

What Is Apache Polaris? Why Open Data Catalogs Matter and How to Use Them with AWS

Aki — Sat, 02 May 2026 06:27:16 +0000

Original Japanese article: Apache Polarisとは何か？オープンなデータカタログが求められる理由とAWSとの組み合わせ方を整理する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In recent years, lakehouse architectures centered around Apache Iceberg have been rapidly expanding.

By placing Iceberg tables on object storage such as S3, it has become possible to query the same data from multiple engines such as Athena, Snowflake, Spark, Trino, and Dremio.
As a result, the discussion has shifted from “Where should data be placed, and which engine should be used for analysis?” to “Where should data ownership reside, and which catalog should be used to unify governance?”

Amid this trend, Apache Polaris has been attracting attention.
Apache Polaris is an open-source implementation of the Iceberg REST Catalog, led by Snowflake and donated to the Apache Software Foundation.

Multiple vendors—including Dremio, AWS, Google, Microsoft, and Confluent—are contributing to it, and it is positioned as an “open catalog” that enables cross-platform management of Iceberg tables while avoiding vendor lock-in.

In this article, I would like to think through the following:

What Apache Polaris is
Why open data catalogs are required
Differences from AWS Glue Data Catalog
Differences from Snowflake Horizon Catalog
How responsibilities should be divided when combining with AWS

To state the conclusion up front: Apache Polaris does not compete with AWS Glue Data Catalog or Snowflake Horizon Catalog; they are catalogs that operate at different layers.

It may be easier to understand Apache Polaris as a component that enables an architecture such as:
“The data itself resides in AWS, the catalog is open, and analysis engines are selected based on use cases.”

What is Apache Polaris?

Apache Polaris is an open-source catalog implementation compliant with the Apache Iceberg REST Catalog specification.
It was announced by Snowflake in 2024 and later became an incubation project under the Apache Software Foundation.

The project has now graduated from incubation and has been promoted to a top-level Apache project.

Official site:
https://polaris.apache.org/

What Polaris aims to achieve is a common metadata and governance foundation in a lakehouse centered around Iceberg tables.

A major characteristic is that it is not tied to any specific query engine or cloud vendor, and anyone can access it using the same specification via REST APIs.

Key Features of Apache Polaris

Feature	Description
Implementation of Iceberg REST Catalog	Accessible via standardized REST APIs. Can be directly used from engines such as Spark, Trino, Flink, Snowflake, and Dremio
Multi-catalog architecture	Multiple catalogs can be defined within a single Polaris instance. Enables separation and management by team or business domain
RBAC (Role-Based Access Control)	Provides a permission model combining principals, principal roles, and catalog roles
External catalog integration	Can connect to other catalogs compliant with the Iceberg REST specification (e.g., Nessie, Gravitino)
OSS / Managed support	Can be self-hosted as OSS, or used as managed offerings such as Snowflake Open Catalog or Dremio Catalog

What Apache Polaris Solves

As Apache Iceberg has become more widely adopted, multiple Iceberg-compatible catalogs have emerged, including Hive Metastore, JDBC, Nessie, AWS Glue, and Snowflake.

Since each has its own client libraries and interfaces, the following challenges have arisen:

The need to implement catalog clients for each programming language
Inconsistent access control specifications across catalogs
Difficulty enforcing governance across multiple catalogs
As a result, the overall architecture becomes constrained by the chosen catalog

To solve these challenges, the Iceberg REST Catalog specification was introduced.
Apache Polaris is an open-source implementation of that specification, further enhanced with multi-catalog support and RBAC.

In other words, you can think of it as an open catalog for Apache Iceberg.
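
To make "the same specification via REST APIs" a little more concrete, here is a minimal sketch of how a client might build request paths for an Iceberg REST Catalog. The route shapes (/v1/config, /v1/{prefix}/namespaces, and the table route) follow the Iceberg REST Catalog specification as I understand it; the prefix value, namespace, and table names are hypothetical, and in practice the prefix is whatever the catalog returns from /v1/config.

```python
from urllib.parse import quote

# Multi-level namespaces are joined with the unit separator (0x1F) in REST paths.
NAMESPACE_SEPARATOR = "\x1f"


def config_path() -> str:
    """Path a client calls first to discover catalog defaults and overrides."""
    return "/v1/config"


def namespaces_path(prefix: str) -> str:
    """Path for listing namespaces under a catalog prefix."""
    return f"/v1/{quote(prefix, safe='')}/namespaces"


def table_path(prefix: str, namespace: list[str], table: str) -> str:
    """Path for loading a single table's metadata."""
    ns = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    return f"/v1/{quote(prefix, safe='')}/namespaces/{ns}/tables/{quote(table, safe='')}"


# Hypothetical catalog prefix, namespace, and table name.
print(config_path())
print(namespaces_path("prod_catalog"))
print(table_path("prod_catalog", ["sales", "daily"], "orders"))
```

Because every engine builds the same paths, the catalog client only has to be written once per language rather than once per catalog product.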

Polaris Security Model

The Polaris security model can be organized into the following three concepts:

Principal: An entity representing a user or service. Accesses Polaris via client ID/secret, etc.
Principal Role: A grouping of multiple catalog roles. Assigned to principals
Catalog Role: A set of permissions within a specific catalog. Includes permissions such as TABLE_READ_DATA, TABLE_CREATE, and NAMESPACE_LIST

For example, you can design it such that:

The data_engineer principal role is assigned both write access to prod_catalog and administrative access to dev_catalog
The data_analyst principal role is assigned only read access to prod_catalog

An important point is that RBAC is centralized on the catalog side, eliminating the need to implement access control separately for each engine.
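
The relationship between the three concepts can be sketched as a small in-memory model. This is not the Polaris API, only an illustration of how a permission check resolves from principal to principal role to catalog role. The role and principal names are hypothetical, and TABLE_WRITE_DATA is used here as an illustrative write privilege alongside the ones named above.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRole:
    """A set of privileges scoped to one catalog."""
    name: str
    catalog: str
    privileges: set[str] = field(default_factory=set)


@dataclass
class PrincipalRole:
    """A grouping of catalog roles that is assigned to principals."""
    name: str
    catalog_roles: list[CatalogRole] = field(default_factory=list)


@dataclass
class Principal:
    """A user or service that authenticates to the catalog."""
    name: str
    principal_roles: list[PrincipalRole] = field(default_factory=list)

    def is_allowed(self, catalog: str, privilege: str) -> bool:
        # A privilege is granted if any catalog role reachable through
        # any of the principal's roles carries it for that catalog.
        return any(
            cr.catalog == catalog and privilege in cr.privileges
            for pr in self.principal_roles
            for cr in pr.catalog_roles
        )


# Hypothetical roles mirroring the example in the text.
prod_writer = CatalogRole("prod_writer", "prod_catalog", {"TABLE_READ_DATA", "TABLE_WRITE_DATA"})
dev_admin = CatalogRole("dev_admin", "dev_catalog", {"TABLE_CREATE", "TABLE_READ_DATA", "NAMESPACE_LIST"})
prod_reader = CatalogRole("prod_reader", "prod_catalog", {"TABLE_READ_DATA"})

data_engineer = PrincipalRole("data_engineer", [prod_writer, dev_admin])
data_analyst = PrincipalRole("data_analyst", [prod_reader])

etl_service = Principal("etl_service", [data_engineer])
bi_user = Principal("bi_user", [data_analyst])

print(etl_service.is_allowed("prod_catalog", "TABLE_WRITE_DATA"))
print(bi_user.is_allowed("prod_catalog", "TABLE_WRITE_DATA"))
```

The check lives entirely on the catalog side, so Spark, Trino, or Snowflake connecting as the same principal would get the same answer.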

Why Open Data Catalogs Are Required

Let us start with why open data catalogs are required in the first place.

Separation of Data and Engines Has Become a Premise

The greatest value of open table formats such as Apache Iceberg is the ability to separate data storage from query engines.

It has become possible to freely choose engines such as Athena, Glue, Spark, Snowflake, Dremio, and DuckDB depending on the use case when querying Iceberg tables on S3.

As a result, the key question in data platforms has shifted from “Which product should we use?” to “Where should data ownership reside, and who should be responsible for governance at which layer?”

However, while engines can now be freely selected, the remaining challenge is the catalog.

What Happens When Catalogs Are Tied to Engines

When using catalogs tightly coupled with query engines, the following situations tend to occur:

The data itself is open (S3 + Iceberg), but the catalog is tied to a specific engine
You want to reference the same table from another engine, but the catalog does not support it
Access control is fragmented across engines, making governance difficult
Every time the catalog is changed, all engine-side configurations must be redone

In other words, even if storage and formats are open, a closed catalog significantly reduces the benefits of a lakehouse.

Especially in today’s environments where multi-cloud, multiple products, and multiple engines are commonly combined, how to unify catalogs becomes a key challenge.

Requirements for an Open Catalog

Based on this background, lakehouse catalogs are expected to meet the following requirements:

Requirement	Description
Compliance with standard APIs	Support vendor-neutral APIs such as the Iceberg REST Catalog specification
Multi-engine support	Usable across engines such as Spark, Trino, Flink, Snowflake, and Dremio
Centralized RBAC	Define permissions at the catalog level and apply consistent governance across all engines
Multi-cloud / hybrid	Not dependent on a specific cloud and capable of running on-premises when necessary
OSS sustainability	Not discontinued based on vendor decisions; continuously developed in a community-driven manner

Apache Polaris is a catalog designed to satisfy these requirements.

Differences from AWS Glue Data Catalog

When building on AWS, AWS Glue Data Catalog is often positioned as the central data catalog.
Here, we will summarize the differences between AWS Glue Data Catalog and Apache Polaris.

Positioning of AWS Glue Data Catalog

AWS Glue Data Catalog is a core metadata management service in AWS.

It is natively integrated with AWS analytics services such as Athena, Glue, Redshift Spectrum, and EMR, and plays the role of managing data on S3 as a catalog.

As discussed in previous articles, Glue Data Catalog is an excellent technical catalog used by data platforms.

Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies

Functional Comparison

Aspect	AWS Glue Data Catalog	Apache Polaris
Offering	AWS-managed (closed)	OSS / Managed (Snowflake Open Catalog, Dremio Catalog, etc.)
API	AWS proprietary API (recently also provides Iceberg REST compatibility)	Iceberg REST Catalog specification (open)
Cloud support	AWS	Multi-cloud / on-prem
Engines	Athena, Glue, Redshift, EMR, Spark	Spark, Trino, Flink, Snowflake, Dremio, StarRocks, DuckDB
Multi-catalog	Account-level (logical separation via Lake Formation)	Native support for multiple catalogs within a single instance
Access control	IAM + Lake Formation	Built-in RBAC (Principal / Principal Role / Catalog Role)
External catalog integration	Limited	Can integrate with Iceberg REST-compliant catalogs (Nessie, Gravitino, etc.)
Non-Iceberg formats	Supports Hive, JSON, CSV, Parquet, etc.	Currently Iceberg-centric (Generic Table support on roadmap)

How to Interpret the Difference

Rather than being in a competitive relationship, it is easier to understand them as catalogs with different roles.

AWS Glue Data Catalog: Strong integration with AWS services, making it the primary choice for workloads completed within AWS. It supports a wide range of data lake formats beyond Iceberg and features such as S3 crawling.
Apache Polaris: A catalog that enables governance across multiple engines and clouds based on the industry-standard Iceberg REST API. It is effective when you want to enforce consistent RBAC across engines outside AWS (e.g., Snowflake, Dremio).

In summary:

If your use case is AWS-contained and includes formats beyond Iceberg, Glue Data Catalog is a practical choice
If you want common management of Iceberg across multiple engines and a vendor-neutral catalog layer, Polaris is suitable

Differences from Snowflake Horizon Catalog

These two are often confused, so let’s clarify the difference between Snowflake Horizon Catalog and Apache Polaris.
Note that Horizon Catalog is a different product from “Snowflake Open Catalog,” despite the similar name.

What is Snowflake Horizon Catalog?

Snowflake Horizon Catalog is a data governance and discovery suite provided by Snowflake.

For data managed within Snowflake (Snowflake-managed tables, stages, views, shared data, etc.), it provides:

Data discovery (search, tagging, descriptions)
Lineage
Data quality monitoring
Masking policies and row access policies
Automatic classification of sensitive data
Compliance management

In terms of positioning, it is similar to Amazon DataZone + Lake Formation + Glue Data Quality in AWS.

In other words, it is the layer responsible for cataloging and governance so that people can discover, understand, and trust data.

What is Snowflake Open Catalog (Relation to Polaris)

On the other hand, Snowflake Open Catalog is a managed offering of Apache Polaris.

Although the name is confusing, this is the lakehouse catalog that serves as an Iceberg REST Catalog.

In Snowflake’s model:

Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake-managed data
Snowflake Open Catalog (= Apache Polaris): Lakehouse catalog layer for open table formats such as Iceberg

Functional Comparison

Aspect	Snowflake Horizon Catalog	Apache Polaris
Primary target	Data in Snowflake (internal tables, shared data, etc.)	Iceberg (Generic Table support for other formats is planned)
Layer	Business catalog / governance layer	Lakehouse catalog layer (technical catalog)
Offering	Built into Snowflake (closed)	OSS / Managed
API	Snowflake proprietary	Iceberg REST Catalog specification (open)
Data location	Snowflake internal storage or recognized external data	Iceberg tables on cloud storage
Scope	Within Snowflake organizations	Across multiple engines and clouds

How to Interpret the Difference

Again, these are not in opposition but complementary.

Snowflake Horizon Catalog: Upper layer that provides data to business users, handling discovery, quality, masking, etc.
Apache Polaris: Lower layer (metadata foundation) that exposes Iceberg tables to multiple engines

Conceptually, the structure looks like this:

┌──────────────────────────────────────────────┐
│  Business Catalog / Governance Layer         │ ← Snowflake Horizon Catalog
│  (Discovery / Lineage / Quality / Masking)   │   Amazon DataZone, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Lakehouse Catalog Layer                     │ ← Apache Polaris
│  (Iceberg REST Catalog / RBAC)               │   AWS Glue Data Catalog, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Data Lake (S3 / GCS / Azure Blob)           │
│  Iceberg / Parquet                           │
└──────────────────────────────────────────────┘

Framing Snowflake Horizon Catalog and Apache Polaris as an either-or choice feels unnatural, but once they are viewed as different layers, the division of responsibilities becomes clear.

How to Combine with AWS

From here, we will consider cases where Apache Polaris is introduced into an AWS environment.
Since AWS already has a powerful catalog called Glue Data Catalog, it is important to clarify how Polaris should be positioned and who is responsible for what.

Expected Architecture

Representative configurations can be organized into the following three patterns.

Pattern 1: AWS-only (Glue Data Catalog-centered)

This is the simplest configuration.
It is a typical setup using S3 + Iceberg + Glue Data Catalog, along with Athena / Glue / Redshift Spectrum.

Catalog: AWS Glue Data Catalog
Governance: IAM + Lake Formation
Query engines: Athena, Redshift Spectrum, Glue ETL, EMR

If everything is completed within AWS and there is no strong need to share with external engines, this configuration remains the most practical.
There is no need to forcibly introduce Apache Polaris.

Pattern 2: AWS + Snowflake (Using Polaris as a shared catalog foundation)

This configuration is effective when you want to reference the same Iceberg tables from both AWS (e.g., Athena) and Snowflake.

Data storage: S3 + Iceberg
Catalog: Apache Polaris (OSS self-hosted or Snowflake Open Catalog)
AWS side: Reference Polaris as an Iceberg REST Catalog (via Spark or third-party tools)
Snowflake side: Connect to Polaris using External Volume and Catalog Integration (CATALOG_SOURCE = POLARIS)

From the Snowflake side, Polaris can be referenced directly as follows:

CREATE OR REPLACE CATALOG INTEGRATION polaris_catalog_int
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://<polaris-host>/api/catalog'
    CATALOG_NAME = '<your_polaris_catalog>'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<polaris_client_id>'
    OAUTH_CLIENT_SECRET = '<polaris_client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
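
On the AWS side, a Spark job (for example on EMR or Glue) could point at the same Polaris catalog through Iceberg's REST catalog support. The sketch below only assembles the Spark configuration as a dictionary. The property names are the standard Iceberg Spark catalog settings as far as I know, while the catalog alias, URI, and credentials are placeholders, and the Iceberg Spark runtime JAR is assumed to be on the classpath.

```python
def polaris_spark_conf(
    catalog_alias: str,
    uri: str,
    warehouse: str,
    client_id: str,
    client_secret: str,
) -> dict[str, str]:
    """Build Spark settings that register a Polaris catalog as an Iceberg REST catalog."""
    base = f"spark.sql.catalog.{catalog_alias}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        base: "org.apache.iceberg.spark.SparkCatalog",
        f"{base}.type": "rest",
        f"{base}.uri": uri,
        # For Polaris, the warehouse is the name of the Polaris catalog.
        f"{base}.warehouse": warehouse,
        # OAuth client credentials in "id:secret" form (placeholders here).
        f"{base}.credential": f"{client_id}:{client_secret}",
        f"{base}.scope": "PRINCIPAL_ROLE:ALL",
    }


# Placeholder values; replace them with your own Polaris endpoint and principal.
conf = polaris_spark_conf(
    catalog_alias="polaris",
    uri="https://<polaris-host>/api/catalog",
    warehouse="<your_polaris_catalog>",
    client_id="<polaris_client_id>",
    client_secret="<polaris_client_secret>",
)
for key, value in conf.items():
    print(f"{key}={value}")
```

Each key/value pair can then be passed to SparkSession.builder.config(key, value), after which tables are addressable as polaris.<namespace>.<table>. In a real job the client secret should come from a secret store rather than from source code.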

Pattern 3: Multi-engine / multi-cloud configuration

In addition to Snowflake, this configuration includes multiple engines such as Dremio, Databricks, Trino, and Flink.

In this case, all engines reference Polaris as a common Iceberg REST Catalog.

Data storage: S3 (and other cloud storage if needed)
Catalog: Apache Polaris (center of governance)
Query engines: Snowflake, Dremio, Spark, Trino, Flink, etc.
Governance: Polaris provides unified RBAC across all engines

How to Think About Responsibility Separation

This is the key point.
When combining Polaris, AWS, Snowflake, and others, it is important to clearly define who is responsible for which layer.

Layer	Primary Owner	Notes
Data storage (files)	AWS (S3)	Storage location of the data. Single Source of Truth
Storage access control	AWS (IAM)	Access permissions to S3 buckets/prefixes are defined on the AWS side
Table metadata	Apache Polaris	Source of Truth for Iceberg metadata such as schema, snapshots, partitions
Table-level RBAC	Apache Polaris	Applies consistent permission rules across engines
ETL / pipelines	AWS Glue / Lambda / EMR / Spark	Responsible for ingestion and transformation
Query execution	Athena / Snowflake / Dremio / Spark	Engines selected based on use case
Business catalog / discovery	Snowflake Horizon Catalog / Amazon DataZone	Higher-layer features for search, lineage, quality for users
Data quality	Glue Data Quality / Snowflake DMF	Implemented at engine or quality service layer

What is especially important is the three-layer separation:

Data resides in AWS, the catalog is Polaris, and usage is handled by each engine

By making this separation explicit:

AWS can focus on storage and IAM management
Polaris can focus on metadata and access control
Each query engine can focus on its strengths

Considerations When Adopting Polaris

Polaris is powerful, but there are also important considerations:

Operational cost when self-hosting OSS: Running on EKS or EC2 requires a metastore (e.g., PostgreSQL), authentication infrastructure, monitoring, and upgrade handling
Managed services are often more practical: Using Snowflake Open Catalog or Dremio Catalog significantly reduces operational burden
Less seamless integration with AWS services compared to Glue: For AWS-native services such as Athena, Redshift, and QuickSight, using Glue Data Catalog is far more straightforward
Need to avoid double governance: If IAM policies on S3 and RBAC in Polaris are inconsistent, troubleshooting becomes complex

In other words, when deciding whether to adopt Apache Polaris in an AWS environment, it is realistic to evaluate based on:

Whether multi-engine requirements exist
The organization’s stance on vendor lock-in
Whether operational cost is acceptable (or managed services can be used)

A Practical Approach

Personally, when considering Polaris in an AWS environment, the following phased approach is practical:

Build a lakehouse within AWS using Glue Data Catalog + Iceberg
When integration with other engines such as Snowflake becomes necessary, consider introducing an Iceberg REST layer
At that point, compare “Glue Iceberg REST endpoint,” “Apache Polaris OSS,” and “Snowflake Open Catalog” based on requirements
If multi-engine / multi-cloud requirements become clear, redesign with Polaris (especially managed) at the center

Rather than designing with Polaris from the beginning, it is often more practical to replace the catalog layer with an open one when requirements mature.

Conclusion

In this article, we summarized the key points around Apache Polaris.

In the world of data platforms, storage and formats have become open, but a closed catalog still significantly reduces the benefits of a lakehouse.

Therefore, there is a need for an open catalog that complies with the Iceberg REST Catalog specification and enables unified governance across multiple engines and clouds.
Apache Polaris is designed to fulfill exactly that role.

However, it is important to think not in terms of “which one to choose” among Polaris, AWS Glue Data Catalog, and Snowflake Horizon Catalog, but rather which layer each is responsible for:

AWS Glue Data Catalog: Technical catalog within AWS (still the primary choice for AWS-only workloads)
Apache Polaris: Lakehouse catalog centered on Iceberg, shared across multiple engines
Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake users

Even when combining with AWS, by consciously separating responsibilities as
“data in AWS, catalog in Polaris, analytics in engines, business catalog in another layer”,
you can design an architecture that leverages the strengths of each.

Going forward, lakehouse architectures are expected to increasingly adopt vendor-neutral designs.
Apache Polaris is likely to become an important component supporting that openness.

I hope this article will be helpful for those considering Apache Polaris or designing lakehouse architectures across multiple platforms such as AWS and Snowflake.

Lightweight ETL on AWS Lambda Using DuckDB and Snowflake Connector

Aki — Sat, 04 Apr 2026 13:54:10 +0000

Original Japanese article: AWS Lambda × DuckDB × Snowflake ConnectorによるETLの実装

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In my previous article, I introduced how to connect to Snowflake from AWS Lambda using Key Pair authentication.

Securely Implementing Snowflake AWS Lambda Integration with Key Pair Authentication + Secrets Manager

This time, I would like to try the event-driven data ingestion approach that I introduced in the previous article.

In this article, I will implement an event-driven ETL pipeline that uses DuckDB on AWS Lambda to perform lightweight transformations on Parquet files stored in Amazon S3 and then load the processed data into Snowflake.

In addition, during the implementation process, I encountered an interesting limitation where write_pandas fails when writing to a Catalog-Linked Database. I will also summarize the root cause and the workaround.

Why Snowpipe Is Not Enough

Snowpipe is a very convenient feature for automatic data ingestion.

However, it has limitations when it comes to data transformation and complex filtering.

In other words, when you need preprocessing, filtering, or the integration of multiple events, you need to choose another approach.

In such cases, AWS Lambda becomes a strong option due to its high flexibility.

Architecture

A Parquet file is uploaded to S3
Lambda is triggered by the S3 event
DuckDB reads the data and performs the required transformations
snowflake.connector writes the data into Snowflake
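
The first two steps hinge on the shape of the S3 event that Lambda receives. As a minimal sketch, the helper below (a hypothetical function, not part of the implementation in this article) extracts the s3:// paths from an event. Object keys arrive URL-encoded in S3 event notifications, so they are decoded before the path is built; the bucket and key in the sample event are made up.

```python
from urllib.parse import unquote_plus


def s3_paths_from_event(event: dict) -> list[str]:
    """Return the s3:// path of every object referenced in an S3 event."""
    paths = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification (a space arrives as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths


# Trimmed-down sample event with a made-up bucket and key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "landing/yellow+tripdata/2024-01.parquet"},
            }
        }
    ]
}

print(s3_paths_from_event(sample_event))
```

Looping over all records, rather than reading only the first one, also covers the case where a single invocation carries more than one object.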

The two key libraries used in this implementation are shown below.

DuckDB

DuckDB is an embedded database engine designed for OLAP (Online Analytical Processing).

Because DuckDB is extremely lightweight and supports in-memory processing, it can run efficiently even in a simple execution environment such as AWS Lambda.

It performs particularly well on batch workloads such as data analytics and ETL processing.

In addition, it enables SQL-based filtering and lightweight data transformations, allowing for intuitive implementations.

https://duckdb.org/

Snowflake Connector

Snowflake Connector for Python is a library that provides an interface for connecting to Snowflake and executing all standard operations.

By using this library, it becomes possible to operate Snowflake from runtime environments such as Lambda.

https://docs.snowflake.com/en/developer-guide/python-connector/python-connector

Sample Code

In the sample code below, WHERE VendorID = 1 is added as an ETL filter.

By performing filtering and data transformation inside Lambda, highly flexible preprocessing becomes possible.

import json
from urllib.parse import unquote_plus

import boto3
import duckdb
import snowflake.connector
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization
from snowflake.connector.pandas_tools import write_pandas

# Name of the Secrets Manager secret that holds the Snowflake key pair and connection settings.
SECRET_ID = "snowflake-keypair"


def get_secret():
    # Fetch the secret and parse its JSON payload.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=SECRET_ID)
    return json.loads(response["SecretString"])


def lambda_handler(event, context):
    conn = None
    duckdb_connection = None

    try:
        # Lambda only allows writes under /tmp, so point DuckDB's home directory there
        # before installing the httpfs extension used for S3 access.
        duckdb_connection = duckdb.connect(database=":memory:")
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs")
        duckdb_connection.execute("LOAD httpfs")

        # Object keys in S3 event notifications are URL-encoded, so decode them first.
        s3_bucket = event["Records"][0]["s3"]["bucket"]["name"]
        s3_object_key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])

        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"
        print(f"S3 input path: {s3_input_path}")


        query = f"""
            SELECT *
            FROM read_parquet('{s3_input_path}')
            WHERE VendorID = 1
        """

        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        print(f"DuckDB filtered rows: {result_arrow_table.num_rows}")

        secret = get_secret()

        private_key_obj = serialization.load_pem_private_key(
            secret["privateKey"].encode("utf-8"),
            password=None,
            backend=default_backend()
        )

        conn = snowflake.connector.connect(
            user=secret["user"],
            account=secret["account"],
            private_key=private_key_obj,
            role=secret.get("role"),
            warehouse=secret.get("warehouse"),
            database=secret.get("database"),
            schema=secret.get("schema")
        )

        cur = conn.cursor()
        cur.execute(f"USE DATABASE {secret['database']}")
        cur.execute(f"USE SCHEMA {secret['schema']}")
        cur.close()

        import pandas as pd

        df = result_arrow_table.to_pandas()

        df["tpep_pickup_datetime"] = pd.to_datetime(
            df["tpep_pickup_datetime"],
            unit="us"
        ).dt.strftime("%Y-%m-%d %H:%M:%S")

        df["tpep_dropoff_datetime"] = pd.to_datetime(
            df["tpep_dropoff_datetime"],
            unit="us"
        ).dt.strftime("%Y-%m-%d %H:%M:%S")

        success, nchunks, nrows, _ = write_pandas(
            conn,
            df,
            table_name="YELLOW_TRIPDATA"
        )

        print(
            f"Snowflake write success={success}, "
            f"rows={nrows}, chunks={nchunks}"
        )

        return {
            "statusCode": 200,
            "body": (
                f"Processed {result_arrow_table.num_rows} rows "
                f"and wrote {nrows} rows to Snowflake."
            )
        }

    except Exception as e:
        print(f"An error occurred: {e}")
        import traceback
        traceback.print_exc()

        return {
            "statusCode": 500,
            "body": str(e)
        }

    finally:
        if conn:
            conn.close()

        if duckdb_connection:
            duckdb_connection.close()

Execution Result

The data was successfully written to Snowflake.

Switching the Destination to a Catalog-Linked Database

What happens if we try writing to an Iceberg table in a Catalog-Linked Database, which I introduced in a previous article?

Let’s test it.

AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns

A Write Error Occurs

When attempting to write to the Catalog-Linked Database, the following error occurred:

{
  "statusCode": 500,
  "body": "093678 (0A000): SQL Compilation Error: This operation is not supported in a catalog-linked database."
}

Why Writing to a Catalog-Linked Database Fails

This happens because of the interaction between write_pandas in the Snowflake Connector and the constraints of a Catalog-Linked Database.

Internally, write_pandas creates a temporary stage.

Python Connector API

Writes a pandas DataFrame to a table in a Snowflake database.
To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO <table> command to copy the data from the files to the table.

You can use some of the function parameters to control how the PUT and COPY INTO <table> statements are executed.

However, stage creation is not supported in a Catalog-Linked Database.

Considerations for using a catalog-linked database for Iceberg tables

You can create schemas, externally managed Iceberg tables, or database roles in a catalog-linked database. Creating other Snowflake objects isn't currently supported.

This conflict causes write_pandas to fail with the error:

This operation is not supported in a catalog-linked database.

More specifically, the temporary stage created internally falls under the category of “other Snowflake objects,” so the error occurs at the point where CREATE TEMPORARY STAGE is executed.

That said, there is a workaround.

How to Write to a Catalog-Linked Database

A relatively simple approach is to use an INSERT statement directly.

Here is an example implementation:

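        # Insert in batches of 1,000 rows instead of calling write_pandas.
        columns = ", ".join(df.columns)
        placeholders = ", ".join(["%s"] * len(df.columns))
        sql = f"INSERT INTO {secret['schema']}.YELLOW_TRIPDATA ({columns}) VALUES ({placeholders})"

        # The cursor used for USE DATABASE / USE SCHEMA was already closed, so open a new one.
        cur = conn.cursor()
        for i in range(0, len(df), 1000):
            chunk = df.iloc[i:i+1000]
            cur.executemany(sql, chunk.values.tolist())
        cur.close()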
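
The batching logic above can also be factored into two small helpers that are easy to test without a Snowflake connection. This is a sketch under a few assumptions: the helper names and the table and column names are hypothetical, and the %s placeholders rely on the connector's default pyformat parameter style.

```python
from typing import Iterator, Sequence


def build_insert_sql(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized INSERT statement with one %s placeholder per column."""
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def batches(rows: Sequence[Sequence], size: int) -> Iterator[Sequence[Sequence]]:
    """Yield consecutive slices of rows, each holding at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical table and columns, with 2,500 dummy rows.
sql = build_insert_sql("PUBLIC.YELLOW_TRIPDATA", ["VENDORID", "FARE_AMOUNT"])
rows = [(1, float(i)) for i in range(2500)]

print(sql)
print([len(batch) for batch in batches(rows, 1000)])
```

Inside the Lambda handler, each batch would then be passed to cur.executemany(sql, batch). Row-by-row INSERTs are slower than a staged bulk load, so this approach fits the lightweight volumes this article targets.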

Another option is to create a stage in a different database and execute the INSERT through that route.

Conclusion

In this article, I implemented a lightweight event-driven ETL pipeline triggered by S3 events using AWS Lambda, DuckDB, and the Snowflake Connector.

By using DuckDB inside Lambda, I was able to perform SQL-based filtering and lightweight transformations directly on Parquet files stored in S3, and successfully load the processed results into Snowflake.

In addition, I confirmed an important limitation: when using write_pandas against a Catalog-Linked Database (Iceberg), the write fails because the connector internally creates a temporary stage.

Although there are some constraints, combining DuckDB and the Snowflake Connector enables the construction of a low-cost and flexible data processing pipeline for Snowflake.

The key point is to clearly understand how Snowflake manages Iceberg tables.

It is important to determine whether the table is a Snowflake-managed Iceberg table or connected through mechanisms such as a Catalog-Linked Database, and to properly understand that structure.

In any case, the combination of Snowflake and Iceberg is an extremely powerful option for building a Lakehouse architecture.

I hope this article will be helpful for those considering lightweight data processing and real-time ETL pipelines with AWS and Snowflake when working with Iceberg tables.
