Governance & Cross-Platform Access: Lake Formation, PII Anonymization, and Multi-Engine Reality for S3 Tables

Previously...

In Part 1, we built the metadata catalog. In Part 2, we added AI classification and vector search. Now we need to answer the hard questions:

  • Who can see what? (governance)
  • What about PII? (anonymization)
  • Can Databricks/Snowflake access this? (cross-platform)

Lake Formation: Governance on Unstructured Data

The Problem

Unstructured data on NAS storage may be well protected at the file-system layer, but it is often not consistently classified, searchable, or governed from analytics and AI workflows:

  • No unified classification → you may not know what's sensitive across the entire corpus
  • File-system permissions exist, but analytics/AI tools can't leverage them for discovery
  • Audit trails may exist at the file-system layer, but they are often not unified with analytics and AI query activity

The Solution

With metadata in S3 Tables (Iceberg), Lake Formation provides:

┌───────────────────────────────────────────────────┐
│  Lake Formation                                   │
│                                                   │
│  Table-level:  SELECT, DESCRIBE                   │
│  Column exposure: controlled via Athena Views     │
│                   (hide embedding_vector, paths)  │
│  Row filtering: WHERE sensitivity_level = 'public'│
│  Audit:        CloudTrail logs metadata queries   │
└───────────────────────────────────────────────────┘

Verified: Access Control in Action

Step 1: Authorized user queries metadata
  → ✅ SUCCEEDED (3 rows returned)

Step 2: Revoke SELECT permission
  → 🔒 BLOCKED: "Column 'file_name' cannot be resolved
     or requester is not authorized"

Step 3: Restore permission
  → ✅ SUCCEEDED (access restored)

Step 4: CloudTrail audit
  → All queries logged with user identity and timestamp

In this PoC, every Athena query against the metadata table passed through Lake Formation authorization and was recorded in CloudTrail, so no tested metadata query path bypassed governance. Raw file access remains governed separately by FSx for ONTAP file-system permissions, S3 Access Point policies, and application-specific access paths.
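
The grant/revoke cycle in Steps 2 and 3 can be sketched with boto3 as follows. This is a minimal sketch, not the exact script used in the PoC: the function names are hypothetical, the catalog, database, and table names are taken from the CLI examples later in this article, and the principal ARN is a placeholder you supply.

```python
from typing import Any, Dict

# Catalog ID as used elsewhere in this article; <ACCOUNT_ID> is a placeholder.
CATALOG_ID = "<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog"


def table_permission_request(principal_arn: str,
                             catalog_id: str = CATALOG_ID,
                             database: str = "metadata",
                             table: str = "unstructured_files") -> Dict[str, Any]:
    """Build the kwargs shared by grant_permissions and revoke_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"CatalogId": catalog_id,
                               "DatabaseName": database,
                               "Name": table}},
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def set_table_access(principal_arn: str, allow: bool,
                     region: str = "ap-northeast-1") -> None:
    """Grant (allow=True) or revoke (allow=False) table-level access."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("lakeformation", region_name=region)
    request = table_permission_request(principal_arn)
    if allow:
        client.grant_permissions(**request)
    else:
        client.revoke_permissions(**request)
```

Calling set_table_access(arn, allow=False) and then re-running the Athena query should reproduce the blocked result from Step 2, assuming the principal has no other grant on the table.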

Lake Formation Governance Status

| Capability | Status | Notes |
|---|---|---|
| Table-level SELECT / DESCRIBE | ✅ Verified | Grant/revoke works correctly |
| Athena query governance | ✅ Verified | Unauthorized access blocked |
| CloudTrail audit logging | ✅ Verified | All queries logged with user identity |
| Column-level exclusion (ColumnWildcard) | ⚠️ Failed | On tested S3 Tables federated catalog path |
| Row-level filtering / LF-Tags | 📋 Design pattern | Taxonomy defined, needs validation |
| Column exposure via Athena Views | ✅ Workaround | Recommended alternative to column-level grants |

Observed Limitation: Column-Level Grants on This S3 Tables Federated Catalog Path

In this PoC, table-level Lake Formation SELECT grants worked as expected. However, column exclusion grants using ColumnWildcard with ExcludedColumnNames returned InvalidInputException: Permissions modification is invalid against the s3tablescatalog/... federated catalog path we tested.

AWS documentation describes table, column, and row-level permissions for S3 Tables integrated with Lake Formation. Therefore, treat this as an observed limitation in our specific validation path (CLI command, region, catalog ID, engine version), not a confirmed general product limitation. The exact error and test conditions are recorded in the verification evidence.
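
For reference, a column-exclusion grant of the kind described above has roughly the following request shape in boto3. This is a reconstruction for illustration, not the exact command recorded in the verification evidence; the function name is hypothetical, and the excluded column (embedding_vector) and principal ARN are assumed example values.

```python
from typing import Any, Dict, Iterable


def column_exclusion_request(principal_arn: str,
                             excluded_columns: Iterable[str],
                             catalog_id: str = "<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog",
                             database: str = "metadata",
                             table: str = "unstructured_files") -> Dict[str, Any]:
    """Build kwargs for lakeformation.grant_permissions granting SELECT on
    every column except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                # An empty ColumnWildcard means "all columns"; ExcludedColumnNames
                # removes the listed columns from the grant.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


request = column_exclusion_request(
    "arn:aws:iam::<ACCOUNT_ID>:role/analyst",  # placeholder principal
    ["embedding_vector"],
)
# boto3.client("lakeformation").grant_permissions(**request)
# In our tested path, a grant of this kind returned InvalidInputException.
```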

Workaround: Create Athena Views that expose only permitted columns:

-- View for general users (no embeddings, no PII paths)
CREATE VIEW metadata.public_files AS
SELECT file_id, file_name, file_type, classification, confidence_score
FROM "s3tablescatalog/fsxn-metadata-catalog"."metadata"."unstructured_files"
WHERE is_deleted = false AND sensitivity_level = 'public';

-- Apply Lake Formation on the view
-- Users query the view, not the base table

Governance model choice:

  • Simple use cases: table/column-level permissions suffice.
  • Dynamic, attribute-based access (e.g., "only files classified as 'public'"): use LF-Tags.
  • Enterprise SSO integration: combine with IAM Identity Center.
  • Enterprise governance: map sensitivity_level, path_classification, tenant_id, and pii_status to LF-Tags. See governance/lf-tag-taxonomy.yaml.

Untested alternative: Registering the S3 Tables table in a standard (non-federated) Glue Catalog may enable column-level permissions. This requires manual Iceberg metadata location configuration and has not been verified.

PII Detection: English + Japanese

The Challenge

Amazon Comprehend's detect_pii_entities API supports only English and Spanish. For Japanese PII (names, addresses, My Number), we need a different approach.

Dual-Engine Architecture

| Language | Engine | Detectable PII | Latency | Cost |
|---|---|---|---|---|
| English | Amazon Comprehend | NAME, EMAIL, PHONE, ADDRESS, SSN, CREDIT_CARD, DATE_TIME | ~200ms | $0.0001/100 chars |
| Japanese | Bedrock Claude | 氏名, メール, 電話, 住所, マイナンバー, クレジットカード, 生年月日 | ~2-5s | ~$0.003/request |

Data privacy note: When using Bedrock Claude for PII detection, document text is sent to the Bedrock API. Per AWS's data privacy policy, Bedrock does not store or use your inputs/outputs to train models. For highly sensitive workloads, consider VPC endpoints and AWS PrivateLink for Bedrock access.
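
Routing between the two engines can be as simple as checking for Japanese script. The sketch below is an assumed heuristic, not the PoC's actual router: the function names and engine labels are hypothetical, and the CJK ideograph range also matches Chinese text, so a production router would likely use a proper language detector.

```python
def contains_japanese(text: str) -> bool:
    """Return True if the text has any Hiragana, Katakana, or CJK ideographs."""
    for ch in text:
        cp = ord(ch)
        if (0x3040 <= cp <= 0x309F          # Hiragana
                or 0x30A0 <= cp <= 0x30FF   # Katakana
                or 0x4E00 <= cp <= 0x9FFF): # CJK Unified Ideographs
            return True
    return False


def choose_pii_engine(text: str) -> str:
    """Send Japanese text to Bedrock Claude and everything else to Comprehend."""
    return "bedrock-claude" if contains_japanese(text) else "comprehend"
```

Mixed-language documents go to the slower engine under this rule, which trades cost for coverage.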

Japanese PII Detection (Verified)

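import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

# Bedrock Claude detects Japanese PII via prompt
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",  # required for Claude on Bedrock
        "max_tokens": 1024,                         # required; caps response length
        "messages": [{"role": "user", "content":
            f"Detect all PII in this text. Return JSON array: "
            f'[{{"type":"...","value":"...","begin":N,"end":N}}]\n\n'
            f"Text:\n{japanese_text}"}]
    })
)
# The model's reply is the text of the first content block
reply = json.loads(response["body"].read())["content"][0]["text"]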

Results on a controlled synthetic sample (not real personal data):

| PII Type | Detected Value |
|---|---|
| NAME | 山田太郎 |
| EMAIL | taro.yamada@example.co.jp |
| PHONE | 090-1234-5678 |
| ADDRESS | 〒150-0002 東京都渋谷区渋谷1-2-3 |
| MY_NUMBER | 1234 5678 9012 |
| CREDIT_CARD | 4111-1111-1111-1111 |
| DATE_OF_BIRTH | 1985年3月15日 |

Anonymization Pipeline

Original document
       │
       ▼
PII Detection (Comprehend or Bedrock)
       │
       ├─ No PII → has_pii = false (no action needed)
       │
       └─ PII found → has_pii = true
                          │
                          ▼
              Redaction: all PII → [REDACTED]
                          │
                          ▼
              Store anonymized version
              anonymization_status = "completed"

Before:

Name: Taro Yamada
Email: taro.yamada@example.com
Phone: 090-1234-5678
SSN: 123-45-6789

After:

Name: [REDACTED]
Email: [REDACTED]
Phone: [REDACTED]
SSN: [REDACTED]
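
The redaction step can be sketched as a span replacement over detected entities. This is a minimal sketch under stated assumptions: each entity carries integer begin/end character offsets (end exclusive), matching the JSON shape requested in the Claude prompt; the function names are hypothetical. Offsets reported by an LLM can be wrong, so a real pipeline should verify each span against the reported value before redacting.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], placeholder: str = "[REDACTED]") -> str:
    """Replace each detected span with the placeholder.

    Spans are applied right to left so earlier offsets stay valid.
    Overlapping spans are clipped; malformed spans are skipped.
    """
    result = text
    last_begin = len(text)
    for ent in sorted(entities, key=lambda e: e["begin"], reverse=True):
        begin = ent["begin"]
        end = min(ent["end"], last_begin)
        if not (0 <= begin < end <= len(text)):
            continue
        result = result[:begin] + placeholder + result[end:]
        last_begin = begin
    return result


def anonymize(text: str, entities: List[Dict]) -> Dict:
    """Mirror the pipeline: no PII means no action, otherwise redact."""
    if not entities:
        return {"has_pii": False, "text": text}
    return {
        "has_pii": True,
        "text": redact(text, entities),
        "anonymization_status": "completed",
    }
```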

Data Clean Room Pattern

┌─────────────────────────────────────────┐
│  Restricted Table (full metadata)       │
│  • has_pii, anonymized_path, raw paths  │
│  • Access: Security team only           │
│  • Lake Formation: strict SELECT grant  │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│  Public Table (anonymized metadata)     │
│  • classification, summary (redacted)   │
│  • No PII, no raw file paths            │
│  • Access: All analysts                 │
│  • Lake Formation: broad SELECT grant   │
└─────────────────────────────────────────┘

Encryption and Data Residency

  • At rest: S3 Tables uses SSE-S3 encryption by default. All metadata is encrypted.
  • In transit: All API calls use TLS 1.2+.
  • Data residency: Both metadata (S3 Tables) and raw files (FSx for ONTAP) remain in the same AWS region. No cross-border data transfer occurs in the default architecture.

For detailed data sovereignty analysis, see the Architecture Document — Data Sovereignty section.

Audit Log Retention

  • CloudTrail: Default 90-day event history. For long-term retention, create a Trail delivering to S3 (recommended: 1+ year for regulated industries)
  • Lake Formation: Data access audit logs are recorded via CloudTrail
  • OpenSearch: Access logs can be delivered to CloudWatch Logs
  • Analysis: Use CloudTrail Lake (SQL queries) or Athena + S3 (cost-efficient) for audit analysis

For detailed operational monitoring setup, see the Operational Monitoring section in the architecture document.

Path Sensitivity Model

File paths can reveal sensitive context even when file contents are not exposed (e.g., /hr/layoffs/2026/ or /legal/mna/target-company/).

Recommended controls:

  • Store raw_path only in the restricted metadata table
  • Expose hashed_path or anonymized_path to general users
  • Use path_classification: public / internal / restricted / confidential
  • Apply Lake Formation grants to curated views, not the base table
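
One way to derive hashed_path and a public projection of a restricted record is sketched below. The allowlisted field names and the function names are assumptions for illustration; the PoC's actual schema may differ. A keyed hash (HMAC) is used because an unkeyed hash of a guessable path such as /hr/layoffs/2026/ could be brute-forced.

```python
import hashlib
import hmac

# Hypothetical allowlist of fields that are safe for general users.
PUBLIC_FIELDS = ("file_id", "file_type", "classification", "path_classification")


def hashed_path(raw_path: str, secret_key: bytes) -> str:
    """Keyed SHA-256 digest of a path: stable for joins, hard to guess without the key."""
    return hmac.new(secret_key, raw_path.encode("utf-8"), hashlib.sha256).hexdigest()


def public_record(record: dict, secret_key: bytes) -> dict:
    """Project a restricted record onto the allowlist and swap raw_path for its hash."""
    public = {k: record[k] for k in PUBLIC_FIELDS if k in record}
    public["hashed_path"] = hashed_path(record["raw_path"], secret_key)
    return public
```

An allowlist fails closed: a new sensitive column added to the restricted table stays hidden until someone explicitly exposes it.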

Raw Data Access Boundary

This architecture governs metadata access through S3 Tables and Lake Formation. It does not automatically replace:

  • ONTAP/NFS/SMB file-system permissions
  • S3 Access Point resource policies
  • IAM permissions for raw file reads
  • Application-level authorization
  • Downstream use of presigned URLs or copied files

Treat metadata governance and raw data governance as two linked but separate control planes. Both must be configured for end-to-end security.

S3 Access Point Identity Boundary

Each FSx for ONTAP S3 Access Point has an associated file-system identity (OntapFileSystemIdentity — UNIX UID/GID or Windows domain user). All file access through that AP is authorized as that identity.

For each access point, document:

  • IAM principals allowed to use the access point
  • Access point policy (allowed S3 actions)
  • Associated UNIX or Windows file-system identity
  • Allowed volume / prefix scope
  • Whether the identity can access files beyond what metadata governance intends
  • Audit evidence location

If the AI enrichment access point uses a broad UNIX identity (e.g., root or a service account with wide read access), metadata-level Lake Formation controls do not prevent raw file reads through that AP. Scope the AP identity to minimum required access.

See security/s3-access-point-identity-matrix.yaml for the template.

Permission Identity Strategy

For multiprotocol environments (NFS + SMB + S3 AP):

  • Record discovery_protocol: nfs / smb / s3ap
  • Record access_point_identity_type: unix / windows
  • Record effective_reader_identity
  • Record permission_source: nfs_mode / ntfs_acl / mixed
  • Do not assume metadata visibility implies raw file readability

Retention and Deletion Semantics

This PoC uses metadata records to represent file discovery and enrichment state. For regulated workloads, define:

  • Metadata retention period (how long to keep catalog records)
  • Raw file retention period (governed by storage policy, not this catalog)
  • Anonymized metadata retention period
  • Deletion request workflow (who can request, who approves, how it's executed)
  • Snapshot expiration impact on deletion (Iceberg time travel may expose deleted metadata until snapshots expire)
  • Audit evidence retention (keep deletion evidence longer than the data itself)

Important: Iceberg time travel is useful for recovery, but it means deleted metadata may still be queryable during the snapshot retention window. Align snapshot expiration with your data deletion SLA.

Snowflake-side retention: If redacted metadata is synced into Snowflake-managed tables, define Snowflake-side retention, Time Travel (default 1 day, up to 90 days), and Fail-safe (7 days, non-configurable) separately from Iceberg snapshot retention. Deletion from the Snowflake copy does not delete from the Iceberg source, and vice versa.

Approval Evidence Template (for Regulated Industries)

For organizations requiring formal access approval documentation:

Approval ID: <unique-id>
Data owner: <name/group>
Security owner: <name/group>
Platform owner: <name/group>
Allowed metadata columns: <columns>
Allowed raw file prefixes: <prefixes>
Allowed operations: metadata query only / raw file read / anonymized export
Review date: <date>
Expiration date: <date>
Evidence location: verification-evidence/<path>

Regulated Workload Readiness

For public sector, healthcare, financial services, and other regulated industries, validate the following before production deployment:

| Area | Requirement | Status in this PoC |
|---|---|---|
| Data residency | Metadata and raw files in same AWS Region | ✅ Single region (ap-northeast-1) |
| Encryption at rest | S3 Tables: SSE-S3; FSx: at-rest encryption | ✅ Default encryption |
| Encryption in transit | TLS 1.2+ for all API calls | ✅ AWS default |
| Raw data access boundary | File reads governed by S3 AP policy + ONTAP permissions | ✅ Documented |
| Metadata access boundary | Lake Formation table-level + CloudTrail audit | ✅ Verified |
| AI processing data flow | Content sent to Bedrock API, not stored by provider | ✅ Per AWS data protection policy |
| PII detection limitations | English (Comprehend) + Japanese (Claude) only | ⚠️ Other languages not covered |
| Human review workflow | Low-confidence queue defined | ✅ Design documented |
| Audit log retention | CloudTrail 90-day default; configure Trail for longer | ⚠️ Requires Trail setup |
| Deletion SLA | Define separately for metadata, raw files, and snapshots | ⚠️ Requires policy definition |
| Legal/compliance sign-off | Not in scope for this PoC | ❌ Required before production |

AI governance note: AI enrichment in this pattern is assistive metadata generation. It does not constitute authoritative regulatory classification. Final classification decisions, data handling approvals, and compliance certifications must be confirmed by data owners, security teams, legal counsel, and compliance officers.

Cross-Platform Access: The Current Reality

Fully Verified ✅

| Platform | Access Method | Status |
|---|---|---|
| Athena | Direct query via Glue federated catalog | ✅ Fully verified |
| Lambda/Python | PyIceberg SDK | ✅ Fully verified |
| EMR Spark | Glue Iceberg REST (EMR 7.13.0+) | ✅ Fully verified (SELECT, COUNT, time travel) |
| Snowflake | Glue Iceberg REST + VENDED_CREDENTIALS | ✅ Fully verified (CREATE TABLE, SELECT, COUNT, DESCRIBE, AUTO_REFRESH) |
| Snowflake | External Stage (FSx S3 AP) + TO_FILE + Cortex AI | ✅ Fully verified |
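
For the Lambda/Python row, a PyIceberg connection might look like the sketch below. It assumes the direct S3 Tables REST endpoint with SigV4 signing; the article does not state which endpoint the PoC's PyIceberg code used, and the function names are hypothetical.

```python
from typing import Dict


def s3tables_catalog_properties(region: str, account_id: str, bucket: str) -> Dict[str, str]:
    """Build PyIceberg REST catalog properties for an S3 Tables bucket."""
    return {
        "type": "rest",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "warehouse": f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket}",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    }


def load_metadata_table(region: str, account_id: str, bucket: str):
    """Load metadata.unstructured_files; requires pyiceberg and AWS credentials."""
    from pyiceberg.catalog import load_catalog  # lazy import: optional dependency

    catalog = load_catalog("s3tables", **s3tables_catalog_properties(region, account_id, bucket))
    return catalog.load_table("metadata.unstructured_files")
```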

Expected / Requires Validation ⚠️

| Platform | Access Method | Status |
|---|---|---|
| EMR Trino | Glue Iceberg REST (EMR 7.13.0+) | ⚠️ Expected (same EMR SigV4 handling as Spark) |
| Redshift Spectrum | Same as Athena (Glue catalog) | ⚠️ Expected, not fully validated |

What Doesn't Work (Yet) ⚠️

| Platform | Tested method | Result | Date tested | Status |
|---|---|---|---|---|
| Databricks SQL Warehouse | CREATE CONNECTION TYPE iceberg_rest to S3 Tables REST | CONNECTION_TYPE_NOT_SUPPORTED | 2026-05-31 | Observed limitation in this path |
| Databricks Spark cluster | Iceberg REST + SigV4 via spark.conf.set / cluster config | NO_SUCH_CATALOG_EXCEPTION (UC blocks external catalog registration) | 2026-06-01 | Confirmed: UC Foreign Catalog required |
| Databricks Delta Sharing | Delta Sharing server accessing S3 AP-backed storage | Sharing server uses same UC storage credentials; cannot bypass session policy | 2026-06-01 | Confirmed limitation (not a workaround for S3 AP) |
| Databricks NFS → UC Volume | NFS mount path as UC External Volume | Cloud storage URIs only (s3://, abfss://, gs://); NFS/FUSE paths not supported | 2026-06-01 | Confirmed limitation; internal feature request exists |
| Snowflake | External Iceberg Table with S3 Tables direct REST endpoint | Not a supported catalog type (use Glue REST instead) | 2026-05-31 | Use Glue REST + VENDED_CREDENTIALS (✅ verified) |
| Snowflake | CATALOG INTEGRATION with default ACCESS_DELEGATION_MODE | Defaults to EXTERNAL_VOLUME_CREDENTIALS which triggers ListObjectsV2 (rejected by S3 Tables) | 2026-06-02 | ✅ Resolved: set explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS |
| Snowflake | Lake Formation column-level via VENDED_CREDENTIALS | AllowFullTableExternalDataAccess=false blocks all VENDED_CREDENTIALS access | 2026-06-08 | Use Snowflake Horizon (Row Access Policy / Dynamic Masking) for column governance |
| Snowflake Open Catalog | Polaris as Iceberg catalog | Not tested | TBD | Strategic alternative |

Databricks: Three Integration Paths to Validate

Update note (2026-06-09): We revalidated the S3 Tables path after Databricks announced GA for Foreign Iceberg and credential vending (May 28, 2026). Glue Connection creation and credential configuration succeeded, but Unity Catalog External Location validation still failed because S3 Tables internal buckets reject standard S3 API validation (HeadBucket/ListBucket). The S3 Tables path remains blocked in this tested Databricks UC configuration. A new Databricks support case has been submitted.

For Databricks business users: The value is not only table access. The value is turning previously invisible NAS files into governed metadata assets that can be searched, explained, lineage-tracked, and consumed from Databricks SQL, AI/BI, and dashboards.

In this PoC, CREATE CONNECTION TYPE iceberg_rest to the S3 Tables REST endpoint returned CONNECTION_TYPE_NOT_SUPPORTED on Databricks SQL Warehouse (tested 2026-05-31). This does not mean Databricks lacks Iceberg REST support — Databricks provides Unity Catalog Iceberg REST endpoints and Foreign Iceberg capabilities that evolve rapidly.

Confirmed Limitations (2026-06-01)

| Path | Result | Confirmed by |
|---|---|---|
| Spark cluster + Iceberg REST (spark.conf.set / cluster config) | ❌ UC blocks external catalog registration | Databricks support + our testing |
| Delta Sharing via S3 Access Point | ❌ Sharing server uses same UC storage credentials | Databricks support |
| NFS mount path as UC External Volume | ❌ Cloud storage URIs only (s3://, abfss://, gs://) | Databricks support |
| DataSync → S3 → UC External Delta Table → Delta Sharing | ✅ Works (Delta format required) | Databricks support |

Delta Sharing note: Delta Sharing is not a workaround for the FSx S3 Access Point session policy limitation in our tested path. The sharing server uses the same UC storage credentials and cannot bypass the session policy that blocks S3 AP ARNs. Note that Databricks has announced first-class Iceberg format support in Delta Sharing (Jan 2026), enabling providers to share Iceberg tables via the Iceberg REST Catalog API. This broader capability is not contradicted by our finding — our limitation is specific to S3 AP-backed storage access through UC credentials, not Delta Sharing's format support in general.

NFS Volume note: UC External Volumes require cloud storage URIs. An internal feature request (AHA) exists for EFS/NFS access via UC. Until this is implemented, DataSync → S3 → UC External Location remains the only supported path.

📢 Databricks users: If S3 Tables access from Databricks is important for your workflow, the UC Foreign Catalog for S3 Tables feature is being tracked internally by Databricks (request DB-I-15824). Contact your Databricks account team to express interest and increase prioritization. Snowflake achieved full S3 Tables access via VENDED_CREDENTIALS in June 2026 — the same architectural pattern should be feasible for UC.

Immediate workaround for Databricks: Use DataSync → S3 → UC External Table to sync metadata into a standard S3 location accessible by Unity Catalog. This is not zero-copy for the synced metadata, but raw files remain on FSx for ONTAP.

Path 1: Spark cluster + Iceberg REST (SigV4)

Best for technical validation and batch processing.

Tested 2026-06-01: On Databricks with Unity Catalog enabled, external Iceberg catalogs cannot be registered via spark.conf.set or cluster Spark config. Unity Catalog controls catalog registration exclusively. Both Serverless (CONFIG_NOT_AVAILABLE) and All-Purpose clusters (NO_SUCH_CATALOG_EXCEPTION) fail. Unity Catalog Foreign Catalog (Path 2) is the required approach.

Two endpoint options exist for this configuration:

# Path 1a: Direct S3 Tables REST endpoint (used in this PoC)
spark.sql.catalog.s3tables.uri=https://s3tables.ap-northeast-1.amazonaws.com/iceberg
spark.sql.catalog.s3tables.rest.signing-name=s3tables

# Path 1b: AWS Glue Iceberg REST endpoint (recommended for production + Lake Formation)
spark.sql.catalog.s3tables.uri=https://glue.ap-northeast-1.amazonaws.com/iceberg
spark.sql.catalog.s3tables.rest.signing-name=glue

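Common config for both (only the warehouse value differs):

spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3tables.catalog-impl=org.apache.iceberg.rest.RESTCatalog
# Path 1a (S3 Tables endpoint): warehouse is the table bucket ARN
spark.sql.catalog.s3tables.warehouse=arn:aws:s3tables:ap-northeast-1:<ACCOUNT>:bucket/fsxn-metadata-catalog
# Path 1b (Glue endpoint): AWS documentation uses the catalog form instead
# spark.sql.catalog.s3tables.warehouse=<ACCOUNT>:s3tablescatalog/fsxn-metadata-catalog
spark.sql.catalog.s3tables.rest.sigv4-enabled=true
spark.sql.catalog.s3tables.rest.signing-region=ap-northeast-1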
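
For engines that accept catalog registration through Spark config (EMR in this PoC), the settings above can be generated programmatically. This is a sketch with a hypothetical helper name. The Glue-endpoint warehouse form (<ACCOUNT>:s3tablescatalog/<bucket>) follows AWS's documented Glue Iceberg REST examples and is an assumption here, since the PoC used the direct S3 Tables endpoint for this path.

```python
from typing import Dict


def rest_catalog_conf(endpoint: str, region: str, account_id: str,
                      bucket: str, name: str = "s3tables") -> Dict[str, str]:
    """Return Spark conf entries for an Iceberg REST catalog over S3 Tables.

    endpoint: "s3tables" (Path 1a) or "glue" (Path 1b).
    """
    if endpoint == "s3tables":
        uri = f"https://s3tables.{region}.amazonaws.com/iceberg"
        warehouse = f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket}"
    elif endpoint == "glue":
        uri = f"https://glue.{region}.amazonaws.com/iceberg"
        warehouse = f"{account_id}:s3tablescatalog/{bucket}"
    else:
        raise ValueError(f"unknown endpoint: {endpoint!r}")

    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": endpoint,  # signing name matches the service
        f"{prefix}.rest.signing-region": region,
    }
```

Each key/value pair can then be passed to SparkSession.builder.config(key, value) before the session is created.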

Path 2: Unity Catalog Foreign Iceberg

Register external Iceberg tables into Unity Catalog if supported for the target catalog/storage path. Best for Databricks governance, lineage, and discovery. Verify refresh semantics and read/write limitations. Retested for S3 Tables on 2026-06-09: Glue Connection and credentials succeeded, but UC External Location validation failed because S3 Tables internal buckets reject standard S3 API validation. This path remains blocked for the tested S3 Tables configuration.

Documentation/version note: Databricks Iceberg capabilities are evolving rapidly. Earlier documentation and our initial validation showed limitations around Foreign Iceberg credential vending and automatic refresh behavior. After the May 2026 GA announcement, we revalidated the S3 Tables path on 2026-06-09. Credential configuration progressed further, but the tested path still failed at UC External Location validation against S3 Tables internal storage. Additionally, Databricks supports catalog federation with AWS Glue (Hive Metastore type), which can expose Glue-registered tables in UC. Whether a future Iceberg REST catalog federation path could bypass the S3 Tables internal bucket constraint is an open question.

Refresh semantics: If UC Foreign Iceberg works for S3 Tables via Glue REST, define refresh semantics explicitly. Our metadata catalog is append-only (new records added on file events). Analysts should know whether Databricks reads the latest Iceberg snapshot automatically or only after REFRESH FOREIGN TABLE. Without auto-refresh, Athena and Databricks may show temporarily different results until the next refresh cycle. Plan for a scheduled refresh job or event-driven trigger.

AWS reference for this path: AWS has published guidance on accessing S3 Iceberg tables from Databricks using the Glue Iceberg REST Catalog. This validates the architectural direction of B-4/B-5, though S3 Tables-specific compatibility requires separate validation.

Path 3: AWS Glue Catalog Federation with Databricks

AWS Glue can federate metadata from Databricks Unity Catalog for Iceberg tables. This is the reverse direction but useful for cross-platform governance patterns.

Federation Directionality

| Pattern | Direction | Primary governance | Best for |
|---|---|---|---|
| UC Foreign Catalog / Catalog Federation to Glue | Databricks reads AWS-managed metadata | Unity Catalog | Databricks users querying AWS Iceberg (S3 Tables) |
| AWS Glue federation to UC | AWS reads Databricks-managed metadata | Lake Formation / Glue | Athena/EMR/Redshift reading UC Iceberg/UniForm |

AWS reference: AWS has published guidance on accessing S3 Iceberg tables from Databricks using AWS Glue Iceberg REST Catalog, and on federating Databricks Unity Catalog data into AWS Glue Data Catalog. Both directions are documented.

Why Iceberg here (not Delta Lake)? This architecture uses Iceberg because S3 Tables is Iceberg-native, and the Iceberg REST endpoint enables multi-engine access (Athena, EMR, Snowflake). For Databricks-only environments, Delta Lake on S3 remains the natural choice. This pattern targets multi-platform scenarios.

Databricks UC Audit Logging for External Engines (Confirmed 2026-06-01)

External engine access via the UC Iceberg REST Catalog endpoint is fully auditable:

| Audit aspect | Confirmed behavior |
|---|---|
| Metadata requests (listNamespaces, listTables, loadTable) | ✅ Logged in system.access.audit under uniformIcebergRestCatalog |
| Vended credential issuance | ✅ Logged as loadTableCredentials / generateTemporaryTableCredential |
| Audit fields | user_identity, source_ip_address, user_agent, event_time, action_name, request_params |
| Distinguish external vs internal | service_name = 'uniformIcebergRestCatalog' (external) vs 'unityCatalog' (internal) |

Note: Databricks audit logs record credential issuance, not individual S3 file reads after credentials are vended. Complement with AWS CloudTrail + S3 access logging for file-level audit.


Snowflake: S3 Tables via Glue REST + VENDED_CREDENTIALS ✅

Working Configuration (Verified 2026-06-05)

Snowflake can directly query S3 Tables Iceberg tables via the Glue Iceberg REST endpoint with VENDED_CREDENTIALS. Here's the complete working setup:

-- 1. Catalog Integration (CRITICAL: explicit ACCESS_DELEGATION_MODE)
CREATE OR REPLACE CATALOG INTEGRATION s3tables_glue_rest_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'metadata'
  REST_CONFIG = (
    CATALOG_URI = 'https://glue.ap-northeast-1.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS  -- MUST be explicit
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = 'arn:aws:iam::<ACCOUNT_ID>:role/fsxn-snowflake-verification-role'
    SIGV4_SIGNING_REGION = 'ap-northeast-1'
  )
  ENABLED = TRUE;

-- 2. Schema WITHOUT default EXTERNAL_VOLUME (critical)
CREATE SCHEMA FSXN_LAKEHOUSE.S3TABLES_VENDED;
USE SCHEMA FSXN_LAKEHOUSE.S3TABLES_VENDED;

-- 3. Table WITHOUT EXTERNAL_VOLUME parameter (critical)
CREATE ICEBERG TABLE s3tables_vended_creds_test
  CATALOG = 's3tables_glue_rest_int'
  CATALOG_TABLE_NAME = 'unstructured_files';

AWS prerequisites (must be completed before Snowflake configuration):

# Register S3 Tables resource with Lake Formation (--with-federation is REQUIRED)
aws lakeformation register-resource \
  --resource-arn "arn:aws:s3tables:ap-northeast-1:<ACCOUNT_ID>:bucket/fsxn-metadata-catalog" \
  --role-arn "arn:aws:iam::<ACCOUNT_ID>:role/S3TablesRoleForLakeFormation" \
  --with-federation \
  --region ap-northeast-1

# Grant SELECT + DESCRIBE to Snowflake's IAM role (table-level)
aws lakeformation grant-permissions \
  --principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::<ACCOUNT_ID>:role/fsxn-snowflake-verification-role"}' \
  --resource '{"Table":{"CatalogId":"<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog","DatabaseName":"metadata","Name":"unstructured_files"}}' \
  --permissions SELECT DESCRIBE \
  --region ap-northeast-1

IAM role policy must include: glue:GetTable, glue:GetDatabase, glue:GetCatalog, lakeformation:GetDataAccess, s3tables:GetTableBucket, s3tables:GetTable, s3tables:GetNamespace, s3tables:GetTableData, s3tables:GetTableMetadataLocation.
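
The permissions listed above can be assembled into a policy document as follows. This is a sketch: the function name is hypothetical, and Resource "*" is a placeholder assumption that should be narrowed to the specific catalog, table bucket, and table ARNs before production use.

```python
import json
from typing import Any, Dict

# Actions listed in this article for the Snowflake-assumed role.
SNOWFLAKE_ROLE_ACTIONS = [
    "glue:GetTable",
    "glue:GetDatabase",
    "glue:GetCatalog",
    "lakeformation:GetDataAccess",
    "s3tables:GetTableBucket",
    "s3tables:GetTable",
    "s3tables:GetNamespace",
    "s3tables:GetTableData",
    "s3tables:GetTableMetadataLocation",
]


def snowflake_role_policy(resource: Any = "*") -> Dict[str, Any]:
    """Build an IAM policy document granting the listed read actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": list(SNOWFLAKE_ROLE_ACTIONS),
            "Resource": resource,  # assumption: scope down for production
        }],
    }


policy_json = json.dumps(snowflake_role_policy(), indent=2)
```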

Verify integration (expected DESCRIBE output after creation):

DESCRIBE CATALOG INTEGRATION s3tables_glue_rest_int;

| Property | Value |
|---|---|
| ENABLED | true |
| CATALOG_SOURCE | ICEBERG_REST |
| TABLE_FORMAT | ICEBERG |
| CATALOG_NAMESPACE | metadata |
| REST_CONFIG | {CATALOG_URI=https://glue.ap-northeast-1.amazonaws.com/iceberg, ...} |
| REST_AUTHENTICATION | {TYPE=SIGV4, SIGV4_IAM_ROLE=arn:aws:iam::<ACCOUNT_ID>:role/..., ...} |
| API_AWS_IAM_USER_ARN | arn:aws:iam::465774455528:user/<snowflake-user-id> |
| API_AWS_EXTERNAL_ID | <external-id-for-trust-policy> |
| REFRESH_INTERVAL_SECONDS | 30 |

Setup step: Copy API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID from this output into your IAM role's trust policy to allow Snowflake to assume the role.
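
The trust policy for that step typically has the shape below. This is a sketch with a hypothetical helper name; the two arguments are the values you copy from the DESCRIBE output, and the ARN shown in the usage line is a placeholder.

```python
from typing import Any, Dict


def snowflake_trust_policy(iam_user_arn: str, external_id: str) -> Dict[str, Any]:
    """Build a trust policy letting the Snowflake IAM user assume the role,
    but only when it presents the expected external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": iam_user_arn},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


trust = snowflake_trust_policy(
    "arn:aws:iam::<SNOWFLAKE_ACCOUNT>:user/<snowflake-user-id>",  # API_AWS_IAM_USER_ARN
    "<external-id-for-trust-policy>",                              # API_AWS_EXTERNAL_ID
)
```

Note that CREATE OR REPLACE on the catalog integration may generate a new external ID, in which case the trust policy needs to be updated again.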

Verified Capabilities (2026-06-08)

| Operation | Result | Performance |
|---|---|---|
| CREATE ICEBERG TABLE | ✅ | 6.5s |
| SELECT * LIMIT 5 | ✅ (5 rows) | 1.9s |
| COUNT(*) | ✅ (170 rows) | 66ms |
| DESCRIBE TABLE | ✅ (23 columns) | 69ms |
| ALTER ... SET AUTO_REFRESH = TRUE | ✅ | 131ms |
| SHOW ICEBERG TABLES | ✅ (UNMANAGED type) | 567ms |
| Time Travel | ✅ (available, snapshot-dependent) | |

Screenshots (URL bar excluded, S3 Tables internal bucket masked):

  • COUNT(*) returns 170 rows in 66ms
  • DESCRIBE TABLE shows all 23 Iceberg columns
  • SHOW ICEBERG TABLES confirms UNMANAGED type with S3TABLES_GLUE_REST catalog
  • SELECT * LIMIT 5 returns actual file metadata from S3 Tables

AUTO_REFRESH + Time Travel verification:

  • AUTO_REFRESH verified: PyIceberg appended 1 record → Snowflake COUNT(*) automatically updated from 170 to 171 within 30 seconds
  • Time Travel: AT(OFFSET => -1200), i.e. querying 20 minutes ago, returns 170 (before the append), confirming snapshot history is accessible

About FILE_PATH: The FILE_PATH column shows the S3 path used during metadata ingestion (via FSx for ONTAP S3 Access Point). This is the path recorded in the Iceberg metadata catalog — it does not mean the files were copied to S3. The actual files remain on FSx for ONTAP and are accessible via NFS, SMB, or S3 Access Point depending on your application's protocol.

Key Insight: Why Previous Attempts Failed

ACCESS_DELEGATION_MODE defaults to EXTERNAL_VOLUME_CREDENTIALS when not explicitly specified. In this default mode, Snowflake validates storage access through the External Volume path, which triggers ListObjectsV2 against S3 Tables internal buckets — an operation that returns MethodNotAllowed.

With VENDED_CREDENTIALS explicit:

  1. Snowflake calls Glue REST loadTable
  2. Lake Formation (via GetTemporaryGlueTableCredentials) returns temporary storage credentials in the loadTable response config map
  3. Snowflake uses these credentials to access data files directly by exact path
  4. No ListObjectsV2 is required — Snowflake reads files by exact path from Iceberg metadata

Note: The Glue REST endpoint does not implement the standard Iceberg REST /credentials endpoint. Credential vending works through Lake Formation's proprietary mechanism embedded in the loadTable response. This is transparent to Snowflake when configured correctly.
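
When debugging this flow, it helps to check whether a loadTable response actually carries vended credentials. The sketch below assumes the response has been parsed into a dict with a "config" map, as in the Iceberg REST LoadTableResult; the function name is hypothetical, and the key names are the standard Iceberg S3 FileIO property names.

```python
from typing import Any, Dict, List

# Standard Iceberg S3 credential properties expected in the loadTable config map.
REQUIRED_CREDENTIAL_KEYS = (
    "s3.access-key-id",
    "s3.secret-access-key",
    "s3.session-token",
)


def missing_vended_credentials(load_table_response: Dict[str, Any]) -> List[str]:
    """Return the credential keys absent (or empty) in the loadTable config map.

    An empty list means the catalog vended temporary credentials.
    """
    config = load_table_response.get("config") or {}
    return [key for key in REQUIRED_CREDENTIAL_KEYS if not config.get(key)]
```

A non-empty result on a Lake Formation-registered table usually points back to the AWS prerequisites (resource registration with federation, or a missing grant) rather than to the Snowflake side.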

Governance Limitation: Lake Formation Column-Level (2026-06-08)

Lake Formation column-level filtering is NOT enforced via the VENDED_CREDENTIALS path:

  • When AllowFullTableExternalDataAccess = false, the entire VENDED_CREDENTIALS path is blocked
  • Explicit column/table-level grants + ExternalDataFilteringAllowList do not resolve this
  • AllowFullTableExternalDataAccess = true is required for VENDED_CREDENTIALS to function

Technical context: AllowFullTableExternalDataAccess controls whether a third-party engine can obtain data access credentials without session tags when the calling principal has full-table access. When it is set to false, fine-grained column/row filtering is the intended enforcement mechanism. For S3 Tables accessed via VENDED_CREDENTIALS, however, this currently results in complete access denial rather than filtered access.

This may be a service-specific constraint of the S3 Tables federated catalog path, or it may require additional AllowExternalDataFiltering + ExternalDataFilteringAllowList configuration that was not effective in our testing. A feature request has been submitted to AWS.

Workaround: Use Snowflake Horizon policies for row- and column-level governance:

-- Row Access Policy: restrict by sensitivity_level
CREATE OR REPLACE ROW ACCESS POLICY metadata_sensitivity_filter AS
  (sensitivity_level VARCHAR) RETURNS BOOLEAN ->
    CASE
      WHEN IS_ROLE_IN_SESSION('SECURITY_ADMIN') THEN TRUE
      WHEN sensitivity_level IN ('public', 'internal') THEN TRUE
      ELSE FALSE
    END;

ALTER TABLE s3tables_vended_creds_test ADD ROW ACCESS POLICY
  metadata_sensitivity_filter ON (sensitivity_level);

-- Dynamic Data Masking: hide embedding vectors from non-ML roles
CREATE OR REPLACE MASKING POLICY mask_embedding AS
  (val BINARY) RETURNS BINARY ->
    CASE
      WHEN IS_ROLE_IN_SESSION('ML_ENGINEER') THEN val
      ELSE NULL
    END;

ALTER TABLE s3tables_vended_creds_test MODIFY COLUMN
  embedding_vector SET MASKING POLICY mask_embedding;
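Before rolling these policies out, it can help to write down what each role is expected to see. The following is a plain-Python model of the two policies above, intended only for reasoning about expected results. It does not talk to Snowflake, the sample rows are hypothetical, and a set of role names is a simplification of IS_ROLE_IN_SESSION (which also considers the role hierarchy).

```python
from typing import Optional

# Hypothetical sample rows; real values come from the Iceberg table.
ROWS = [
    {"file_id": "f1", "sensitivity_level": "public", "embedding_vector": b"\x01"},
    {"file_id": "f2", "sensitivity_level": "confidential", "embedding_vector": b"\x02"},
]


def row_visible(sensitivity_level: Optional[str], session_roles: set[str]) -> bool:
    """Mirror of metadata_sensitivity_filter: SECURITY_ADMIN sees every row."""
    if "SECURITY_ADMIN" in session_roles:
        return True
    return sensitivity_level in ("public", "internal")


def mask_embedding(val: Optional[bytes], session_roles: set[str]) -> Optional[bytes]:
    """Mirror of mask_embedding: roles other than ML_ENGINEER get NULL (None)."""
    return val if "ML_ENGINEER" in session_roles else None


def query_as(session_roles: set[str]) -> list[dict]:
    """Apply the row filter, then mask the embedding column."""
    return [
        {**row, "embedding_vector": mask_embedding(row["embedding_vector"], session_roles)}
        for row in ROWS
        if row_visible(row["sensitivity_level"], session_roles)
    ]


if __name__ == "__main__":
    for roles in ({"ANALYST"}, {"ML_ENGINEER"}, {"SECURITY_ADMIN"}):
        print(sorted(roles), query_as(roles))
```

Note that in this model SECURITY_ADMIN sees all rows but still gets a masked embedding, because the two policies are evaluated independently.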

Snowflake Iceberg Access Modes (Summary)

| Access mode | Best for | Status |
| --- | --- | --- |
| Glue REST + VENDED_CREDENTIALS | S3 Tables direct query | ✅ VERIFIED |
| External Stage (FSx S3 AP) + TO_FILE | File AI analysis (Cortex COMPLETE) | ✅ VERIFIED |
| Metadata sync to Snowflake table | BI / Cortex Search / governance | Available |
| Object Store Catalog | Direct metadata file read | ❌ Blocked (S3 Tables internal bucket) |
| Snowflake Open Catalog (Polaris) | Alternative Iceberg catalog | Not tested |

📖 Investigation History (2026-05-31 to 2026-06-08)

2026-05-31: Tested S3 Tables direct REST endpoint as External Iceberg catalog → not a supported catalog type.

2026-06-01: Created a CATALOG INTEGRATION using ICEBERG_REST + AWS_GLUE + VENDED_CREDENTIALS. DESCRIBE succeeded, but CREATE ICEBERG TABLE failed with "Failed to retrieve credentials from the Catalog". Root cause: Glue REST does not implement the /credentials endpoint (UnknownOperationException).

2026-06-02: AWS Support confirmed that Lake Formation uses a proprietary mechanism (GetTemporaryGlueTableCredentials) for credential vending, not the standard Iceberg REST /credentials endpoint. Snowflake Support confirmed that Error 004174 occurs when s3.access-key-id/secret/token are absent from the loadTable response.

2026-06-02: Tested Object Store catalog and EXTERNAL_VOLUME_CREDENTIALS mode — both blocked by S3 Tables internal bucket rejecting ListObjectsV2.

2026-06-03: Discovered register-resource --with-federation was missing. After setup, loadTable response included credentials. However, CREATE TABLE still failed at storage validation (ListObjectsV2).

2026-06-05: Snowflake Support identified the critical distinction: ACCESS_DELEGATION_MODE defaults to EXTERNAL_VOLUME_CREDENTIALS. Setting VENDED_CREDENTIALS explicitly, creating the schema without an External Volume, and running CREATE TABLE without the External Volume parameter succeeded. CREATE TABLE and SELECT both work.

2026-06-08: Additional testing confirmed COUNT(*), DESCRIBE, AUTO_REFRESH, SHOW ICEBERG TABLES all working. Lake Formation column-level filtering NOT enforced via this path (AllowFullTableExternalDataAccess=false blocks all access).

External Stage note: Snowflake External Stage against the FSx S3 Access Point alias was verified in this PoC (2026-05-31, ap-northeast-1). Update (2026-06-02): TO_FILE (Cortex COMPLETE multimodal) also verified working — Claude Sonnet 4.5 can directly read files from FSx for ONTAP via S3 AP-backed External Stage. See snowflake/external-stage-fsx-s3ap-validation.md for exact DDL and verified operations.

Snowflake Metadata Activation Pattern

If you sync only the metadata into Snowflake (not raw files), you preserve the zero-copy principle for actual data while enabling Snowflake-native use cases:

  • Governed metadata analytics and executive dashboards
  • File inventory and PII coverage reporting
  • Cortex Search over redacted summaries (RAG on metadata)
  • Snowflake Intelligence / Cortex Analyst style business Q&A
  • Row Access Policies and Dynamic Masking on synced metadata

Horizon Catalog note: When metadata reaches Snowflake, Snowflake governance features such as Row Access Policies and Dynamic Masking can be applied to Snowflake-managed access paths. For external engine access via Iceberg REST, validate the exact Open Catalog / Horizon behavior for your target engine and security model.

Metadata sync best practice: Sync curated latest-record metadata, not the append-only base table, unless analysts explicitly need history. Preserve scan_run_id, change_type, and is_deleted for audit and reconciliation. Use MERGE INTO keyed by file_id or path_hash to make metadata activation idempotent. See snowflake/metadata-sync-example.sql for the full pattern.
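The idempotency property is easy to check in isolation. Below is a small in-memory sketch of the same idea in Python, not the Snowflake MERGE itself: it collapses an append-only scan log to the latest record per file_id and upserts it into a target keyed by file_id. It assumes scan_run_id is a sortable, increasing value, which the article does not state, and the sample events are hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def latest_records(events: list[Record]) -> dict[str, Record]:
    """Collapse an append-only scan log to the newest record per file_id."""
    latest: dict[str, Record] = {}
    for event in events:
        current = latest.get(event["file_id"])
        # Assumption: a larger scan_run_id means a more recent scan.
        if current is None or event["scan_run_id"] >= current["scan_run_id"]:
            latest[event["file_id"]] = event
    return latest


def merge_into(target: dict[str, Record], events: list[Record]) -> dict[str, Record]:
    """Upsert keyed by file_id; applying the same batch twice changes nothing."""
    merged = dict(target)
    for file_id, record in latest_records(events).items():
        existing = merged.get(file_id)
        if existing is None or record["scan_run_id"] >= existing["scan_run_id"]:
            merged[file_id] = record
    return merged


if __name__ == "__main__":
    batch = [
        {"file_id": "a", "scan_run_id": 1, "change_type": "created", "is_deleted": False},
        {"file_id": "a", "scan_run_id": 2, "change_type": "modified", "is_deleted": False},
        {"file_id": "b", "scan_run_id": 2, "change_type": "deleted", "is_deleted": True},
    ]
    once = merge_into({}, batch)
    twice = merge_into(once, batch)
    print(once == twice)
```

Deleted files stay in the target as rows with is_deleted set, so the audit fields survive the sync instead of disappearing with the row.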

Governance policy mapping: When syncing metadata into Snowflake, map AWS-side fields such as sensitivity_level, tenant_id, pii_status, and path_classification to Snowflake tags, masking policies, and row access policies. Track policy drift between Lake Formation and Snowflake governance. See snowflake/path-decision-guide.md for the full policy mapping.
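One simple way to track drift is to export, for each role, the sensitivity levels that each platform exposes and compare the two. The sketch below assumes that export already exists as plain dictionaries; the function name `policy_drift`, the role names, and the input shape are all hypothetical, and how the inputs are produced on either platform is out of scope here.

```python
def policy_drift(
    lake_formation: dict[str, set[str]],
    snowflake: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Compare, per role, the sensitivity levels each platform exposes.

    Returns only the roles where the two sides disagree.
    """
    drift: dict[str, dict[str, set[str]]] = {}
    for role in sorted(set(lake_formation) | set(snowflake)):
        lf_levels = lake_formation.get(role, set())
        sf_levels = snowflake.get(role, set())
        if lf_levels != sf_levels:
            drift[role] = {
                "only_lake_formation": lf_levels - sf_levels,
                "only_snowflake": sf_levels - lf_levels,
            }
    return drift


if __name__ == "__main__":
    # Hypothetical exports of what each role may see on each side.
    lf = {"analyst": {"public"}, "security_admin": {"public", "internal", "confidential"}}
    sf = {"analyst": {"public", "internal"}, "security_admin": {"public", "internal", "confidential"}}
    print(policy_drift(lf, sf))
```

A non-empty result is a prompt for review, not proof of a misconfiguration: the two platforms may be intentionally scoped differently.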

Snowflake Cortex Search Activation Pattern

If redacted metadata and summaries are synced into Snowflake, Cortex Search can provide Snowflake-native enterprise search and RAG over metadata — without managing embeddings, infrastructure, or search quality tuning.

Why Cortex Search here:

  • Business users can search approved metadata without operating a separate vector database
  • RAG and enterprise search can run over redacted summaries already governed in Snowflake
  • Search quality, embedding management, and index refresh are delegated to Snowflake-managed services
  • This is best suited for Snowflake-first organizations that want business-facing discovery inside the AI Data Cloud

Use Cortex Search for:

  • Executive metadata search (natural language queries over file inventory)
  • File inventory Q&A (powered by LLM + retrieval)
  • PII coverage reporting and compliance dashboards
  • Governed search over redacted summaries

OpenSearch Serverless NextGen remains the AWS-native serving index for this PoC. Cortex Search is an optional Snowflake-native alternative for organizations that standardize on Snowflake for business discovery.

Role separation: S3 Tables / Iceberg remains the metadata source of truth. OpenSearch (AWS path) or Cortex Search (Snowflake path) are serving indexes for search UX. Choose based on your primary platform. Cortex Search operates over redacted summaries and metadata synced into Snowflake, not raw files, unless the customer explicitly chooses to copy/extract document content into Snowflake (which would break the zero-copy raw data principle).

Cortex Search scope: Cortex Search should operate on redacted metadata and summaries by default. If raw document content is extracted or copied into Snowflake for Cortex use cases, treat that as a separate data movement decision with its own governance, retention, and cost model.

Snowflake activation cost drivers: Snowflake activation introduces separate cost drivers from the AWS-native catalog: warehouse compute for metadata sync tasks and dashboards, Cortex Search service usage (based on corpus size and query volume), task/stream orchestration for refresh, and small metadata storage. These costs should be modeled separately from the AWS-native catalog cost ($114/month estimate in Part 1 does not include Snowflake-side compute).

Retention alignment: Confirm Snowflake account edition, table type, and retention settings before promising deletion SLAs. Snowflake Time Travel (default 1 day, configurable up to 90 days on Enterprise Edition or higher) and Fail-safe (7 days for permanent tables) operate independently from Iceberg snapshot expiration. Snowflake-side deletion evidence should be retained separately from Iceberg snapshot expiration evidence.
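The arithmetic behind that check is short. As a rough sketch, assuming the synced metadata lives in a permanent Snowflake table (so Fail-safe follows Time Travel) and that the Iceberg side expires snapshots after a fixed number of days, the longest recoverable window on either side has to fit inside the deletion SLA. The function names and example numbers below are illustrative only.

```python
def snowflake_recovery_window_days(time_travel_days: int, fail_safe_days: int = 7) -> int:
    """Days a deleted row may remain recoverable on the Snowflake side.

    Fail-safe starts when Time Travel ends, so the two periods add up.
    Pass fail_safe_days=0 for transient tables, which have no Fail-safe.
    """
    if time_travel_days < 0 or fail_safe_days < 0:
        raise ValueError("retention periods must be non-negative")
    return time_travel_days + fail_safe_days


def meets_deletion_sla(
    sla_days: int,
    time_travel_days: int,
    iceberg_snapshot_expiry_days: int,
    fail_safe_days: int = 7,
) -> bool:
    """True when both platforms purge deleted data within the SLA."""
    snowflake_days = snowflake_recovery_window_days(time_travel_days, fail_safe_days)
    return max(snowflake_days, iceberg_snapshot_expiry_days) <= sla_days


if __name__ == "__main__":
    print(meets_deletion_sla(sla_days=30, time_travel_days=1, iceberg_snapshot_expiry_days=7))
    print(meets_deletion_sla(sla_days=30, time_travel_days=90, iceberg_snapshot_expiry_days=7))
```

With a 30-day SLA, a 1-day Time Travel setting fits (1 + 7 = 8 days), while a 90-day setting does not (97 days), regardless of how quickly Iceberg snapshots expire.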

Snowflake Metadata Product Contract

When activating metadata in Snowflake, expose a curated subset as the governed metadata product:

Recommended curated columns:

| Column | Purpose | Governance |
| --- | --- | --- |
| file_id | Unique identifier | |
| business_domain | Organizational grouping | Row access policy |
| file_type | File format | |
| classification | AI-generated classification | |
| sensitivity_level | Data sensitivity tier | Snowflake tag + masking policy |
| pii_status | PII detection result | Access policy / dashboard filter |
| redacted_summary | AI-generated (PII-free) summary | Cortex Search source column |
| owner_team | Business ownership | Business glossary / stewardship |
| last_seen_at | Last scan timestamp | |
| data_quality_status | Enrichment quality flag | |

Snowflake governance mapping:

  • sensitivity_level → Snowflake tag + masking policy
  • tenant_id / business_domain → row access policy
  • pii_status → access policy / dashboard filter
  • redacted_summary → Cortex Search source column
  • owner_team → business glossary / stewardship workflow

Databricks Metadata Activation Pattern

If UC Foreign Catalog is not yet validated for your S3 Tables path, sync only the redacted metadata into a UC-managed Delta table. This preserves the zero-copy principle for raw files while enabling Databricks-native use cases:

  • Databricks SQL dashboards and executive reporting
  • AI/BI Genie over curated metadata (natural language queries)
  • UC lineage and audit on metadata usage
  • ML feature generation from file metadata
  • Operational reporting on PII coverage and enrichment backlog

Raw files remain on FSx for ONTAP. Only the small metadata table (~MB scale for 100K files) is synced.

This is analogous to the Snowflake metadata activation pattern: it copies only curated metadata, not the original unstructured files. Both patterns preserve the zero-copy principle for raw data.

Databricks Raw File Access Decision:

| Requirement | Recommended path |
| --- | --- |
| Governed metadata analytics only | UC Foreign Catalog (if validated) or sync metadata to UC Delta |
| Raw file processing in Databricks | DataSync → S3 → UC External Volume |
| Zero-copy raw file access from Databricks | Not supported in validated paths (NFS mount works but without UC governance) |
| Business discovery / BI | Sync redacted metadata to UC Delta table |

If metadata is synced into Databricks for BI, include Databricks SQL / Jobs compute cost in the activation model. This does not affect raw-file zero-copy storage, but it is part of the business-facing analytics cost.

Other Lakehouse Engines to Validate

Beyond Databricks and Snowflake, the most natural validation targets for this metadata catalog are:

| Engine | Access path | Likely fit | Validation priority |
| --- | --- | --- | --- |
| Trino / Starburst | Glue Iceberg REST or S3 Tables REST | Federated SQL, ad hoc query | High |
| EMR Spark | Glue Iceberg REST (native since EMR 7.5.0; this PoC required 7.13.0+) | Bulk backfill, batch enrichment | High |
| Redshift Spectrum | Glue catalog (external schema) | DWH integration, BI | Medium |
| Dremio | Glue catalog or Iceberg REST | Query acceleration, BI | Medium |
| StarRocks / Doris | Glue Iceberg REST | Low-latency serving queries | Medium |
| Apache Flink | Glue Iceberg REST | Streaming metadata updates | Low |
| dbt (via Athena) | dbt-athena + Iceberg materialization | Analytics engineering, governed marts | Medium |
| Apache NiFi | Iceberg REST or Polaris | Event-driven ingestion | Low |

These engines should be validated against:

  • S3 Tables direct REST vs AWS Glue Iceberg REST
  • Read vs write capability
  • Lake Formation behavior (credential vending, column/row filtering)
  • Snapshot freshness after external writes
  • Latest-record view compatibility
  • Case-sensitivity and lowercase naming requirements

Key findings from validation (2026-06-08):

  • AWS Glue Iceberg REST supports SigV4-authenticated catalog access.
  • Lake Formation credential vending works through a proprietary mechanism (GetTemporaryGlueTableCredentials).
  • Engines that can sign requests with their own IAM credentials work out of the box: EMR Spark (✅ verified), PyIceberg (✅ verified), Trino on EMR (expected).
  • Snowflake requires explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS; the default mode fails. With that setting it works (✅ verified 2026-06-05).
  • EMR requirement: 7.13.0+ (7.5.0 has a credential resolution bug).
  • Governance note: Lake Formation column-level filtering is NOT enforced via the VENDED_CREDENTIALS path for Snowflake.
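For engines that sign their own requests, the catalog configuration is mostly a handful of REST properties. The sketch below builds them for PyIceberg; it is an illustration rather than the PoC's exact code. The helper name `glue_rest_catalog_properties` and the account, bucket, and region values are hypothetical, and the property names follow the AWS example for the Glue Iceberg REST endpoint as best understood here, so verify them against the PyIceberg version you use.

```python
def glue_rest_catalog_properties(account_id: str, table_bucket: str, region: str) -> dict[str, str]:
    """Build Iceberg REST catalog properties for the Glue endpoint with SigV4 signing."""
    return {
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Warehouse identifies the S3 Tables federated catalog for one table bucket.
        "warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }


if __name__ == "__main__":
    props = glue_rest_catalog_properties("111122223333", "my-table-bucket", "ap-northeast-1")
    for key, value in props.items():
        print(f"{key} = {value}")

    # Usage with PyIceberg (needs the pyiceberg package and AWS credentials):
    # from pyiceberg.catalog import load_catalog
    # catalog = load_catalog("s3tables", **props)
    # print(catalog.list_namespaces())
```

The same values map onto Spark or Trino REST catalog settings, which is why SigV4-capable engines tend to work without extra credential plumbing.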

Trino note: AWS has published guidance on querying S3 Tables from Trino using the Iceberg REST endpoint. Trino's Iceberg connector supports REST catalogs natively, making it one of the most straightforward third-party validation targets.

EMR Spark note: For large-scale backfill or re-enrichment (100K+ files), Spark on EMR Serverless or EMR on EC2 can be used as an alternative to Lambda/Fargate. Use Glue Iceberg REST for centralized metadata access with Lake Formation governance. Verified (2026-06-02): EMR Serverless Spark 7.13.0 successfully reads S3 Tables metadata via Glue Iceberg REST — SHOW NAMESPACES, SHOW TABLES, SELECT, COUNT, and snapshot history all work. Requires EMR 7.13.0+ (7.5.0 has a credential resolution bug for S3 Tables warehouse format).
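Because the minimum release matters, a deployment script can guard against older labels. The following is a minimal sketch of such a guard, with hypothetical helper names, assuming release labels of the form emr-MAJOR.MINOR.PATCH. It compares numeric tuples because a plain string comparison would rank 7.5.0 above 7.13.0.

```python
MIN_EMR_RELEASE = (7, 13, 0)  # minimum release that worked in this PoC


def parse_emr_release(label: str) -> tuple[int, ...]:
    """Turn a label such as 'emr-7.13.0' into (7, 13, 0)."""
    return tuple(int(part) for part in label.removeprefix("emr-").split("."))


def emr_release_supported(label: str) -> bool:
    """True when the release is at least the minimum verified for S3 Tables."""
    return parse_emr_release(label) >= MIN_EMR_RELEASE


if __name__ == "__main__":
    for label in ("emr-7.5.0", "emr-7.12.0", "emr-7.13.0"):
        print(label, emr_release_supported(label))
```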

Redshift note: Validate separately from Athena — external schema setup, Glue statistics, Lake Formation permissions, and query latency against latest-record views may differ.

For the full compatibility matrix, see lakehouse-tools/tool-compatibility-matrix.yaml.

Catalog Authority Rule

For each Iceberg table, define exactly one authoritative catalog for metadata pointer and commit coordination. Do not operate S3 Tables, Polaris, Gravitino, Nessie, and Glue as independent writable catalogs for the same table unless the integration explicitly supports federation without dual writes.

                    ┌──────────────────────┐
                    │ Authoritative Catalog│
                    │ (ONE per table)      │
                    │ • S3 Tables + Glue   │
                    │   (this PoC)         │
                    └──────────┬───────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
              ▼                ▼                ▼
         Read-only        Read-only        Read-only
         consumers        consumers        consumers
         (Trino,          (Databricks,     (Snowflake,
          Dremio,          UC Foreign       Cortex,
          StarRocks)       Catalog)         Open Catalog)

Split-brain warning: If two catalogs independently write to the same Iceberg table, snapshot pointers can diverge, causing data loss or corruption. Federation (one writer, many readers) is safe. Dual-write is not.
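The one-writer rule can be checked mechanically if catalog bindings are kept in an inventory. Here is a small sketch under that assumption: each binding is a (table, catalog, mode) tuple, and the function reports every table whose number of writable catalogs is not exactly one. The inventory format, the function name, and the table and catalog names are hypothetical.

```python
from collections import defaultdict


def authority_violations(bindings: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Return tables that do not have exactly one writable catalog.

    Each binding is (table, catalog, mode), where mode is "write" or "read".
    The value lists the writable catalogs found (empty when there is none).
    """
    writers: dict[str, list[str]] = defaultdict(list)
    tables: set[str] = set()
    for table, catalog, mode in bindings:
        tables.add(table)
        if mode == "write":
            writers[table].append(catalog)
    return {t: sorted(writers[t]) for t in sorted(tables) if len(writers[t]) != 1}


if __name__ == "__main__":
    inventory = [
        ("file_metadata", "s3tables+glue", "write"),
        ("file_metadata", "snowflake", "read"),
        ("audit_log", "s3tables+glue", "write"),
        ("audit_log", "polaris", "write"),  # second writer: split-brain risk
    ]
    print(authority_violations(inventory))
```

In this example file_metadata passes (one writer, one reader) and audit_log is flagged because two catalogs are registered as writers.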

The Bigger Picture

                    S3 Tables (Iceberg)
                           │
              ┌────────────┼────────────┐
              │            │            │
              ▼            ▼            ▼
         Athena ✅    Databricks ❌  Snowflake ✅
         EMR Spark ✅  (UC Foreign    (Glue REST +
         PyIceberg ✅   path still    VENDED_CREDENTIALS
                      blocked in     verified 2026-06-05)
                      tested config)

Databricks integration summary (confirmed 2026-06-01):

  • Direct S3 AP access: ❌ (UC session policy)
  • NFS mount → UC Volume: ❌ (cloud URI only)
  • Delta Sharing via S3 AP: ❌ (same credentials)
  • DataSync → S3 → UC: ✅ (supported workaround, not zero-copy for synced data)
  • UC Foreign Catalog / Foreign Iceberg via Glue: ❌ Retested 2026-06-09; Glue Connection and credentials succeeded, but UC External Location validation failed against S3 Tables internal storage. Support case submitted.

For capability-level details such as read, write, time travel, metadata tables, and governance behavior, see verification-evidence/cross-platform-compatibility.yaml.

This is likely a temporary gap. S3 Tables is relatively new (GA December 2024), and cross-platform federation is under active development. Feature requests have been filed with both platforms. The timeline for native S3 Tables support is unknown, but the Iceberg ecosystem is converging rapidly: Unity Catalog 2.0's native Iceberg support and Snowflake's Open Catalog (Polaris) both point toward broader interoperability.

Catalog Decision Guide

In the Iceberg world, the catalog is the system of record for table metadata pointers and atomic operations. Choose based on your primary platform:

| Primary platform | Recommended catalog | Notes |
| --- | --- | --- |
| AWS-first / Athena-first | S3 Tables + Glue/Lake Formation | Used in this PoC |
| Databricks-first | Unity Catalog Managed/Foreign Iceberg | Best for UC governance, lineage, discovery |
| Snowflake-first | Snowflake Open Catalog (Polaris) | Best for Snowflake-governed Iceberg interoperability; validate external engine governance behavior |
| Neutral / OSS-first | Apache Polaris or other REST catalog | Maximum portability |

Dual catalog warning: Avoid running two authoritative catalogs for the same Iceberg table. Use Snowflake Open Catalog / Polaris when Snowflake or a neutral REST catalog should be authoritative. Use S3 Tables when AWS-native Athena / Lake Formation / Glue governance is authoritative. If both platforms need access, use federation (one authoritative catalog, others read via REST).

When to Consider Snowflake Open Catalog / Polaris

Use S3 Tables + Glue/Lake Formation when AWS-native governance is authoritative (this PoC).

Consider Snowflake Open Catalog / Polaris when:

  • Snowflake should be the primary governance and interoperability plane
  • Multiple engines need Iceberg REST access through a neutral catalog
  • Snowflake-managed Iceberg or Snowflake-first AI/Data Cloud workflows are the center of gravity
  • You want managed Polaris instead of operating your own REST catalog

This would be a different authoritative-catalog design from the current PoC and should not be mixed as a second writer for the same table.

Databricks-first note: For organizations standardizing on Databricks, consider whether the metadata catalog itself should be managed in Unity Catalog as Managed Iceberg or Delta + UniForm, then exposed to AWS engines through Glue federation to UC or the UC Iceberg REST endpoint. Use S3 Tables when AWS-native Athena/Lake Formation is the primary governance path. The choice depends on which governance plane (UC or Lake Formation) is authoritative for your organization.

Format Decision for Databricks Environments

| Option | Best for | Tradeoff |
| --- | --- | --- |
| S3 Tables Iceberg | AWS-first Athena/LF governance | UC integration pending (Foreign Catalog validation) |
| UC Managed / Foreign Iceberg | Databricks-first open format governance | Validate current feature availability, region support, and limitations |
| Delta + UniForm | Databricks-native pipelines + Iceberg read compatibility | Iceberg metadata generated asynchronously; non-Databricks writes constrained |
| Metadata sync to Delta | BI activation in Databricks SQL | Metadata copy, but raw files remain zero-copy |

Summary: What We Built

| Layer | Technology | Status |
| --- | --- | --- |
| Storage | FSx for ONTAP (files) + S3 Tables (metadata) | ✅ Verified |
| AI | Bedrock Claude Vision + Titan Embeddings V2 | ✅ Verified |
| Search | OpenSearch Serverless NextGen (scale-to-zero) | ✅ Verified |
| Governance | Lake Formation (table-level) + CloudTrail | ✅ Verified |
| PII | Comprehend (EN) + Bedrock Claude (JA) | ✅ Verified |
| Cross-platform | Athena ✅, EMR Spark ✅, PyIceberg ✅, Snowflake ✅, Databricks ⚠️ | Mostly verified |

The Numbers

  • 42 seconds: Full demo execution time
  • $0.07: Total demo cost
  • Near $0 idle compute/search cost: Persistent metadata, logs, and audit trails may still incur small charges
  • $114/month: Projected cost at 100K files, 1000 changes/day
  • 95%: Storage cost reduction vs S3 full copy
  • 0.95: AI classification confidence (invoice detection)
  • 7/7: PII entities detected and redacted

For regulated workloads, align Iceberg snapshot retention with deletion SLAs and audit evidence retention.

What's Next for This Project

  1. Monitor support cases: Databricks UC Foreign Catalog for S3 Tables — timeline unknown
  2. Production hardening: SQS batching, DLQ alerting, reconciliation jobs
  3. Multi-language PII: Extend beyond EN/JA to other languages
  4. Cost optimization: Provisioned Throughput for high-volume Bedrock usage
  5. Production semantics: File identity, latest-record views, index reconciliation, and snapshot retention alignment
  6. ONTAP production hardening: S3 Access Point identity matrix, FPolicy event filtering, SnapMirror catalog rebinding, and FSx performance dashboard
  7. Snowflake governance: Implement Horizon Row Access Policies and Dynamic Data Masking for column-level protection (since Lake Formation column-level is not enforced via VENDED_CREDENTIALS)

Get Involved

  • Star the repo if this was useful
  • 🐛 Open an Issue for questions or suggestions
  • 🍴 Fork and adapt for your own unstructured data catalog

This concludes the 3-part series. All code is at github.com/Yoshiki0705/fsxn-lakehouse-integrations. Questions? Open a GitHub Issue.

Governance disclaimer: This article provides governance guidance and architectural patterns. It does not substitute for legal or compliance judgment. Final regulatory determinations should be confirmed with legal and compliance teams.
