Previously...
In Part 1, we built the metadata catalog. In Part 2, we added AI classification and vector search. Now we need to answer the hard questions:
- Who can see what? (governance)
- What about PII? (anonymization)
- Can Databricks/Snowflake access this? (cross-platform)
Lake Formation: Governance on Unstructured Data
The Problem
Unstructured data on NAS storage may be well protected at the file-system layer, but it is often not consistently classified, searchable, or governed from analytics and AI workflows:
- No unified classification → you may not know what's sensitive across the entire corpus
- File-system permissions exist, but analytics/AI tools can't leverage them for discovery
- Audit trails may exist at the file-system layer, but they are often not unified with analytics and AI query activity
The Solution
With metadata in S3 Tables (Iceberg), Lake Formation provides:
┌───────────────────────────────────────────────────┐
│                  Lake Formation                   │
│                                                   │
│  Table-level: SELECT, DESCRIBE                    │
│  Column exposure: controlled via Athena Views     │
│                   (hide embedding_vector, paths)  │
│  Row filtering: WHERE sensitivity_level = 'public'│
│  Audit: CloudTrail logs metadata queries          │
└───────────────────────────────────────────────────┘
Verified: Access Control in Action
Step 1: Authorized user queries metadata
→ ✅ SUCCEEDED (3 rows returned)
Step 2: Revoke SELECT permission
→ 🔒 BLOCKED: "Column 'file_name' cannot be resolved
or requester is not authorized"
Step 3: Restore permission
→ ✅ SUCCEEDED (access restored)
Step 4: CloudTrail audit
→ All queries logged with user identity and timestamp
Every query against the metadata table is governed and audited. This gives you 100% metadata query governance coverage in this PoC. Raw file access remains governed separately by FSx for ONTAP file-system permissions, S3 Access Point policies, and application-specific access paths.
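The grant/revoke cycle above can be sketched with boto3's Lake Formation client. This is a minimal sketch, not the PoC's actual script: the helper names (table_permission_request, set_select_access) are hypothetical, and the catalog, database, and table names are the placeholders used elsewhere in this article.

```python
from typing import Any, Dict, List

# Placeholder catalog ID in the form used by this article's CLI examples.
CATALOG_ID = "<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog"


def table_permission_request(principal_arn: str, permissions: List[str]) -> Dict[str, Any]:
    """Build the keyword arguments shared by grant_permissions and revoke_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                "CatalogId": CATALOG_ID,
                "DatabaseName": "metadata",
                "Name": "unstructured_files",
            }
        },
        "Permissions": permissions,
    }


def set_select_access(lf_client: Any, principal_arn: str, allowed: bool) -> str:
    """Grant or revoke table-level SELECT and report which action was taken."""
    request = table_permission_request(principal_arn, ["SELECT"])
    if allowed:
        lf_client.grant_permissions(**request)
        return "granted"
    lf_client.revoke_permissions(**request)
    return "revoked"
```

In a real run, lf_client would be boto3.client("lakeformation", region_name="ap-northeast-1"). Calling set_select_access with allowed=False and then allowed=True corresponds to Steps 2 and 3.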
Lake Formation Governance Status
| Capability | Status | Notes |
|---|---|---|
| Table-level SELECT / DESCRIBE | ✅ Verified | Grant/revoke works correctly |
| Athena query governance | ✅ Verified | Unauthorized access blocked |
| CloudTrail audit logging | ✅ Verified | All queries logged with user identity |
| Column-level exclusion (ColumnWildcard) | ⚠️ Failed | On tested S3 Tables federated catalog path |
| Row-level filtering / LF-Tags | 📋 Design pattern | Taxonomy defined, needs validation |
| Column exposure via Athena Views | ✅ Workaround | Recommended alternative to column-level grants |
Observed Limitation: Column-Level Grants on This S3 Tables Federated Catalog Path
In this PoC, table-level Lake Formation SELECT grants worked as expected. However, column exclusion grants using ColumnWildcard with ExcludedColumnNames returned InvalidInputException: Permissions modification is invalid against the s3tablescatalog/... federated catalog path we tested.
AWS documentation describes table, column, and row-level permissions for S3 Tables integrated with Lake Formation. Therefore, treat this as an observed limitation in our specific validation path (CLI command, region, catalog ID, engine version), not a confirmed general product limitation. The exact error and test conditions are recorded in the verification evidence.
Workaround: Create Athena Views that expose only permitted columns:
-- View for general users (no embeddings, no PII paths)
CREATE VIEW metadata.public_files AS
SELECT file_id, file_name, file_type, classification, confidence_score
FROM "s3tablescatalog/fsxn-metadata-catalog"."metadata"."unstructured_files"
WHERE is_deleted = false AND sensitivity_level = 'public';
-- Apply Lake Formation on the view
-- Users query the view, not the base table
Governance model choice: For simple use cases, table/column-level permissions suffice. For dynamic, attribute-based access (e.g., "only files classified as 'public'"), use LF-Tags. For enterprise SSO integration, combine with IAM Identity Center.
For enterprise governance, map sensitivity_level, path_classification, tenant_id, and pii_status to LF-Tags. See governance/lf-tag-taxonomy.yaml.
Untested alternative: Registering the S3 Tables table in a standard (non-federated) Glue Catalog may enable column-level permissions. This requires manual Iceberg metadata location configuration and has not been verified.
PII Detection: English + Japanese
The Challenge
Amazon Comprehend's detect_pii_entities API supports only English and Spanish. For Japanese PII (names, addresses, My Number), we need a different approach.
Dual-Engine Architecture
| Language | Engine | Detectable PII | Latency | Cost |
|---|---|---|---|---|
| English | Amazon Comprehend | NAME, EMAIL, PHONE, ADDRESS, SSN, CREDIT_DEBIT_NUMBER, DATE_TIME | ~200ms | $0.0001/100 chars |
| Japanese | Bedrock Claude | 氏名, メール, 電話, 住所, マイナンバー, クレジットカード, 生年月日 | ~2-5s | ~$0.003/request |
Data privacy note: When using Bedrock Claude for PII detection, document text is sent to the Bedrock API. Per AWS's data privacy policy, Bedrock does not store or use your inputs/outputs to train models. For highly sensitive workloads, consider VPC endpoints and AWS PrivateLink for Bedrock access.
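Routing between the two engines needs some language check before detection. The sketch below uses a simple character-range heuristic. It is an illustration only: the function name choose_pii_engine and the 5% threshold are assumptions, not the PoC's routing logic, and kanji-only text would also match Chinese documents.

```python
import re

# Hiragana, Katakana, and CJK Unified Ideographs (kanji) code point ranges.
_JAPANESE_CHARS = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")


def choose_pii_engine(text: str, min_ratio: float = 0.05) -> str:
    """Return "bedrock" when Japanese characters make up at least min_ratio of the text."""
    stripped = "".join(text.split())  # ignore whitespace when computing the ratio
    if not stripped:
        return "comprehend"
    japanese = len(_JAPANESE_CHARS.findall(stripped))
    return "bedrock" if japanese / len(stripped) >= min_ratio else "comprehend"
```

A language detection API could replace the heuristic. The trade-off is one more API call per document against a cheap local check.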
Japanese PII Detection (Verified)
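# Bedrock Claude detects Japanese PII via prompt
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # uses the default region and credentials
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",  # required for Claude Messages requests on Bedrock
        "max_tokens": 1024,  # required field; the limit here is an assumed value
        "messages": [{"role": "user", "content":
            f"Detect all PII in this text. Return JSON array: "
            f'[{{"type":"...","value":"...","begin":N,"end":N}}]\n\n'
            f"Text:\n{japanese_text}"}]
    })
)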
Results on a controlled synthetic sample (not real personal data):
| PII Type | Detected Value |
|---|---|
| NAME | 山田太郎 |
| EMAIL | taro.yamada@example.co.jp |
| PHONE | 090-1234-5678 |
| ADDRESS | 〒150-0002 東京都渋谷区渋谷1-2-3 |
| MY_NUMBER | 1234 5678 9012 |
| CREDIT_CARD | 4111-1111-1111-1111 |
| DATE_OF_BIRTH | 1985年3月15日 |
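To turn the model's reply into rows like the table above, the response body has to be decoded and the JSON array pulled out of the text. The following is a minimal sketch that assumes the Anthropic Messages response shape (a content list of text blocks). The function name parse_pii_response is hypothetical, and the tolerance for surrounding prose is a defensive assumption rather than observed behavior.

```python
import json
from typing import Any, Dict, List


def parse_pii_response(body_bytes: bytes) -> List[Dict[str, Any]]:
    """Extract the PII entity list from a Bedrock Claude response body."""
    payload = json.loads(body_bytes)
    text = "".join(
        block.get("text", "")
        for block in payload.get("content", [])
        if block.get("type") == "text"
    )
    # The model may wrap the array in prose, so take the outermost [...] span.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries that carry both a type and a value.
    return [i for i in items if isinstance(i, dict) and "type" in i and "value" in i]
```

With the invoke_model call shown earlier, the input would be response["body"].read(). Offsets reported by a language model are worth re-checking against the source text before they drive redaction.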
Anonymization Pipeline
Original document
        │
        ▼
PII Detection (Comprehend or Bedrock)
        │
        ├─ No PII → has_pii = false (no action needed)
        │
        └─ PII found → has_pii = true
                │
                ▼
           Redaction: all PII → [REDACTED]
                │
                ▼
           Store anonymized version
           anonymization_status = "completed"
Before:
Name: Taro Yamada
Email: taro.yamada@example.com
Phone: 090-1234-5678
SSN: 123-45-6789
After:
Name: [REDACTED]
Email: [REDACTED]
Phone: [REDACTED]
SSN: [REDACTED]
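The redaction step can be sketched as a span replacement over detected offsets. This is a minimal sketch that assumes entities carry BeginOffset and EndOffset keys, as Comprehend's detect_pii_entities output does. The function name redact is hypothetical, and overlapping spans are not handled.

```python
from typing import Any, Dict, List


def redact(text: str, entities: List[Dict[str, Any]], placeholder: str = "[REDACTED]") -> str:
    """Replace each [BeginOffset, EndOffset) span with the placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:ent["BeginOffset"]] + placeholder + text[ent["EndOffset"]:]
    return text
```

For Bedrock output, the begin and end fields from the prompt's JSON format would need to be mapped to the same keys first.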
Data Clean Room Pattern
┌─────────────────────────────────────────┐
│  Restricted Table (full metadata)       │
│  • has_pii, anonymized_path, raw paths  │
│  • Access: Security team only           │
│  • Lake Formation: strict SELECT grant  │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│  Public Table (anonymized metadata)     │
│  • classification, summary (redacted)   │
│  • No PII, no raw file paths            │
│  • Access: All analysts                 │
│  • Lake Formation: broad SELECT grant   │
└─────────────────────────────────────────┘
Encryption and Data Residency
- At rest: S3 Tables uses SSE-S3 encryption by default. All metadata is encrypted.
- In transit: All API calls use TLS 1.2+.
- Data residency: Both metadata (S3 Tables) and raw files (FSx for ONTAP) remain in the same AWS region. No cross-border data transfer occurs in the default architecture.
For detailed data sovereignty analysis, see the Architecture Document — Data Sovereignty section.
Audit Log Retention
- CloudTrail: Default 90-day event history. For long-term retention, create a Trail delivering to S3 (recommended: 1+ year for regulated industries)
- Lake Formation: Data access audit logs are recorded via CloudTrail
- OpenSearch: Access logs can be delivered to CloudWatch Logs
- Analysis: Use CloudTrail Lake (SQL queries) or Athena + S3 (cost-efficient) for audit analysis
For detailed operational monitoring setup, see the Operational Monitoring section in the architecture document.
Path Sensitivity Model
File paths can reveal sensitive context even when file contents are not exposed (e.g., /hr/layoffs/2026/ or /legal/mna/target-company/).
Recommended controls:
- Store raw_path only in the restricted metadata table
- Expose hashed_path or anonymized_path to general users
- Use path_classification: public / internal / restricted / confidential
- Apply Lake Formation grants to curated views, not the base table
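The hashed_path and path_classification columns above could be derived along these lines. This is a sketch under assumptions: the prefix rules in PATH_RULES are hypothetical examples, and the keyed hash (HMAC-SHA256) is one possible choice, not necessarily what the PoC uses.

```python
import hashlib
import hmac

# Hypothetical prefix rules; real rules would come from data owners.
PATH_RULES = [
    ("/hr/", "confidential"),
    ("/legal/", "restricted"),
    ("/public/", "public"),
]


def hashed_path(raw_path: str, secret_key: bytes) -> str:
    """Keyed hash of the path, so a guessed path cannot be confirmed without the key."""
    return hmac.new(secret_key, raw_path.encode("utf-8"), hashlib.sha256).hexdigest()


def path_classification(raw_path: str, default: str = "internal") -> str:
    """Return the classification of the first matching prefix, else the default."""
    for prefix, level in PATH_RULES:
        if raw_path.startswith(prefix):
            return level
    return default
```

A plain unkeyed hash would let anyone test a guessed path such as /hr/layoffs/2026/ against the catalog, which is why the sketch uses a secret key held by the restricted pipeline.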
Raw Data Access Boundary
This architecture governs metadata access through S3 Tables and Lake Formation. It does not automatically replace:
- ONTAP/NFS/SMB file-system permissions
- S3 Access Point resource policies
- IAM permissions for raw file reads
- Application-level authorization
- Downstream use of presigned URLs or copied files
Treat metadata governance and raw data governance as two linked but separate control planes. Both must be configured for end-to-end security.
S3 Access Point Identity Boundary
Each FSx for ONTAP S3 Access Point has an associated file-system identity (OntapFileSystemIdentity — UNIX UID/GID or Windows domain user). All file access through that AP is authorized as that identity.
For each access point, document:
- IAM principals allowed to use the access point
- Access point policy (allowed S3 actions)
- Associated UNIX or Windows file-system identity
- Allowed volume / prefix scope
- Whether the identity can access files beyond what metadata governance intends
- Audit evidence location
If the AI enrichment access point uses a broad UNIX identity (e.g., root or a service account with wide read access), metadata-level Lake Formation controls do not prevent raw file reads through that AP. Scope the AP identity to minimum required access.
See security/s3-access-point-identity-matrix.yaml for the template.
Permission Identity Strategy
For multiprotocol environments (NFS + SMB + S3 AP):
- Record discovery_protocol: nfs / smb / s3ap
- Record access_point_identity_type: unix / windows
- Record effective_reader_identity
- Record permission_source: nfs_mode / ntfs_acl / mixed
- Do not assume metadata visibility implies raw file readability
Retention and Deletion Semantics
This PoC uses metadata records to represent file discovery and enrichment state. For regulated workloads, define:
- Metadata retention period (how long to keep catalog records)
- Raw file retention period (governed by storage policy, not this catalog)
- Anonymized metadata retention period
- Deletion request workflow (who can request, who approves, how it's executed)
- Snapshot expiration impact on deletion (Iceberg time travel may expose deleted metadata until snapshots expire)
- Audit evidence retention (keep deletion evidence longer than the data itself)
Important: Iceberg time travel is useful for recovery, but it means deleted metadata may still be queryable during the snapshot retention window. Align snapshot expiration with your data deletion SLA.
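A quick way to check that alignment is to compare the worst-case exposure window with the SLA. The sketch below assumes a simple model: a snapshot becomes eligible for expiration once it is older than the retention period, and is physically removed at the next maintenance run. Both function names and the model itself are illustrative assumptions.

```python
from datetime import timedelta


def worst_case_exposure(snapshot_retention: timedelta, maintenance_interval: timedelta) -> timedelta:
    """Longest time deleted metadata may stay reachable through time travel."""
    return snapshot_retention + maintenance_interval


def meets_deletion_sla(
    snapshot_retention: timedelta,
    maintenance_interval: timedelta,
    deletion_sla: timedelta,
) -> bool:
    """True when the worst-case exposure window fits inside the deletion SLA."""
    return worst_case_exposure(snapshot_retention, maintenance_interval) <= deletion_sla
```

For example, a 7-day snapshot retention with daily maintenance fits a 30-day SLA, while a 30-day retention with daily maintenance does not.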
Snowflake-side retention: If redacted metadata is synced into Snowflake-managed tables, define Snowflake-side retention, Time Travel (default 1 day, up to 90 days), and Fail-safe (7 days, non-configurable) separately from Iceberg snapshot retention. Deletion from the Snowflake copy does not delete from the Iceberg source, and vice versa.
Approval Evidence Template (for Regulated Industries)
For organizations requiring formal access approval documentation:
Approval ID: <unique-id>
Data owner: <name/group>
Security owner: <name/group>
Platform owner: <name/group>
Allowed metadata columns: <columns>
Allowed raw file prefixes: <prefixes>
Allowed operations: metadata query only / raw file read / anonymized export
Review date: <date>
Expiration date: <date>
Evidence location: verification-evidence/<path>
Regulated Workload Readiness
For public sector, healthcare, financial services, and other regulated industries, validate the following before production deployment:
| Area | Requirement | Status in this PoC |
|---|---|---|
| Data residency | Metadata and raw files in same AWS Region | ✅ Single region (ap-northeast-1) |
| Encryption at rest | S3 Tables: SSE-S3; FSx: at-rest encryption | ✅ Default encryption |
| Encryption in transit | TLS 1.2+ for all API calls | ✅ AWS default |
| Raw data access boundary | File reads governed by S3 AP policy + ONTAP permissions | ✅ Documented |
| Metadata access boundary | Lake Formation table-level + CloudTrail audit | ✅ Verified |
| AI processing data flow | Content sent to Bedrock API, not stored by provider | ✅ Per AWS data protection policy |
| PII detection limitations | English (Comprehend) + Japanese (Claude) only | ⚠️ Other languages not covered |
| Human review workflow | Low-confidence queue defined | ✅ Design documented |
| Audit log retention | CloudTrail 90-day default; configure Trail for longer | ⚠️ Requires Trail setup |
| Deletion SLA | Define separately for metadata, raw files, and snapshots | ⚠️ Requires policy definition |
| Legal/compliance sign-off | Not in scope for this PoC | ❌ Required before production |
AI governance note: AI enrichment in this pattern is assistive metadata generation. It does not constitute authoritative regulatory classification. Final classification decisions, data handling approvals, and compliance certifications must be confirmed by data owners, security teams, legal counsel, and compliance officers.
Cross-Platform Access: The Current Reality
Fully Verified ✅
| Platform | Access Method | Status |
|---|---|---|
| Athena | Direct query via Glue federated catalog | ✅ Fully verified |
| Lambda/Python | PyIceberg SDK | ✅ Fully verified |
| EMR Spark | Glue Iceberg REST (EMR 7.13.0+) | ✅ Fully verified (SELECT, COUNT, time travel) |
| Snowflake | Glue Iceberg REST + VENDED_CREDENTIALS | ✅ Fully verified (CREATE TABLE, SELECT, COUNT, DESCRIBE, AUTO_REFRESH) |
| Snowflake | External Stage (FSx S3 AP) + TO_FILE + Cortex AI | ✅ Fully verified |
Expected / Requires Validation ⚠️
| Platform | Access Method | Status |
|---|---|---|
| EMR Trino | Glue Iceberg REST (EMR 7.13.0+) | ⚠️ Expected (same EMR SigV4 handling as Spark) |
| Redshift Spectrum | Same as Athena (Glue catalog) | ⚠️ Expected, not fully validated |
What Doesn't Work (Yet) ⚠️
| Platform | Tested method | Result | Tested | Status |
|---|---|---|---|---|
| Databricks SQL Warehouse | CREATE CONNECTION TYPE iceberg_rest to S3 Tables REST | CONNECTION_TYPE_NOT_SUPPORTED | 2026-05-31 | Observed limitation in this path |
| Databricks Spark cluster | Iceberg REST + SigV4 via spark.conf.set / cluster config | NO_SUCH_CATALOG_EXCEPTION (UC blocks external catalog registration) | 2026-06-01 | Confirmed: UC Foreign Catalog required |
| Databricks Delta Sharing | Delta Sharing server accessing S3 AP-backed storage | Sharing server uses same UC storage credentials; cannot bypass session policy | 2026-06-01 | Confirmed limitation (not a workaround for S3 AP) |
| Databricks NFS → UC Volume | NFS mount path as UC External Volume | Cloud storage URIs only (s3://, abfss://, gs://); NFS/FUSE paths not supported | 2026-06-01 | Confirmed limitation; internal feature request exists |
| Snowflake | External Iceberg Table with S3 Tables direct REST endpoint | Not a supported catalog type (use Glue REST instead) | 2026-05-31 | Use Glue REST + VENDED_CREDENTIALS (✅ verified) |
| Snowflake | CATALOG INTEGRATION with default ACCESS_DELEGATION_MODE | Defaults to EXTERNAL_VOLUME_CREDENTIALS which triggers ListObjectsV2 (rejected by S3 Tables) | 2026-06-02 | ✅ Resolved: set explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS |
| Snowflake | Lake Formation column-level via VENDED_CREDENTIALS | AllowFullTableExternalDataAccess=false blocks all VENDED_CREDENTIALS access | 2026-06-08 | Use Snowflake Horizon (Row Access Policy / Dynamic Masking) for column governance |
| Snowflake Open Catalog | Polaris as Iceberg catalog | Not tested | TBD | Strategic alternative |
Databricks: Three Integration Paths to Validate
Update note (2026-06-09): We revalidated the S3 Tables path after Databricks announced GA for Foreign Iceberg and credential vending (May 28, 2026). Glue Connection creation and credential configuration succeeded, but Unity Catalog External Location validation still failed because S3 Tables internal buckets reject standard S3 API validation (HeadBucket/ListBucket). The S3 Tables path remains blocked in this tested Databricks UC configuration. A new Databricks support case has been submitted.
For Databricks business users: The value is not only table access. The value is turning previously invisible NAS files into governed metadata assets that can be searched, explained, lineage-tracked, and consumed from Databricks SQL, AI/BI, and dashboards.
In this PoC, CREATE CONNECTION TYPE iceberg_rest to the S3 Tables REST endpoint returned CONNECTION_TYPE_NOT_SUPPORTED on Databricks SQL Warehouse (tested 2026-05-31). This does not mean Databricks lacks Iceberg REST support — Databricks provides Unity Catalog Iceberg REST endpoints and Foreign Iceberg capabilities that evolve rapidly.
Confirmed Limitations (2026-06-01)
| Path | Result | Confirmed by |
|---|---|---|
| Spark cluster + Iceberg REST (spark.conf.set / cluster config) | ❌ UC blocks external catalog registration | Databricks support + our testing |
| Delta Sharing via S3 Access Point | ❌ Sharing server uses same UC storage credentials | Databricks support |
| NFS mount path as UC External Volume | ❌ Cloud storage URIs only (s3://, abfss://, gs://) | Databricks support |
| DataSync → S3 → UC External Delta Table → Delta Sharing | ✅ Works (Delta format required) | Databricks support |
Delta Sharing note: Delta Sharing is not a workaround for the FSx S3 Access Point session policy limitation in our tested path. The sharing server uses the same UC storage credentials and cannot bypass the session policy that blocks S3 AP ARNs. Note that Databricks has announced first-class Iceberg format support in Delta Sharing (Jan 2026), enabling providers to share Iceberg tables via the Iceberg REST Catalog API. This broader capability is not contradicted by our finding — our limitation is specific to S3 AP-backed storage access through UC credentials, not Delta Sharing's format support in general.
NFS Volume note: UC External Volumes require cloud storage URIs. An internal feature request (AHA) exists for EFS/NFS access via UC. Until this is implemented, DataSync → S3 → UC External Location remains the only supported path.
📢 Databricks users: If S3 Tables access from Databricks is important for your workflow, the UC Foreign Catalog for S3 Tables feature is being tracked internally by Databricks (request DB-I-15824). Contact your Databricks account team to express interest and increase prioritization. Snowflake achieved full S3 Tables access via VENDED_CREDENTIALS in June 2026 — the same architectural pattern should be feasible for UC.
Immediate workaround for Databricks: Use DataSync → S3 → UC External Table to sync metadata into a standard S3 location accessible by Unity Catalog. This is not zero-copy for the synced metadata, but raw files remain on FSx for ONTAP.
Path 1: Spark cluster + Iceberg REST (SigV4)
Best for technical validation and batch processing. Two endpoint options:
Tested 2026-06-01: On Databricks with Unity Catalog enabled, external Iceberg catalogs cannot be registered via spark.conf.set or cluster Spark config. Unity Catalog controls catalog registration exclusively. Both Serverless (CONFIG_NOT_AVAILABLE) and All-Purpose clusters (NO_SUCH_CATALOG_EXCEPTION) fail. Unity Catalog Foreign Catalog (Path 2) is the required approach.
# Path 1a: Direct S3 Tables REST endpoint (used in this PoC)
spark.sql.catalog.s3tables.uri=https://s3tables.ap-northeast-1.amazonaws.com/iceberg
spark.sql.catalog.s3tables.rest.signing-name=s3tables
# Path 1b: AWS Glue Iceberg REST endpoint (recommended for production + Lake Formation)
spark.sql.catalog.s3tables.uri=https://glue.ap-northeast-1.amazonaws.com/iceberg
spark.sql.catalog.s3tables.rest.signing-name=glue
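Common config for both (the warehouse value below is the table bucket ARN form used with Path 1a; for the Glue endpoint in Path 1b, AWS examples use the <ACCOUNT>:s3tablescatalog/fsxn-metadata-catalog form, matching CATALOG_NAME in the Snowflake integration, so verify which form your engine expects):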
spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3tables.catalog-impl=org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.s3tables.warehouse=arn:aws:s3tables:ap-northeast-1:<ACCOUNT>:bucket/fsxn-metadata-catalog
spark.sql.catalog.s3tables.rest.sigv4-enabled=true
spark.sql.catalog.s3tables.rest.signing-region=ap-northeast-1
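The two endpoint variants differ only in URI and signing name, so the settings can be generated from one helper. This is a minimal sketch: the function name iceberg_rest_conf is hypothetical, and the warehouse value is left to the caller because its expected form may differ between the two endpoints.

```python
from typing import Dict

ENDPOINTS = {
    # Path 1a: direct S3 Tables REST endpoint
    "s3tables": "https://s3tables.{region}.amazonaws.com/iceberg",
    # Path 1b: AWS Glue Iceberg REST endpoint
    "glue": "https://glue.{region}.amazonaws.com/iceberg",
}


def iceberg_rest_conf(signing_name: str, region: str, warehouse: str,
                      catalog: str = "s3tables") -> Dict[str, str]:
    """Build the Spark catalog settings listed above for one endpoint variant."""
    if signing_name not in ENDPOINTS:
        raise ValueError(f"unknown signing name: {signing_name}")
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": ENDPOINTS[signing_name].format(region=region),
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": signing_name,
        f"{prefix}.rest.signing-region": region,
    }
```

The resulting dictionary can be passed to a SparkSession builder on engines that accept external catalog registration, such as EMR Spark. On Databricks with Unity Catalog enabled it is rejected, as noted above.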
Path 2: Unity Catalog Foreign Iceberg
Register external Iceberg tables into Unity Catalog if supported for the target catalog/storage path. Best for Databricks governance, lineage, and discovery. Verify refresh semantics and read/write limitations. Retested for S3 Tables on 2026-06-09: Glue Connection and credentials succeeded, but UC External Location validation failed because S3 Tables internal buckets reject standard S3 API validation. This path remains blocked for the tested S3 Tables configuration.
Documentation/version note: Databricks Iceberg capabilities are evolving rapidly. Earlier documentation and our initial validation showed limitations around Foreign Iceberg credential vending and automatic refresh behavior. After the May 2026 GA announcement, we revalidated the S3 Tables path on 2026-06-09. Credential configuration progressed further, but the tested path still failed at UC External Location validation against S3 Tables internal storage. Additionally, Databricks supports catalog federation with AWS Glue (Hive Metastore type), which can expose Glue-registered tables in UC. Whether a future Iceberg REST catalog federation path could bypass the S3 Tables internal bucket constraint is an open question.
Refresh semantics: If UC Foreign Iceberg works for S3 Tables via Glue REST, define refresh semantics explicitly. Our metadata catalog is append-only (new records added on file events). Analysts should know whether Databricks reads the latest Iceberg snapshot automatically or only after REFRESH FOREIGN TABLE. Without auto-refresh, Athena and Databricks may show temporarily different results until the next refresh cycle. Plan for a scheduled refresh job or event-driven trigger.
AWS reference for this path: AWS has published guidance on accessing S3 Iceberg tables from Databricks using the Glue Iceberg REST Catalog. This validates the architectural direction of B-4/B-5, though S3 Tables-specific compatibility requires separate validation.
Path 3: AWS Glue Catalog Federation with Databricks
AWS Glue can federate metadata from Databricks Unity Catalog for Iceberg tables. This is the reverse direction but useful for cross-platform governance patterns.
Federation Directionality
| Pattern | Direction | Primary governance | Best for |
|---|---|---|---|
| UC Foreign Catalog / Catalog Federation to Glue | Databricks reads AWS-managed metadata | Unity Catalog | Databricks users querying AWS Iceberg (S3 Tables) |
| AWS Glue federation to UC | AWS reads Databricks-managed metadata | Lake Formation / Glue | Athena/EMR/Redshift reading UC Iceberg/UniForm |
AWS reference: AWS has published guidance on accessing S3 Iceberg tables from Databricks using AWS Glue Iceberg REST Catalog, and on federating Databricks Unity Catalog data into AWS Glue Data Catalog. Both directions are documented.
Why Iceberg here (not Delta Lake)? This architecture uses Iceberg because S3 Tables is Iceberg-native, and the Iceberg REST endpoint enables multi-engine access (Athena, EMR, Snowflake). For Databricks-only environments, Delta Lake on S3 remains the natural choice. This pattern targets multi-platform scenarios.
Databricks UC Audit Logging for External Engines (Confirmed 2026-06-01)
External engine access via the UC Iceberg REST Catalog endpoint is fully auditable:
| Audit aspect | Confirmed behavior |
|---|---|
| Metadata requests (listNamespaces, listTables, loadTable) | ✅ Logged in system.access.audit under uniformIcebergRestCatalog |
| Vended credential issuance | ✅ Logged as loadTableCredentials / generateTemporaryTableCredential |
| Audit fields | user_identity, source_ip_address, user_agent, event_time, action_name, request_params |
| Distinguish external vs internal | ✅ service_name = 'uniformIcebergRestCatalog' (external) vs 'unityCatalog' (internal) |
Note: Databricks audit logs record credential issuance, not individual S3 file reads after credentials are vended. Complement with AWS CloudTrail + S3 access logging for file-level audit.
Databricks integration documentation:
- For UC Foreign Catalog validation steps, see databricks/uc-foreign-iceberg-validation.md
- For coexistence planning, see databricks/coexistence-roadmap.md
- For audit investigation, see databricks/audit-correlation-guide.md
Snowflake: S3 Tables via Glue REST + VENDED_CREDENTIALS ✅
Working Configuration (Verified 2026-06-05)
Snowflake can directly query S3 Tables Iceberg tables via the Glue Iceberg REST endpoint with VENDED_CREDENTIALS. Here's the complete working setup:
-- 1. Catalog Integration (CRITICAL: explicit ACCESS_DELEGATION_MODE)
CREATE OR REPLACE CATALOG INTEGRATION s3tables_glue_rest_int
CATALOG_SOURCE = ICEBERG_REST
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = 'metadata'
REST_CONFIG = (
CATALOG_URI = 'https://glue.ap-northeast-1.amazonaws.com/iceberg'
CATALOG_API_TYPE = AWS_GLUE
CATALOG_NAME = '<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog'
ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS -- MUST be explicit
)
REST_AUTHENTICATION = (
TYPE = SIGV4
SIGV4_IAM_ROLE = 'arn:aws:iam::<ACCOUNT_ID>:role/fsxn-snowflake-verification-role'
SIGV4_SIGNING_REGION = 'ap-northeast-1'
)
ENABLED = TRUE;
-- 2. Schema WITHOUT default EXTERNAL_VOLUME (critical)
CREATE SCHEMA FSXN_LAKEHOUSE.S3TABLES_VENDED;
USE SCHEMA FSXN_LAKEHOUSE.S3TABLES_VENDED;
-- 3. Table WITHOUT EXTERNAL_VOLUME parameter (critical)
CREATE ICEBERG TABLE s3tables_vended_creds_test
CATALOG = 's3tables_glue_rest_int'
CATALOG_TABLE_NAME = 'unstructured_files';
AWS prerequisites (must be completed before Snowflake configuration):
# Register S3 Tables resource with Lake Formation (--with-federation is REQUIRED)
aws lakeformation register-resource \
--resource-arn "arn:aws:s3tables:ap-northeast-1:<ACCOUNT_ID>:bucket/fsxn-metadata-catalog" \
--role-arn "arn:aws:iam::<ACCOUNT_ID>:role/S3TablesRoleForLakeFormation" \
--with-federation \
--region ap-northeast-1
# Grant SELECT + DESCRIBE to Snowflake's IAM role (table-level)
aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::<ACCOUNT_ID>:role/fsxn-snowflake-verification-role"}' \
--resource '{"Table":{"CatalogId":"<ACCOUNT_ID>:s3tablescatalog/fsxn-metadata-catalog","DatabaseName":"metadata","Name":"unstructured_files"}}' \
--permissions SELECT DESCRIBE \
--region ap-northeast-1
IAM role policy must include: glue:GetTable, glue:GetDatabase, glue:GetCatalog, lakeformation:GetDataAccess, s3tables:GetTableBucket, s3tables:GetTable, s3tables:GetNamespace, s3tables:GetTableData, s3tables:GetTableMetadataLocation.
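The action list above can be expressed as an IAM policy document. This is a sketch only: the helper name build_role_policy and the Sid are hypothetical, and the default "*" resource is a placeholder assumption that should be scoped down to the specific catalog and table bucket ARNs.

```python
import json
from typing import Any, Dict, Sequence

REQUIRED_ACTIONS = [
    "glue:GetTable", "glue:GetDatabase", "glue:GetCatalog",
    "lakeformation:GetDataAccess",
    "s3tables:GetTableBucket", "s3tables:GetTable", "s3tables:GetNamespace",
    "s3tables:GetTableData", "s3tables:GetTableMetadataLocation",
]


def build_role_policy(resources: Sequence[str] = ("*",)) -> Dict[str, Any]:
    """Return an IAM policy document allowing the read actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "SnowflakeS3TablesRead",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": list(REQUIRED_ACTIONS),
            "Resource": list(resources),  # "*" is a placeholder; scope this down
        }],
    }


if __name__ == "__main__":
    print(json.dumps(build_role_policy(), indent=2))
```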
Verify integration (expected DESCRIBE output after creation):
DESCRIBE CATALOG INTEGRATION s3tables_glue_rest_int;
| Property | Value |
|---|---|
| ENABLED | true |
| CATALOG_SOURCE | ICEBERG_REST |
| TABLE_FORMAT | ICEBERG |
| CATALOG_NAMESPACE | metadata |
| REST_CONFIG | {CATALOG_URI=https://glue.ap-northeast-1.amazonaws.com/iceberg, ...} |
| REST_AUTHENTICATION | {TYPE=SIGV4, SIGV4_IAM_ROLE=arn:aws:iam::<ACCOUNT_ID>:role/..., ...} |
| API_AWS_IAM_USER_ARN | arn:aws:iam::465774455528:user/<snowflake-user-id> |
| API_AWS_EXTERNAL_ID | <external-id-for-trust-policy> |
| REFRESH_INTERVAL_SECONDS | 30 |
Setup step: Copy API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID from this output into your IAM role's trust policy to allow Snowflake to assume the role.
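The trust policy this step refers to follows the usual cross-account pattern: the Snowflake IAM user is the principal, and the external ID is a condition on sts:AssumeRole. A minimal sketch, with the function name build_trust_policy as a hypothetical helper:

```python
from typing import Any, Dict


def build_trust_policy(snowflake_user_arn: str, external_id: str) -> Dict[str, Any]:
    """Trust policy that lets the Snowflake IAM user assume the role with the external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": snowflake_user_arn},  # value of API_AWS_IAM_USER_ARN
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": external_id}  # value of API_AWS_EXTERNAL_ID
            },
        }],
    }
```

The external ID condition limits role assumption to requests carrying the ID that Snowflake generated for this integration.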
Verified Capabilities (2026-06-08)
| Operation | Result | Performance |
|---|---|---|
| CREATE ICEBERG TABLE | ✅ | 6.5s |
| SELECT * LIMIT 5 | ✅ (5 rows) | 1.9s |
| COUNT(*) | ✅ (170 rows) | 66ms |
| DESCRIBE TABLE | ✅ (23 columns) | 69ms |
| ALTER ... SET AUTO_REFRESH = TRUE | ✅ | 131ms |
| SHOW ICEBERG TABLES | ✅ (UNMANAGED type) | 567ms |
| Time Travel | ✅ (available, snapshot-dependent) | — |
Screenshots (URL bar excluded, S3 Tables internal bucket masked):

COUNT(*) returns 170 rows in 66ms

DESCRIBE TABLE shows all 23 Iceberg columns

SHOW ICEBERG TABLES confirms UNMANAGED type with S3TABLES_GLUE_REST catalog

SELECT * LIMIT 5 returns actual file metadata from S3 Tables
AUTO_REFRESH + Time Travel verification:

AUTO_REFRESH verified: PyIceberg appended 1 record → Snowflake COUNT(*) automatically updated from 170 to 171 within 30 seconds

Time Travel: querying 20 minutes ago returns 170 (before the append), confirming snapshot history is accessible
About FILE_PATH: The FILE_PATH column shows the S3 path used during metadata ingestion (via FSx for ONTAP S3 Access Point). This is the path recorded in the Iceberg metadata catalog; it does not mean the files were copied to S3. The actual files remain on FSx for ONTAP and are accessible via NFS, SMB, or S3 Access Point depending on your application's protocol.
Key Insight: Why Previous Attempts Failed
ACCESS_DELEGATION_MODE defaults to EXTERNAL_VOLUME_CREDENTIALS when not explicitly specified. In this default mode, Snowflake validates storage access through the External Volume path, which triggers ListObjectsV2 against S3 Tables internal buckets — an operation that returns MethodNotAllowed.
With VENDED_CREDENTIALS explicit:
- Snowflake calls Glue REST loadTable
- Lake Formation (via GetTemporaryGlueTableCredentials) returns temporary storage credentials in the loadTable response config map
- Snowflake uses these credentials to access data files directly by exact path
- No ListObjectsV2 is required: Snowflake reads files by exact path from Iceberg metadata
Note: The Glue REST endpoint does not implement the standard Iceberg REST /credentials endpoint. Credential vending works through Lake Formation's proprietary mechanism embedded in the loadTable response. This is transparent to Snowflake when configured correctly.
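When debugging this path, it helps to check whether a loadTable response actually carries vended credentials. The sketch below assumes the response is already parsed into a dictionary and that credentials appear under the standard Iceberg S3 property names in its config map. The function name extract_vended_credentials is hypothetical.

```python
from typing import Any, Dict, Optional

# Standard Iceberg S3 credential property names.
CREDENTIAL_KEYS = ("s3.access-key-id", "s3.secret-access-key", "s3.session-token")


def extract_vended_credentials(load_table_response: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return the vended S3 credentials from the config map, or None if any are missing."""
    config = load_table_response.get("config") or {}
    if not all(config.get(key) for key in CREDENTIAL_KEYS):
        return None
    return {key: config[key] for key in CREDENTIAL_KEYS}
```

A None result corresponds to the failure mode recorded in the investigation history, where the credential keys were absent from the loadTable response.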
Governance Limitation: Lake Formation Column-Level (2026-06-08)
Lake Formation column-level filtering is NOT enforced via the VENDED_CREDENTIALS path:
- When AllowFullTableExternalDataAccess = false, the entire VENDED_CREDENTIALS path is blocked
- Explicit column/table-level grants + ExternalDataFilteringAllowList do not resolve this
- AllowFullTableExternalDataAccess = true is required for VENDED_CREDENTIALS to function
Technical context: AllowFullTableExternalDataAccess controls whether third-party engines that use Lake Formation credential vending can obtain data access credentials without session tags when the caller has full table access. When set to false, fine-grained column/row filtering is the intended enforcement mechanism, but for S3 Tables accessed via VENDED_CREDENTIALS, this currently results in complete access denial rather than filtered access. This may be a service-specific constraint of the S3 Tables federated catalog path, or it may require additional AllowExternalDataFiltering + ExternalDataFilteringAllowList configuration that was not effective in our testing. A feature request has been submitted to AWS.
Workaround: Use Snowflake Horizon for column-level governance:
-- Row Access Policy: restrict by sensitivity_level
CREATE OR REPLACE ROW ACCESS POLICY metadata_sensitivity_filter AS
(sensitivity_level VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN IS_ROLE_IN_SESSION('SECURITY_ADMIN') THEN TRUE
WHEN sensitivity_level IN ('public', 'internal') THEN TRUE
ELSE FALSE
END;
ALTER TABLE s3tables_vended_creds_test ADD ROW ACCESS POLICY
metadata_sensitivity_filter ON (sensitivity_level);
-- Dynamic Data Masking: hide embedding vectors from non-ML roles
CREATE OR REPLACE MASKING POLICY mask_embedding AS
(val BINARY) RETURNS BINARY ->
CASE
WHEN IS_ROLE_IN_SESSION('ML_ENGINEER') THEN val
ELSE NULL
END;
ALTER TABLE s3tables_vended_creds_test MODIFY COLUMN
embedding_vector SET MASKING POLICY mask_embedding;
Snowflake Iceberg Access Modes (Summary)
| Access mode | Best for | Status |
|---|---|---|
| Glue REST + VENDED_CREDENTIALS | S3 Tables direct query | ✅ VERIFIED |
| External Stage (FSx S3 AP) + TO_FILE | File AI analysis (Cortex COMPLETE) | ✅ VERIFIED |
| Metadata sync to Snowflake table | BI / Cortex Search / governance | Available |
| Object Store Catalog | Direct metadata file read | ❌ Blocked (S3 Tables internal bucket) |
| Snowflake Open Catalog (Polaris) | Alternative Iceberg catalog | Not tested |
📖 Investigation History (2026-05-31 to 2026-06-08)
2026-05-31: Tested S3 Tables direct REST endpoint as External Iceberg catalog → not a supported catalog type.
2026-06-01: Created CATALOG INTEGRATION using ICEBERG_REST + AWS_GLUE + VENDED_CREDENTIALS. DESCRIBE succeeded but CREATE ICEBERG TABLE failed with "Failed to retrieve credentials from the Catalog". Root cause identified: Glue REST does not implement /credentials endpoint (UnknownOperationException).
2026-06-02: AWS Support confirmed Lake Formation uses proprietary mechanism (GetTemporaryGlueTableCredentials) for credential vending, not standard Iceberg REST /credentials. Snowflake Support confirmed Error 004174 occurs when s3.access-key-id/secret/token absent from loadTable response.
2026-06-02: Tested Object Store catalog and EXTERNAL_VOLUME_CREDENTIALS mode — both blocked by S3 Tables internal bucket rejecting ListObjectsV2.
2026-06-03: Discovered register-resource --with-federation was missing. After setup, loadTable response included credentials. However, CREATE TABLE still failed at storage validation (ListObjectsV2).
2026-06-05: Snowflake Support identified the critical distinction: ACCESS_DELEGATION_MODE defaults to EXTERNAL_VOLUME_CREDENTIALS. Explicitly setting VENDED_CREDENTIALS + schema without External Volume + CREATE TABLE without External Volume parameter → SUCCESS. CREATE TABLE + SELECT both working.
2026-06-08: Additional testing confirmed COUNT(*), DESCRIBE, AUTO_REFRESH, SHOW ICEBERG TABLES all working. Lake Formation column-level filtering NOT enforced via this path (AllowFullTableExternalDataAccess=false blocks all access).
External Stage note: Snowflake External Stage against the FSx S3 Access Point alias was verified in this PoC (2026-05-31, ap-northeast-1). Update (2026-06-02): TO_FILE (Cortex COMPLETE multimodal) also verified working: Claude Sonnet 4.5 can directly read files from FSx for ONTAP via S3 AP-backed External Stage. See snowflake/external-stage-fsx-s3ap-validation.md for exact DDL and verified operations.
Snowflake Metadata Activation Pattern
If you sync only the metadata into Snowflake (not raw files), you preserve the zero-copy principle for actual data while enabling Snowflake-native use cases:
- Governed metadata analytics and executive dashboards
- File inventory and PII coverage reporting
- Cortex Search over redacted summaries (RAG on metadata)
- Snowflake Intelligence / Cortex Analyst style business Q&A
- Row Access Policies and Dynamic Masking on synced metadata
Horizon Catalog note: When metadata reaches Snowflake, Snowflake governance features such as Row Access Policies and Dynamic Masking can be applied to Snowflake-managed access paths. For external engine access via Iceberg REST, validate the exact Open Catalog / Horizon behavior for your target engine and security model.
Metadata sync best practice: Sync curated latest-record metadata, not the append-only base table, unless analysts explicitly need history. Preserve scan_run_id, change_type, and is_deleted for audit and reconciliation. Use MERGE INTO keyed by file_id or path_hash to make metadata activation idempotent. See snowflake/metadata-sync-example.sql for the full pattern.
Governance policy mapping: When syncing metadata into Snowflake, map AWS-side fields such as sensitivity_level, tenant_id, pii_status, and path_classification to Snowflake tags, masking policies, and row access policies. Track policy drift between Lake Formation and Snowflake governance. See snowflake/path-decision-guide.md for the full policy mapping.
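The latest-record selection and idempotent merge described above can be illustrated in plain Python. This is a sketch of the logic only, not the repository's sync job: it assumes scan_run_id is comparable and increases with each scan, and the sample rows and function names are hypothetical.

```python
from typing import Dict, Iterable


def latest_records(rows: Iterable[dict]) -> Dict[str, dict]:
    """Keep only the newest row per file_id, ordered by scan_run_id.

    Assumes scan_run_id increases monotonically across scans.
    Deleted files are kept with is_deleted=True so audits can see them.
    """
    latest: Dict[str, dict] = {}
    for row in rows:
        seen = latest.get(row["file_id"])
        if seen is None or row["scan_run_id"] > seen["scan_run_id"]:
            latest[row["file_id"]] = row
    return latest


def merge_into(target: Dict[str, dict], source: Dict[str, dict]) -> Dict[str, dict]:
    """Upsert keyed by file_id: update matched keys, insert unmatched ones."""
    merged = dict(target)
    merged.update(source)
    return merged


# Hypothetical append-only base table with history for two files.
base_table = [
    {"file_id": "f1", "scan_run_id": 1, "change_type": "created", "is_deleted": False},
    {"file_id": "f1", "scan_run_id": 2, "change_type": "modified", "is_deleted": False},
    {"file_id": "f2", "scan_run_id": 1, "change_type": "created", "is_deleted": False},
    {"file_id": "f2", "scan_run_id": 3, "change_type": "deleted", "is_deleted": True},
]

curated = latest_records(base_table)
first_sync = merge_into({}, curated)
second_sync = merge_into(first_sync, curated)  # re-running changes nothing
print(first_sync == second_sync)
```

Keying the merge on a stable identifier is what makes a retried or duplicated sync harmless: the second run overwrites each row with identical values instead of appending.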
Snowflake Cortex Search Activation Pattern
If redacted metadata and summaries are synced into Snowflake, Cortex Search can provide Snowflake-native enterprise search and RAG over metadata — without managing embeddings, infrastructure, or search quality tuning.
Why Cortex Search here:
- Business users can search approved metadata without operating a separate vector database
- RAG and enterprise search can run over redacted summaries already governed in Snowflake
- Search quality, embedding management, and index refresh are delegated to Snowflake-managed services
- This is best suited for Snowflake-first organizations that want business-facing discovery inside the AI Data Cloud
Use Cortex Search for:
- Executive metadata search (natural language queries over file inventory)
- File inventory Q&A (powered by LLM + retrieval)
- PII coverage reporting and compliance dashboards
- Governed search over redacted summaries
OpenSearch Serverless NextGen remains the AWS-native serving index for this PoC. Cortex Search is an optional Snowflake-native alternative for organizations that standardize on Snowflake for business discovery.
Role separation: S3 Tables / Iceberg remains the metadata source of truth. OpenSearch (AWS path) or Cortex Search (Snowflake path) are serving indexes for search UX. Choose based on your primary platform. Cortex Search operates over redacted summaries and metadata synced into Snowflake, not raw files, unless the customer explicitly chooses to copy/extract document content into Snowflake (which would break the zero-copy raw data principle).
Cortex Search scope: Cortex Search should operate on redacted metadata and summaries by default. If raw document content is extracted or copied into Snowflake for Cortex use cases, treat that as a separate data movement decision with its own governance, retention, and cost model.
Snowflake activation cost drivers: Snowflake activation introduces separate cost drivers from the AWS-native catalog: warehouse compute for metadata sync tasks and dashboards, Cortex Search service usage (based on corpus size and query volume), task/stream orchestration for refresh, and small metadata storage. These costs should be modeled separately from the AWS-native catalog cost (the $114/month estimate in Part 1 does not include Snowflake-side compute).
Retention alignment: Confirm Snowflake account edition, table type, and retention settings before promising deletion SLAs. Snowflake Time Travel (up to 90 days, depending on edition and table type) and Fail-safe (7 days for permanent tables) operate independently from Iceberg snapshot expiration. Snowflake-side deletion evidence should be retained separately from Iceberg snapshot expiration evidence.
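One way to reason about a deletion SLA is to compute the longest window in which a deleted record could still be recoverable on either side. The sketch below assumes that deletion reaches Snowflake after a sync lag, that the synced table is a permanent table (so Fail-safe applies), and that the input values are placeholders rather than measured settings from this PoC.

```python
def worst_case_recoverable_days(
    iceberg_snapshot_retention_days: int,
    snowflake_time_travel_days: int,
    snowflake_fail_safe_days: int = 7,
    sync_lag_days: int = 0,
) -> int:
    """Days after deletion during which the data may still be recoverable.

    Iceberg side: old snapshots keep the row until they expire.
    Snowflake side: the row survives the sync lag, then Time Travel,
    then Fail-safe. The SLA must cover the longer of the two windows.
    """
    iceberg_window = iceberg_snapshot_retention_days
    snowflake_window = (
        sync_lag_days + snowflake_time_travel_days + snowflake_fail_safe_days
    )
    return max(iceberg_window, snowflake_window)


# Hypothetical settings: 7-day snapshot retention, 1-day Time Travel.
print(worst_case_recoverable_days(7, 1))
```

With these placeholder values the Snowflake window (1 + 7 days) is longer than the Iceberg window, so shortening snapshot retention alone would not shorten the effective deletion horizon.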
Snowflake Metadata Product Contract
When activating metadata in Snowflake, expose a curated subset as the governed metadata product:
Recommended curated columns:
| Column | Purpose | Governance |
|---|---|---|
| file_id | Unique identifier | — |
| business_domain | Organizational grouping | Row access policy |
| file_type | File format | — |
| classification | AI-generated classification | — |
| sensitivity_level | Data sensitivity tier | Snowflake tag + masking policy |
| pii_status | PII detection result | Access policy / dashboard filter |
| redacted_summary | AI-generated (PII-free) summary | Cortex Search source column |
| owner_team | Business ownership | Business glossary / stewardship |
| last_seen_at | Last scan timestamp | — |
| data_quality_status | Enrichment quality flag | — |
Snowflake governance mapping:
- sensitivity_level → Snowflake tag + masking policy
- tenant_id / business_domain → row access policy
- pii_status → access policy / dashboard filter
- redacted_summary → Cortex Search source column
- owner_team → business glossary / stewardship workflow
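The mapping above can double as a drift check: compare the control expected for each column with what is actually applied on the Snowflake side. This is a hedged sketch; the `find_policy_drift` helper and the `applied` snapshot are hypothetical, and how the applied controls are collected from Snowflake is left open.

```python
# Expected control per column, taken from the governance mapping list.
EXPECTED_CONTROLS = {
    "sensitivity_level": "tag + masking policy",
    "tenant_id": "row access policy",
    "business_domain": "row access policy",
    "pii_status": "access policy / dashboard filter",
    "redacted_summary": "Cortex Search source column",
    "owner_team": "business glossary / stewardship workflow",
}


def find_policy_drift(expected: dict, applied: dict) -> dict:
    """Return columns whose applied control differs from the expected one.

    A column missing from `applied` is reported with actual=None.
    """
    drift = {}
    for column, control in expected.items():
        actual = applied.get(column)
        if actual != control:
            drift[column] = {"expected": control, "actual": actual}
    return drift


# Hypothetical snapshot of what is currently applied in Snowflake.
applied = dict(EXPECTED_CONTROLS)
applied["sensitivity_level"] = "tag only"  # masking policy was dropped
del applied["tenant_id"]                   # row access policy never attached

for column, detail in find_policy_drift(EXPECTED_CONTROLS, applied).items():
    print(column, detail)
```

Running a check like this on a schedule gives a concrete artifact for the policy drift tracking between Lake Formation and Snowflake governance.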
Databricks Metadata Activation Pattern
If UC Foreign Catalog is not yet validated for your S3 Tables path, sync only the redacted metadata into a UC-managed Delta table. This preserves the zero-copy principle for raw files while enabling Databricks-native use cases:
- Databricks SQL dashboards and executive reporting
- AI/BI Genie over curated metadata (natural language queries)
- UC lineage and audit on metadata usage
- ML feature generation from file metadata
- Operational reporting on PII coverage and enrichment backlog
Raw files remain on FSx for ONTAP. Only the small metadata table (~MB scale for 100K files) is synced.
This is analogous to the Snowflake metadata activation pattern: it copies only curated metadata, not the original unstructured files. Both patterns preserve the zero-copy principle for raw data.
Databricks Raw File Access Decision:
| Requirement | Recommended path |
|---|---|
| Governed metadata analytics only | UC Foreign Catalog (if validated) or sync metadata to UC Delta |
| Raw file processing in Databricks | DataSync → S3 → UC External Volume |
| Zero-copy raw file access from Databricks | Not supported in validated paths (NFS mount works but without UC governance) |
| Business discovery / BI | Sync redacted metadata to UC Delta table |
If metadata is synced into Databricks for BI, include Databricks SQL / Jobs compute cost in the activation model. This does not affect raw-file zero-copy storage, but it is part of the business-facing analytics cost.
Other Lakehouse Engines to Validate
Beyond Databricks and Snowflake, the most natural validation targets for this metadata catalog are:
| Engine | Access path | Likely fit | Validation priority |
|---|---|---|---|
| Trino / Starburst | Glue Iceberg REST or S3 Tables REST | Federated SQL, ad hoc query | High |
| EMR Spark | Glue Iceberg REST (native since EMR 7.5.0; 7.13.0+ required in this PoC) | Bulk backfill, batch enrichment | High |
| Redshift Spectrum | Glue catalog (external schema) | DWH integration, BI | Medium |
| Dremio | Glue catalog or Iceberg REST | Query acceleration, BI | Medium |
| StarRocks / Doris | Glue Iceberg REST | Low-latency serving queries | Medium |
| Apache Flink | Glue Iceberg REST | Streaming metadata updates | Low |
| dbt (via Athena) | dbt-athena + Iceberg materialization | Analytics engineering, governed marts | Medium |
| Apache NiFi | Iceberg REST or Polaris | Event-driven ingestion | Low |
These engines should be validated against:
- S3 Tables direct REST vs AWS Glue Iceberg REST
- Read vs write capability
- Lake Formation behavior (credential vending, column/row filtering)
- Snapshot freshness after external writes
- Latest-record view compatibility
- Case-sensitivity and lowercase naming requirements
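For the case-sensitivity item in this checklist, a pre-flight check on table and column names can catch problems before an engine is even connected. The sketch below assumes a conservative rule (lowercase letters, digits, and underscores, starting with a letter or digit); that pattern is an assumption chosen for illustration, so confirm the exact naming rules for each target engine.

```python
import re

# Assumed conservative rule, not an official specification:
# lowercase letters, digits, underscores; must start with a letter or digit.
SAFE_IDENTIFIER = re.compile(r"[a-z0-9][a-z0-9_]*")


def unsafe_identifiers(names):
    """Return the names that would need renaming under the assumed rule."""
    return [name for name in names if not SAFE_IDENTIFIER.fullmatch(name)]


# Hypothetical column list mixing compliant and non-compliant names.
columns = ["file_id", "sensitivity_level", "PII_Status", "owner-team"]
print(unsafe_identifiers(columns))
```

Normalizing names once at the catalog layer avoids per-engine quoting workarounds later, since engines differ in how they fold or preserve identifier case.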
Key finding from validation (2026-06-08): AWS Glue Iceberg REST supports SigV4-authenticated catalog access. Lake Formation credential vending works through a proprietary mechanism (GetTemporaryGlueTableCredentials). Snowflake requires explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS; the default mode fails. Engines that can sign requests with their own IAM credentials (EMR Spark ✅ verified, Trino on EMR expected, PyIceberg ✅ verified) work out of the box. Snowflake also works when configured correctly (✅ verified 2026-06-05). EMR requirement: 7.13.0+ (7.5.0 has a credential resolution bug). Governance note: Lake Formation column-level filtering is NOT enforced via the VENDED_CREDENTIALS path for Snowflake.
Trino note: AWS has published guidance on querying S3 Tables from Trino using the Iceberg REST endpoint. Trino's Iceberg connector supports REST catalogs natively, making it one of the most straightforward third-party validation targets.
EMR Spark note: For large-scale backfill or re-enrichment (100K+ files), Spark on EMR Serverless or EMR on EC2 can be used as an alternative to Lambda/Fargate. Use Glue Iceberg REST for centralized metadata access with Lake Formation governance. Verified (2026-06-02): EMR Serverless Spark 7.13.0 successfully reads S3 Tables metadata via Glue Iceberg REST — SHOW NAMESPACES, SHOW TABLES, SELECT, COUNT, and snapshot history all work. Requires EMR 7.13.0+ (7.5.0 has a credential resolution bug for S3 Tables warehouse format).
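For engines that sign their own requests, the Glue Iceberg REST connection comes down to a handful of catalog properties. The sketch below builds a property set in the shape PyIceberg's REST catalog accepts for SigV4 signing; the helper name, region, account ID, and table bucket name are placeholders, and the exact configuration used in this PoC may differ.

```python
def glue_rest_catalog_properties(region: str, account_id: str, table_bucket: str) -> dict:
    """Build REST catalog properties for the Glue Iceberg REST endpoint.

    Requests are signed with SigV4 under the 'glue' service name, so the
    calling engine uses its own IAM credentials.
    """
    return {
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }


# Placeholder values for illustration only.
props = glue_rest_catalog_properties("ap-northeast-1", "111122223333", "example-table-bucket")
for key, value in props.items():
    print(f"{key} = {value}")

# With pyiceberg installed and AWS credentials available, these would be used as:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("s3tables", **props)
#   print(catalog.list_namespaces())
```

Keeping the properties in one function makes it easier to reuse the same endpoint and warehouse values when validating other REST-capable engines.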
Redshift note: Validate separately from Athena — external schema setup, Glue statistics, Lake Formation permissions, and query latency against latest-record views may differ.
For the full compatibility matrix, see lakehouse-tools/tool-compatibility-matrix.yaml.
Catalog Authority Rule
For each Iceberg table, define exactly one authoritative catalog for metadata pointer and commit coordination. Do not operate S3 Tables, Polaris, Gravitino, Nessie, and Glue as independent writable catalogs for the same table unless the integration explicitly supports federation without dual writes.
          ┌──────────────────────┐
          │ Authoritative Catalog│
          │   (ONE per table)    │
          │ • S3 Tables + Glue   │
          │   (this PoC)         │
          └──────────┬───────────┘
                     │
    ┌────────────────┼────────────────┐
    │                │                │
    ▼                ▼                ▼
Read-only        Read-only        Read-only
consumers        consumers        consumers
(Trino,          (Databricks,     (Snowflake,
Dremio,          UC Foreign       Cortex,
StarRocks)       Catalog)         Open Catalog)
Split-brain warning: If two catalogs independently write to the same Iceberg table, snapshot pointers can diverge, causing data loss or corruption. Federation (one writer, many readers) is safe. Dual-write is not.
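The one-writer rule is easy to check mechanically if catalog registrations are kept in an inventory. The following sketch flags any table that does not have exactly one writable catalog; the inventory format, catalog labels, and table names are hypothetical and are not part of the PoC repository.

```python
def authority_violations(registrations):
    """Return tables that do not have exactly one writable catalog.

    `registrations` is an iterable of (table, catalog, mode) tuples,
    where mode is "write" or "read".
    """
    writers = {}
    for table, catalog, mode in registrations:
        table_writers = writers.setdefault(table, set())
        if mode == "write":
            table_writers.add(catalog)
    return {
        table: sorted(catalogs)
        for table, catalogs in writers.items()
        if len(catalogs) != 1
    }


# Hypothetical inventory: file_metadata is safe, audit_log has two writers.
inventory = [
    ("file_metadata", "s3tables-glue", "write"),
    ("file_metadata", "snowflake", "read"),
    ("file_metadata", "trino", "read"),
    ("audit_log", "s3tables-glue", "write"),
    ("audit_log", "polaris", "write"),
]
print(authority_violations(inventory))
```

A table with zero writers is reported as well (with an empty list), since it has no authoritative catalog to coordinate commits.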
The Bigger Picture
            S3 Tables (Iceberg)
                    │
       ┌────────────┼────────────┐
       │            │            │
       ▼            ▼            ▼
  Athena ✅     Databricks ❌   Snowflake ✅
  EMR Spark ✅  (UC Foreign     (Glue REST +
  PyIceberg ✅  path still      VENDED_CREDENTIALS
                blocked in      verified 2026-06-05)
                tested config)
Databricks integration summary (confirmed 2026-06-01):
- Direct S3 AP access: ❌ (UC session policy)
- NFS mount → UC Volume: ❌ (cloud URI only)
- Delta Sharing via S3 AP: ❌ (same credentials)
- DataSync → S3 → UC: ✅ (supported workaround, not zero-copy for synced data)
- UC Foreign Catalog / Foreign Iceberg via Glue: ❌ Retested 2026-06-09; Glue Connection and credentials succeeded, but UC External Location validation failed against S3 Tables internal storage. Support case submitted.
For capability-level details such as read, write, time travel, metadata tables, and governance behavior, see verification-evidence/cross-platform-compatibility.yaml.
This gap is expected to be temporary. S3 Tables is relatively new (GA Dec 2024), and cross-platform federation is actively being developed. Feature requests have been filed with both platforms. The timeline for native S3 Tables support is unknown, but the Iceberg ecosystem is converging rapidly: Unity Catalog 2.0's native Iceberg support and Snowflake's Open Catalog (Polaris) both point toward broader interoperability.
Catalog Decision Guide
In the Iceberg world, the catalog is the system of record for table metadata pointers and atomic operations. Choose based on your primary platform:
| Primary platform | Recommended catalog | Notes |
|---|---|---|
| AWS-first / Athena-first | S3 Tables + Glue/Lake Formation | Used in this PoC |
| Databricks-first | Unity Catalog Managed/Foreign Iceberg | Best for UC governance, lineage, discovery |
| Snowflake-first | Snowflake Open Catalog (Polaris) | Best for Snowflake-governed Iceberg interoperability; validate external engine governance behavior |
| Neutral / OSS-first | Apache Polaris or other REST catalog | Maximum portability |
Dual catalog warning: Avoid running two authoritative catalogs for the same Iceberg table. Use Snowflake Open Catalog / Polaris when Snowflake or a neutral REST catalog should be authoritative. Use S3 Tables when AWS-native Athena / Lake Formation / Glue governance is authoritative. If both platforms need access, use federation (one authoritative catalog, others read via REST).
When to Consider Snowflake Open Catalog / Polaris
Use S3 Tables + Glue/Lake Formation when AWS-native governance is authoritative (this PoC).
Consider Snowflake Open Catalog / Polaris when:
- Snowflake should be the primary governance and interoperability plane
- Multiple engines need Iceberg REST access through a neutral catalog
- Snowflake-managed Iceberg or Snowflake-first AI/Data Cloud workflows are the center of gravity
- You want managed Polaris instead of operating your own REST catalog
This would be a different authoritative-catalog design from the current PoC and should not be mixed as a second writer for the same table.
Databricks-first note: For organizations standardizing on Databricks, consider whether the metadata catalog itself should be managed in Unity Catalog as Managed Iceberg or Delta + UniForm, then exposed to AWS engines through Glue federation to UC or the UC Iceberg REST endpoint. Use S3 Tables when AWS-native Athena/Lake Formation is the primary governance path. The choice depends on which governance plane (UC or Lake Formation) is authoritative for your organization.
Format Decision for Databricks Environments
| Option | Best for | Tradeoff |
|---|---|---|
| S3 Tables Iceberg | AWS-first Athena/LF governance | UC integration pending (Foreign Catalog validation) |
| UC Managed / Foreign Iceberg | Databricks-first open format governance | Validate current feature availability, region support, and limitations |
| Delta + UniForm | Databricks-native pipelines + Iceberg read compatibility | Iceberg metadata generated asynchronously; non-Databricks writes constrained |
| Metadata sync to Delta | BI activation in Databricks SQL | Metadata copy, but raw files remain zero-copy |
Summary: What We Built
| Layer | Technology | Status |
|---|---|---|
| Storage | FSx for ONTAP (files) + S3 Tables (metadata) | ✅ Verified |
| AI | Bedrock Claude Vision + Titan Embeddings V2 | ✅ Verified |
| Search | OpenSearch Serverless NextGen (scale-to-zero) | ✅ Verified |
| Governance | Lake Formation (table-level) + CloudTrail | ✅ Verified |
| PII | Comprehend (EN) + Bedrock Claude (JA) | ✅ Verified |
| Cross-platform | Athena ✅, EMR Spark ✅, PyIceberg ✅, Snowflake ✅, Databricks ⚠️ | Mostly verified |
The Numbers
- 42 seconds: Full demo execution time
- $0.07: Total demo cost
- Near $0 idle compute/search cost: Persistent metadata, logs, and audit trails may still incur small charges
- $114/month: Projected cost at 100K files, 1000 changes/day
- 95%: Storage cost reduction vs S3 full copy
- 0.95: AI classification confidence (invoice detection)
- 7/7: PII entities detected and redacted
For regulated workloads, align Iceberg snapshot retention with deletion SLAs and audit evidence retention.
What's Next for This Project
- Monitor support cases: Databricks UC Foreign Catalog for S3 Tables — timeline unknown
- Production hardening: SQS batching, DLQ alerting, reconciliation jobs
- Multi-language PII: Extend beyond EN/JA to other languages
- Cost optimization: Provisioned Throughput for high-volume Bedrock usage
- Production semantics: File identity, latest-record views, index reconciliation, and snapshot retention alignment
- ONTAP production hardening: S3 Access Point identity matrix, FPolicy event filtering, SnapMirror catalog rebinding, and FSx performance dashboard
- Snowflake governance: Implement Horizon Row Access Policies and Dynamic Data Masking for column-level protection (since Lake Formation column-level is not enforced via VENDED_CREDENTIALS)
Get Involved
- ⭐ Star the repo if this was useful
- 🐛 Open an Issue for questions or suggestions
- 🍴 Fork and adapt for your own unstructured data catalog
This concludes the 3-part series. All code is at github.com/Yoshiki0705/fsxn-lakehouse-integrations. Questions? Open a GitHub Issue.
Governance disclaimer: This article provides governance guidance and architectural patterns. It does not substitute for legal or compliance judgment. Final regulatory determinations should be confirmed with legal and compliance teams.