Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on Jun 8 • Edited on Jun 15

From Hours to Seconds: An AI-Powered Metadata Catalog for Unstructured Data on FSx for ONTAP

#aws #iceberg #datalake #amazonfsxfornetappontap

What Works Now vs What Requires Validation

This article separates verified AWS-native capabilities from cross-platform paths that still require validation. The core pattern — keeping raw files on FSx for ONTAP and cataloging only metadata in S3 Tables — is verified. Databricks paths are still evolving. Snowflake Glue REST + VENDED_CREDENTIALS and External Stage paths are verified in this PoC, with governance limitations noted below. Validate all cross-platform paths in your own environment before production use.

Component	Status	Notes
AWS Native PoC (Athena + S3 Tables + Bedrock + OpenSearch + Lake Formation)	✅ Verified	Full end-to-end in 42 seconds
Glue Iceberg REST endpoint access	✅ Verified	Both S3 Tables REST and Glue REST confirmed
Lake Formation table-level governance	✅ Verified	Grant/revoke/audit working
Lake Formation column-level exclusion	⚠️ Observed limitation	Failed on tested federated catalog path
Databricks SQL Warehouse direct	⚠️ Observed limitation	`iceberg_rest` connection type not supported
Databricks Spark + Iceberg REST	❌ Blocked by UC	spark.conf.set and cluster config both fail; UC Foreign Catalog required
Databricks UC Foreign Catalog	❌ Still blocked	Retested post-Foreign Iceberg GA (2026-06-09): Glue Connection ✅, Credentials ✅, but External Location fails — S3 Tables internal bucket rejects standard S3 API validation. No bypass available.
Databricks Delta Sharing via S3 AP	❌ Confirmed	Sharing server uses same UC credentials; not a workaround for S3 AP session policy
Databricks NFS → UC Volume	❌ Confirmed	Cloud storage URIs only; internal feature request exists
Databricks UC audit logging	✅ Confirmed	External engine access fully logged
Snowflake via Glue REST (VENDED_CREDENTIALS)	✅ Verified	Explicit `ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS`; CREATE TABLE + SELECT + COUNT + AUTO_REFRESH all working (2026-06-05)
Snowflake External Stage (FSx S3 AP)	✅ Verified	LIST, SELECT/COPY, and TO_FILE + Cortex AI all verified

Important distinction: This pattern does not use FSx for ONTAP S3 Access Points as an Iceberg warehouse. Raw files stay on FSx for ONTAP, while only the metadata catalog is written to S3 Tables. Direct Iceberg table writes to FSx for ONTAP S3 Access Points are tracked separately as a known limitation because Iceberg commit behavior and S3FileIO compatibility require additional validation.

This is an Iceberg Adoption Pattern, Not a Raw-Data Migration

This pattern does not convert the original unstructured files into Iceberg table data. Instead, it adopts Iceberg for the metadata layer only.

Scope	What happens
Data files	Not migrated. Raw files remain on FSx for ONTAP.
Metadata table	Newly created as an Iceberg table on S3 Tables.
Processing jobs	Metadata scan and AI enrichment jobs write append-only metadata.
Consumers	Athena, EMR, Snowflake, Databricks, and BI/search tools consume curated metadata views.

Storage Boundary: What Moves and What Doesn't

FSx for ONTAP S3 Access Point:
  ✅ Raw file READ path only (AI enrichment input)
  ❌ NOT an Iceberg warehouse
  ❌ NOT a table commit target
  ❌ NOT bulk-copied to S3

S3 Tables:
  ✅ Iceberg METADATA table (file catalog)
  ✅ Metadata source of truth
  ✅ Query and governance target

Data movement disclosure (for regulated environments): Raw files are NOT bulk-copied to S3. However, during AI enrichment, selected file content is temporarily read via the S3 Access Point and sent to Amazon Bedrock APIs for classification/embedding. Per AWS Bedrock data protection policy, model providers have no access to customer prompts or completions. Extracted/redacted metadata and embeddings are written to S3 Tables, OpenSearch, and optionally to Snowflake or Databricks depending on the activation path. Define your data flow boundary documentation before regulated-workload deployment.

The Problem: Most Enterprise Unstructured Data is Difficult to Discover and Govern

Most organizations store terabytes of unstructured data — PDFs, images, CAD files, sensor logs — on network-attached storage. This data is:

Undiscoverable: "Where is that invoice from last quarter?" requires manual searching or asking colleagues
Governed at the file-system layer, but not classified or searchable from analytics and AI workflows
Audit trails may exist at the file-system layer, but they are often not unified with analytics and AI query activity

Think of this as unstructured-data modernization: inventory first, classify selectively, govern metadata, and activate only what is needed — without bulk-copying the raw files.

Business Outcomes (Beyond Technical Metrics)

This pattern is not only about faster file search. It is about:

Reducing dataset discovery lead time for AI projects (days → hours)
Improving PII visibility across the organization (unknown → 95%+ coverage target)
Lowering duplicate storage cost ($230-256/month eliminated for 10TB)
Creating governed metadata products for analytics and AI teams
Enabling AI-readiness without raw-data copy or migration
Activating governed metadata in Snowflake AI Data Cloud for Cortex Search, semantic Q&A, executive dashboards, and business-facing file discovery

The traditional solution? Copy everything to S3 and build a catalog. But at 10TB, that's ~$230-256/month just for the copy — plus sync pipelines, duplicate governance, and data drift.

The Solution: Hot Metadata × Cold Data

What if we could catalog every file without moving it?

┌─────────────────────────────────────────────────────────┐
│  HOT: Metadata (Apache Iceberg on S3 Tables)            │
│  • File path, type, size, timestamps                    │
│  • AI classification + confidence score                 │
│  • Vector embedding (1024-dim, similarity search)       │
│  • PII detection flag                                   │
│  • Cost: ~$5-15/month for 100K files                    │
└────────────────────────┬────────────────────────────────┘
                         │ file_path reference
┌────────────────────────▼────────────────────────────────┐
│  COLD: Actual Files (FSx for ONTAP)                     │
│  • PDF, images, CAD, video, audio, logs                 │
│  • Deduplication (50-70% storage savings typical*)      │
│  • NFS/SMB (existing workflows) + S3 AP (AI/analytics)  │
│  • No bulk raw-data copy required                       │
└─────────────────────────────────────────────────────────┘

Key insight: Keep the data where it is. Move only the metadata into a queryable format.

Architecture

FSx for ONTAP ──S3 Access Point──→ AI Enrichment (Bedrock)
       │                                    │
       │                                    ▼
       │                          S3 Tables (Iceberg)
       │                                    │
       │                                    ▼
       │                          ┌──────────────────┐
       │                          │ Query Engines    │
       │                          │ • Athena (SQL)   │
       │                          │ • OpenSearch     │
       │                          │   (vector kNN)   │
       │                          │ • Lake Formation │
       │                          │   (governance)   │
       │                          └──────────────────┘
       │
       └──NFS/SMB──→ Existing applications (unchanged)

Observability (production add-on):

       ┌──────────────────────────────────────┐
       │  • CloudWatch Metrics + Alarms       │
       │  • CloudWatch Logs (Lambda/SQS)      │
       │  • CloudTrail (governance audit)     │
       │  • OpenSearch Dashboards (search UX) │
       │  • FSx metrics (throughput, IOPS,    │
       │    latency, capacity pool reads)     │
       └──────────────────────────────────────┘

Components:

Component	Role	Cost
FSx for ONTAP S3 Access Point	Read files for AI processing (no copy)	Included with FSx
S3 Tables	AWS managed Apache Iceberg table service (auto-compaction, REST endpoint)	~$5/month metadata
Bedrock Claude Vision	Image classification	~$0.01/file in this demo
Titan Embeddings V2	1024-dim vectors for similarity search	$0.00002/1K input tokens
OpenSearch Serverless NextGen	kNN vector search (scale-to-zero)	$0 idle compute when inactive
Lake Formation	Metadata access governance	No additional Lake Formation charge

S3 Tables Iceberg REST endpoint: https://s3tables.<region>.amazonaws.com/iceberg
Check S3 Tables availability for regional support before deployment.

Deduplication ratio is a general ONTAP range. Actual savings depend on data characteristics and were not measured in this PoC.

PoC Results (Verified 2026-05-31)

We built and verified this end-to-end in a single day. Here's what we measured:

S3 Tables Access Paths: Which Endpoint Should You Use?

Access path	Best for	Governance path	Verified
S3 Tables Iceberg REST (`s3tables.<region>.amazonaws.com/iceberg`)	Direct Iceberg client / simple PoC	IAM + S3 Tables permissions	✅
AWS Glue Iceberg REST (`glue.<region>.amazonaws.com/iceberg`)	Production analytics integration	IAM + Lake Formation	✅
Athena via Glue federated catalog	SQL analytics	Lake Formation + Athena	✅
PyIceberg local client	Lightweight validation	IAM/LF depending on endpoint	✅

For production workloads with centralized governance, the AWS Glue Iceberg REST endpoint is recommended over the S3 Tables direct endpoint. See AWS docs.

Catalog authority rule: S3 Tables + Glue is the authoritative catalog for this metadata table in this PoC. Other engines should consume the table through the authoritative catalog or a controlled metadata activation path. Do not configure multiple writable catalogs for the same Iceberg table — dual-write causes split-brain and potential data corruption.

Athena Iceberg behavior depends on Athena engine version, Iceberg version, Glue/Lake Formation integration, and table maintenance state. Validate DDL/DML requirements separately before using this as a write-heavy production catalog.

Verification details are recorded in evidence-record.yaml and cross-platform-compatibility.yaml.

Before vs After

Metric	Before	After	Improvement
File discovery time	Minutes-hours	< 2 seconds	100x+ at scale
AI classification	Manual	Automatic (6 sec/file)	Fully automated
Storage cost (10TB)	~$250/month (S3 copy)	$5-15/month (metadata only)	95% reduction
Metadata query governance	Not applicable	100% in this PoC	Complete for metadata queries
Idle compute/search cost	N/A	Near $0 when inactive	Persistent metadata/logs may still incur small charges

Search Time Scaling (Measured + Projected)

Files	ListObjectsV2	Athena SQL	Speedup
40	892 ms	3.0 sec	0.3x
1,000	22.3 sec	1.8 sec	12x
10,000	3.7 min	1.8 sec	124x
100,000	37.2 min	1.8 sec	1,239x
1,000,000	371.7 min	1.8 sec	12,389x

At 40 files, ListObjectsV2 is faster — Athena has cold start overhead. Athena query time does not scale linearly with the number of files on FSx because it queries the Iceberg metadata table instead of listing the raw file namespace. In this controlled demo, the query stayed around ~1.8 seconds for projected file counts, but production latency depends on Iceberg metadata size, manifest count, predicate selectivity, Athena cold start, and table maintenance state.

Projection method: ListObjectsV2 latency was extrapolated linearly from the measured 40-file scan. This is intentionally conservative for demonstrating namespace-scan behavior, but it is not a service benchmark.

The 42-Second Demo

Our complete demo runs all 8 steps in 42 seconds:

Step 1: Before/After search comparison     ✅ (ListObjectsV2 vs Athena)
Step 2: Infrastructure deploy              ✅ (CloudFormation, skippable)
Step 3: Metadata scan (40 files)           ✅ (3 seconds)
Step 4: AI enrichment (Bedrock Vision)     ✅ (invoice → 0.95 confidence)
Step 5: Athena query + Time Travel         ✅ (< 2 seconds)
Step 6: Vector similarity search           ✅ (kNN score 0.67)
Step 7: PII detection + anonymization      ✅ (7/7 entities, all redacted)
Step 8: Cost & ROI analysis                ✅ ($0.07 total demo cost)

Total demo cost: $0.07. After the demo, the compute/search components can scale to zero. If you retain S3 Tables metadata, logs, or audit trails, small storage/logging charges may still apply.

AI Classification Results

File	Classification	Confidence
invoice_sample.png	Invoice	0.95
product_inspection.png	Pie Chart	1.0
sensor_dashboard.png	IoT Sensor Dashboard	0.9

In this demo, Bedrock Claude Vision classified sample images at roughly $0.01/file with sub-10-second latency. Production cost and latency depend on image size, prompt length, model version, and retry behavior.

Vector Similarity Search

Query: "find invoice or payment documents"
→ invoice_sample.png (score: 0.6749)

OpenSearch Serverless with scale-to-zero capability (GA May 2026) provides kNN search — no minimum cost when idle. Cold start is ~10-30 seconds, warm queries are ~54ms.

Verified in this PoC environment on 2026-05-31. Check the latest OpenSearch Serverless documentation and regional availability before deployment.

Governance: Lake Formation Access Control

Step 1: Authorized query    → ✅ SUCCEEDED (3 rows)
Step 2: Revoke SELECT       → 🔒 BLOCKED (access denied)
Step 3: Restore SELECT      → ✅ SUCCEEDED
Step 4: CloudTrail audit    → All queries logged with user identity

Metadata queries are governed and audited. Raw file access remains governed separately by FSx file-system permissions, S3 Access Point policies, and application access paths.

Cost Analysis

This Demo

Component	Cost
Bedrock AI (5 files)	$0.05
OpenSearch (~6 min)	$0.024
Lambda + Athena	$0.001
Total	$0.07

Projected Monthly (10TB, 100K files, 1000 changes/day)

Component	Monthly
S3 Tables (metadata)	$5
Lambda (sync + AI)	$36
Bedrock (AI enrichment)	$30
OpenSearch (business hours)	$42
SQS + misc	$1
Total	$114/month
S3 copy eliminated	-$230-256/month

Net effect: The AI-powered catalog costs less than the S3 copy it eliminates.

Without AI enrichment (metadata scan + Athena only): ~$42/month. AI processing is optional and can be enabled per-file-type.

S3 Standard pricing: us-east-1 $0.023/GB, ap-northeast-1 $0.025/GB. Verified 2026-06-01 via AWS Pricing API.

For reproducibility, see: evidence-record.yaml, cost-assumptions.yaml, comprehensive-test-results.yaml

Known Limitations (Honest Assessment)

Limitation	Impact	Workaround
Databricks SQL Warehouse `CREATE CONNECTION TYPE iceberg_rest` to S3 Tables REST failed in this validation (2026-05-31)	SQL Warehouse direct path unavailable in tested method	Retested 2026-06-09; still blocked in tested UC path. Use curated metadata sync to UC Delta as practical workaround; support case submitted.
Databricks Spark cluster: UC blocks external catalog registration (2026-06-01)	Cannot use spark.conf.set or cluster config for external Iceberg catalogs	UC Foreign Catalog tested 2026-06-09 — External Location validation fails against S3 Tables internal bucket. Sync metadata to UC Delta table instead.
Databricks Delta Sharing: cannot bypass S3 AP session policy (2026-06-01)	Sharing server uses same UC credentials	DataSync → S3 → UC → Delta Sharing works for copied data; validate target table format and catalog support separately
Databricks NFS mount: cannot register as UC External Volume (2026-06-01)	NFS/FUSE paths not supported for UC Volumes	DataSync → S3 → UC External Location; internal feature request exists
Snowflake External Iceberg Table with S3 Tables REST endpoint was not a supported catalog type in this validation (2026-05-31)	Direct S3 Tables REST path unavailable in tested method	✅ Resolved (2026-06-05): Use Glue REST + explicit `ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS`. Schema must have no default External Volume. AWS prerequisite: `register-resource --with-federation`. Lake Formation column-level filtering NOT enforced via this path.
LF column exclusion grant failed in tested S3 Tables federated catalog path	Can't hide specific columns via tested grant pattern	Athena Views; track AWS support status
At 40 files, ListObjectsV2 is faster than Athena	Architecture value is at scale (100K+)	Expected — Athena has cold start overhead

Naming note: Use lowercase table, namespace, and column names for S3 Tables integrated with AWS analytics services. Mixed-case names may not be visible to Athena / Glue / Lake Formation. See S3 Tables naming rules.

Performance Boundaries Not Yet Validated

This PoC validates the architecture shape, not production scale limits. The following require separate testing:

FSx throughput impact under concurrent NFS/SMB/S3 access
S3 Access Point metadata operation impact under large namespace scans
S3 API request concurrency vs FSx provisioned throughput capacity
Impact of scan jobs on production SMB/NFS latency
ListObjectsV2 pagination behavior at 1M+ files
Lambda concurrency and S3 AP request throttling
Iceberg manifest growth and compaction behavior
Athena query latency with high snapshot counts
OpenSearch indexing throughput during bulk backfill
File size distribution and small-file amplification effects
Cold vs warm namespace access behavior (capacity pool reads during backfill)

ONTAP Object Model Mapping

ONTAP / FSx object	Role in this pattern
FSx file system	Performance / HA boundary
SVM	Protocol and administrative boundary
Volume	Catalog scope and S3 Access Point attachment target
Junction path / SMB share	Existing application namespace
S3 Access Point	S3 API boundary for AI/analytics (with associated file-system identity)
Iceberg table	Metadata catalog, not raw data store

Each S3 Access Point has an associated OntapFileSystemIdentity (UNIX UID/GID or Windows domain user) that authorizes all file access through that AP. IAM policy is evaluated first, then ONTAP file-system permissions. See security/s3-access-point-identity-matrix.yaml.

Iceberg Table Maintenance Plan

For production, define:

Snapshot retention period and table maintenance behavior — verify S3 Tables service-managed policies and any configurable retention settings
Manifest rewrite cadence (if metadata table grows large)
Orphan file cleanup policy
Deduplication view or materialized latest-record table
Time travel retention policy
Athena engine version and Iceberg version compatibility
Append-only dedup query as default named query for analysts

For operational steps, see ops/iceberg-maintenance-runbook.md. For details on Iceberg spec vs S3 Tables service behavior, see docs/standards-vs-service-behavior.md.

Iceberg does not enforce primary-key uniqueness in this PoC. Consumers should query curated latest-record views instead of the append-only base table. See ops/athena-named-queries/latest_records.sql in the repo.

Apache Iceberg is the open table format. Amazon S3 Tables is an AWS managed table bucket service that uses Apache Iceberg. Some operational behavior, endpoint support, and governance integration are AWS service-specific and should be validated separately from the Iceberg specification itself.

File Identity Strategy

file_id method	Best for	Tradeoff
`hash(volume_id + normalized_path)`	General purpose	Rename = new file_id
`hash(volume_id + file_handle/inode)`	Rename tracking	Requires inode access
Content hash (SHA-256)	Immutable documents	Expensive for large files
`path + last_modified + size`	Lightweight PoC only	Fragile under overwrites

Production should define how rename, overwrite, delete, and permission changes are represented in the metadata table.

Recommended production columns: source_system_id, volume_id, normalized_path, path_hash, content_hash, scan_run_id, change_type (created / modified / deleted / renamed / permission_changed).

For FlexClone-based dev/test datasets, decide whether cloned files should retain lineage to source files. If lineage matters, store clone_parent_volume_id, clone_parent_snapshot_id, and catalog_environment (prod / dev / test / dr). See dr/snapmirror-catalog-rebinding.md for DR failover considerations.

For manufacturing and engineering workloads, see schema/extensions/manufacturing_metadata.yaml for domain-specific metadata fields such as part number, revision, plant, machine, and inspection lot.

Multi-Tenant Deployment Considerations

If this pattern is provided by a partner or platform team to multiple business units or customers, define the isolation boundary explicitly.

Isolation model	Recommended when	Tradeoff
Table bucket per tenant	Strong isolation required	Higher operational overhead
Namespace per tenant	Balanced isolation and operations	Shared table bucket governance required
tenant_id column in one table	Internal multi-BU catalog	Requires strict LF-Tags / row filters
OpenSearch index per tenant	Search isolation required	More index management
Shared OpenSearch index + tenant filter	Lower cost	Must enforce filter in every query path

For partner-led deployments, document tenant onboarding automation, offboarding deletion/retention policy, per-tenant cost allocation tags, and audit evidence location.

Business KPI Mapping

Business problem	Baseline metric	Target metric	How this PoC measures it
Employees cannot find documents	Average search time	< 10 sec	Search latency + result relevance
Manual classification is slow	Files classified/day/person	10x improvement	AI enrichment throughput
Sensitive files are unknown	% files classified for PII	95%+ coverage target	PII scan completion rate
Duplicate S3 copy is costly	Monthly duplicate storage cost	Reduce by 50%+	Metadata-only architecture cost
AI projects lack data inventory	Dataset discovery lead time	Days → hours	Catalog completeness
Business users need governed discovery	% searchable assets in BI/AI tools	80%+ of approved metadata visible	Expose curated metadata views to Athena, Databricks, Snowflake, or BI tools

Try It Yourself

FSx for ONTAP prerequisites:

SVM and volume selected as catalog scope
S3 Access Point attached to the target volume
Associated UNIX or Windows identity documented
NFS/SMB production workload impact reviewed
CloudWatch metrics dashboard enabled

# Clone the repo
git clone https://github.com/Yoshiki0705/fsxn-lakehouse-integrations.git
cd fsxn-lakehouse-integrations/integrations/iceberg-metadata-catalog

# Install dependencies
pip install -r requirements.txt

# Run the demo (requires FSx for ONTAP with S3 Access Point)
cd demo/scripts
./run-demo.sh --ap-alias <your-ap-alias-ext-s3alias>

Don't have FSx for ONTAP? You can still explore the architecture:

What's Next

This is Part 1 of a 3-part series:

Part 1 (this article): Architecture & PoC Results
Part 2: AI Enrichment Pipeline — Bedrock Vision + Titan Embeddings + OpenSearch NextGen
Part 3: Governance & Cross-Platform Access — Lake Formation, PII Anonymization, Databricks/Snowflake Integration

Key Takeaways

Don't copy data to make it searchable — catalog the metadata instead. Apache Iceberg + S3 Tables gives you a managed metadata layer with time travel.
Selective AI enrichment plus scale-to-zero search can keep PoC and low-traffic environments cost-efficient — compute/search components idle near $0; persistent metadata and logs may incur small charges.
42 seconds, $0.07 — that's the barrier to entry for an AI-powered data catalog on your existing NAS storage.
Start small, grow incrementally — from metadata-only scan (Level 1) to full business workflow integration (Level 5). See the Production Maturity Model for the progression path.

All code and documentation is available at github.com/Yoshiki0705/fsxn-lakehouse-integrations. Feedback welcome via GitHub Issues.

DEV Community