From Hours to Seconds: An AI-Powered Metadata Catalog for Unstructured Data on FSx for ONTAP

What Works Now vs What Requires Validation

This article separates verified AWS-native capabilities from cross-platform paths that still require validation. The core pattern — keeping raw files on FSx for ONTAP and cataloging only metadata in S3 Tables — is verified. Databricks paths are still evolving. Snowflake Glue REST + VENDED_CREDENTIALS and External Stage paths are verified in this PoC, with governance limitations noted below. Validate all cross-platform paths in your own environment before production use.

| Component | Status | Notes |
|---|---|---|
| AWS Native PoC (Athena + S3 Tables + Bedrock + OpenSearch + Lake Formation) | ✅ Verified | Full end-to-end in 42 seconds |
| Glue Iceberg REST endpoint access | ✅ Verified | Both S3 Tables REST and Glue REST confirmed |
| Lake Formation table-level governance | ✅ Verified | Grant/revoke/audit working |
| Lake Formation column-level exclusion | ⚠️ Observed limitation | Failed on tested federated catalog path |
| Databricks SQL Warehouse direct | ⚠️ Observed limitation | iceberg_rest connection type not supported |
| Databricks Spark + Iceberg REST | ❌ Blocked by UC | spark.conf.set and cluster config both fail; UC Foreign Catalog required |
| Databricks UC Foreign Catalog | ❌ Still blocked | Retested post-Foreign Iceberg GA (2026-06-09): Glue Connection ✅, Credentials ✅, but External Location fails: S3 Tables internal bucket rejects standard S3 API validation. No bypass available. |
| Databricks Delta Sharing via S3 AP | ❌ Confirmed | Sharing server uses same UC credentials; not a workaround for S3 AP session policy |
| Databricks NFS → UC Volume | ❌ Confirmed | Cloud storage URIs only; internal feature request exists |
| Databricks UC audit logging | ✅ Confirmed | External engine access fully logged |
| Snowflake via Glue REST (VENDED_CREDENTIALS) | ✅ Verified | Explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS; CREATE TABLE + SELECT + COUNT + AUTO_REFRESH all working (2026-06-05) |
| Snowflake External Stage (FSx S3 AP) | ✅ Verified | LIST, SELECT/COPY, and TO_FILE + Cortex AI all verified |

Important distinction: This pattern does not use FSx for ONTAP S3 Access Points as an Iceberg warehouse. Raw files stay on FSx for ONTAP, while only the metadata catalog is written to S3 Tables. Direct Iceberg table writes to FSx for ONTAP S3 Access Points are tracked separately as a known limitation because Iceberg commit behavior and S3FileIO compatibility require additional validation.

This is an Iceberg Adoption Pattern, Not a Raw-Data Migration

This pattern does not convert the original unstructured files into Iceberg table data. Instead, it adopts Iceberg for the metadata layer only.

| Scope | What happens |
|---|---|
| Data files | Not migrated. Raw files remain on FSx for ONTAP. |
| Metadata table | Newly created as an Iceberg table on S3 Tables. |
| Processing jobs | Metadata scan and AI enrichment jobs write append-only metadata. |
| Consumers | Athena, EMR, Snowflake, Databricks, and BI/search tools consume curated metadata views. |

Storage Boundary: What Moves and What Doesn't

FSx for ONTAP S3 Access Point:
  ✅ Raw file READ path only (AI enrichment input)
  ❌ NOT an Iceberg warehouse
  ❌ NOT a table commit target
  ❌ NOT bulk-copied to S3

S3 Tables:
  ✅ Iceberg METADATA table (file catalog)
  ✅ Metadata source of truth
  ✅ Query and governance target

Data movement disclosure (for regulated environments): Raw files are NOT bulk-copied to S3. However, during AI enrichment, selected file content is temporarily read via the S3 Access Point and sent to Amazon Bedrock APIs for classification/embedding. Per AWS Bedrock data protection policy, model providers have no access to customer prompts or completions. Extracted/redacted metadata and embeddings are written to S3 Tables, OpenSearch, and optionally to Snowflake or Databricks depending on the activation path. Define your data flow boundary documentation before regulated-workload deployment.

The Problem: Most Enterprise Unstructured Data is Difficult to Discover and Govern

Most organizations store terabytes of unstructured data (PDFs, images, CAD files, sensor logs) on network-attached storage. This data is often:

  • Undiscoverable: "Where is that invoice from last quarter?" requires manual searching or asking colleagues
  • Unclassified: governed at the file-system layer, but not classified or searchable from analytics and AI workflows
  • Audited in isolation: audit trails may exist at the file-system layer, but they are often not unified with analytics and AI query activity

Think of this as unstructured-data modernization: inventory first, classify selectively, govern metadata, and activate only what is needed — without bulk-copying the raw files.

Business Outcomes (Beyond Technical Metrics)

This pattern is not only about faster file search. It is about:

  • Reducing dataset discovery lead time for AI projects (days → hours)
  • Improving PII visibility across the organization (unknown → 95%+ coverage target)
  • Lowering duplicate storage cost ($230-256/month eliminated for 10TB)
  • Creating governed metadata products for analytics and AI teams
  • Enabling AI-readiness without raw-data copy or migration
  • Activating governed metadata in Snowflake AI Data Cloud for Cortex Search, semantic Q&A, executive dashboards, and business-facing file discovery

The traditional solution? Copy everything to S3 and build a catalog. But at 10TB, that's ~$230-256/month just for the copy — plus sync pipelines, duplicate governance, and data drift.

The Solution: Hot Metadata × Cold Data

What if we could catalog every file without moving it?

┌─────────────────────────────────────────────────────────┐
│  HOT: Metadata (Apache Iceberg on S3 Tables)            │
│  • File path, type, size, timestamps                    │
│  • AI classification + confidence score                 │
│  • Vector embedding (1024-dim, similarity search)       │
│  • PII detection flag                                   │
│  • Cost: ~$5-15/month for 100K files                    │
└────────────────────────┬────────────────────────────────┘
                         │ file_path reference
┌────────────────────────▼────────────────────────────────┐
│  COLD: Actual Files (FSx for ONTAP)                     │
│  • PDF, images, CAD, video, audio, logs                 │
│  • Deduplication (50-70% storage savings typical*)      │
│  • NFS/SMB (existing workflows) + S3 AP (AI/analytics)  │
│  • No bulk raw-data copy required                       │
└─────────────────────────────────────────────────────────┘

Key insight: Keep the data where it is. Move only the metadata into a queryable format.
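To make "move only the metadata" concrete, here is a minimal sketch of how a scan job might turn an object listing into catalog rows. It is not the repository's scanner. The function name `metadata_rows`, the column names, and the `vol-demo` volume ID are hypothetical, and the sample page is hand-written. In a real run the pages would presumably come from boto3's `get_paginator("list_objects_v2")` called with the S3 Access Point alias as the bucket.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Iterable, Iterator


def metadata_rows(pages: Iterable[dict], volume_id: str) -> Iterator[dict]:
    """Turn ListObjectsV2-style pages into catalog rows (metadata only)."""
    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip directory placeholder keys, if any are listed
            yield {
                "volume_id": volume_id,  # hypothetical column name
                "file_path": key,
                "file_type": PurePosixPath(key).suffix.lower().lstrip("."),
                "size_bytes": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            }


# Hand-written sample page, shaped like a ListObjectsV2 response.
sample_pages = [{
    "Contents": [
        {"Key": "invoices/", "Size": 0,
         "LastModified": datetime(2026, 5, 31, tzinfo=timezone.utc)},
        {"Key": "invoices/invoice_sample.PNG", "Size": 48213,
         "LastModified": datetime(2026, 5, 31, 9, 30, tzinfo=timezone.utc)},
    ]
}]

rows = list(metadata_rows(sample_pages, volume_id="vol-demo"))
for row in rows:
    print(row)
```

No file content is read here: each row carries only the path, type, size, and timestamp, which is what keeps the catalog small compared with the files it describes.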

Architecture

FSx for ONTAP ──S3 Access Point──→ AI Enrichment (Bedrock)
       │                                    │
       │                                    ▼
       │                          S3 Tables (Iceberg)
       │                                    │
       │                                    ▼
       │                          ┌──────────────────┐
       │                          │ Query Engines    │
       │                          │ • Athena (SQL)   │
       │                          │ • OpenSearch     │
       │                          │   (vector kNN)   │
       │                          │ • Lake Formation │
       │                          │   (governance)   │
       │                          └──────────────────┘
       │
       └──NFS/SMB──→ Existing applications (unchanged)

Observability (production add-on):

       ┌──────────────────────────────────────┐
       │  • CloudWatch Metrics + Alarms       │
       │  • CloudWatch Logs (Lambda/SQS)      │
       │  • CloudTrail (governance audit)     │
       │  • OpenSearch Dashboards (search UX) │
       │  • FSx metrics (throughput, IOPS,    │
       │    latency, capacity pool reads)     │
       └──────────────────────────────────────┘

Components:

| Component | Role | Cost |
|---|---|---|
| FSx for ONTAP S3 Access Point | Read files for AI processing (no copy) | Included with FSx |
| S3 Tables | AWS managed Apache Iceberg table service (auto-compaction, REST endpoint) | ~$5/month metadata |
| Bedrock Claude Vision | Image classification | ~$0.01/file in this demo |
| Titan Embeddings V2 | 1024-dim vectors for similarity search | $0.00002/1K input tokens |
| OpenSearch Serverless NextGen | kNN vector search (scale-to-zero) | $0 idle compute when inactive |
| Lake Formation | Metadata access governance | No additional Lake Formation charge |

S3 Tables Iceberg REST endpoint: https://s3tables.<region>.amazonaws.com/iceberg
Check S3 Tables availability for regional support before deployment.

* Deduplication ratio is a general ONTAP range. Actual savings depend on data characteristics and were not measured in this PoC.

PoC Results (Verified 2026-05-31)

We built and verified this end-to-end in a single day. Here's what we measured:

S3 Tables Access Paths: Which Endpoint Should You Use?

| Access path | Best for | Governance path | Verified |
|---|---|---|---|
| S3 Tables Iceberg REST (s3tables.<region>.amazonaws.com/iceberg) | Direct Iceberg client / simple PoC | IAM + S3 Tables permissions | |
| AWS Glue Iceberg REST (glue.<region>.amazonaws.com/iceberg) | Production analytics integration | IAM + Lake Formation | |
| Athena via Glue federated catalog | SQL analytics | Lake Formation + Athena | |
| PyIceberg local client | Lightweight validation | IAM/LF depending on endpoint | |

For production workloads with centralized governance, the AWS Glue Iceberg REST endpoint is recommended over the S3 Tables direct endpoint. See AWS docs.
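As a rough illustration of how the two REST endpoints differ from a client's point of view, the sketch below builds the connection properties a PyIceberg REST catalog would need. The helper `rest_catalog_properties` and the account and bucket values are hypothetical. The property keys and warehouse formats follow the pattern in the AWS and PyIceberg documentation as I understand it, so check them against the current docs before relying on them.

```python
def rest_catalog_properties(endpoint: str, region: str, account_id: str,
                            table_bucket: str) -> dict:
    """Build PyIceberg REST catalog properties for one of the two endpoints."""
    if endpoint == "s3tables":
        uri = f"https://s3tables.{region}.amazonaws.com/iceberg"
        # Assumed warehouse format: the table bucket ARN.
        warehouse = f"arn:aws:s3tables:{region}:{account_id}:bucket/{table_bucket}"
    elif endpoint == "glue":
        uri = f"https://glue.{region}.amazonaws.com/iceberg"
        # Assumed warehouse format: the federated Glue catalog for the table bucket.
        warehouse = f"{account_id}:s3tablescatalog/{table_bucket}"
    else:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return {
        "type": "rest",
        "uri": uri,
        "warehouse": warehouse,
        "rest.sigv4-enabled": "true",
        "rest.signing-name": endpoint,  # "s3tables" or "glue"
        "rest.signing-region": region,
    }


# Placeholder account ID and bucket name, for illustration only.
props = rest_catalog_properties("glue", "us-east-1", "111122223333", "demo-bucket")
print(props)

# With PyIceberg installed and AWS credentials configured, the properties
# could then be passed to the catalog loader:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("metadata_catalog", **props)
```

Keeping the endpoint choice in one function makes it harder for a client to end up configured against both catalogs at once, which matters given the single-writable-catalog rule for this table.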

Catalog authority rule: S3 Tables + Glue is the authoritative catalog for this metadata table in this PoC. Other engines should consume the table through the authoritative catalog or a controlled metadata activation path. Do not configure multiple writable catalogs for the same Iceberg table — dual-write causes split-brain and potential data corruption.

Athena Iceberg behavior depends on Athena engine version, Iceberg version, Glue/Lake Formation integration, and table maintenance state. Validate DDL/DML requirements separately before using this as a write-heavy production catalog.

Verification details are recorded in evidence-record.yaml and cross-platform-compatibility.yaml.

Before vs After

| Metric | Before | After | Improvement |
|---|---|---|---|
| File discovery time | Minutes-hours | < 2 seconds | 100x+ at scale |
| AI classification | Manual | Automatic (6 sec/file) | Fully automated |
| Storage cost (10TB) | ~$250/month (S3 copy) | $5-15/month (metadata only) | 95% reduction |
| Metadata query governance | Not applicable | 100% in this PoC | Complete for metadata queries |
| Idle compute/search cost | N/A | Near $0 when inactive | Persistent metadata/logs may still incur small charges |

Search Time Scaling (Measured + Projected)

| Files | ListObjectsV2 | Athena SQL | Speedup |
|---|---|---|---|
| 40 | 892 ms | 3.0 sec | 0.3x |
| 1,000 | 22.3 sec | 1.8 sec | 12x |
| 10,000 | 3.7 min | 1.8 sec | 124x |
| 100,000 | 37.2 min | 1.8 sec | 1,239x |
| 1,000,000 | 371.7 min | 1.8 sec | 12,389x |

At 40 files, ListObjectsV2 is faster because Athena has cold start overhead. Athena query time does not scale linearly with the number of files on FSx because it queries the Iceberg metadata table instead of listing the raw file namespace. In this controlled demo, the query stayed at roughly 1.8 seconds for the projected file counts, but production latency depends on Iceberg metadata size, manifest count, predicate selectivity, Athena cold start, and table maintenance state.

Projection method: ListObjectsV2 latency was extrapolated linearly from the measured 40-file scan. This is intentionally conservative for demonstrating namespace-scan behavior, but it is not a service benchmark.
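The projected rows can be reproduced with a few lines of arithmetic. This sketch uses only the numbers quoted in the table: the 892 ms scan of 40 files and the 1.8-second Athena time. The function names are illustrative. The 40-file row is a direct measurement, with Athena at 3.0 seconds, so it is not part of the projection.

```python
MEASURED_FILES = 40
MEASURED_LIST_SECONDS = 0.892  # measured ListObjectsV2 scan of 40 files
ATHENA_SECONDS = 1.8           # Athena query time used for the projected rows


def projected_list_seconds(files: int) -> float:
    """Linear extrapolation of the measured per-file listing time."""
    return MEASURED_LIST_SECONDS / MEASURED_FILES * files


def speedup(files: int) -> float:
    """Projected ListObjectsV2 time divided by the Athena query time."""
    return projected_list_seconds(files) / ATHENA_SECONDS


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} files: list {projected_list_seconds(n):9.1f} s, "
          f"speedup {speedup(n):,.0f}x")
```

Because the per-file cost is a constant here (about 22.3 ms), the speedup column grows in direct proportion to the file count. That is a property of the projection method, not a measured result.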

The 42-Second Demo

Our complete demo runs all 8 steps in 42 seconds:

Step 1: Before/After search comparison     ✅ (ListObjectsV2 vs Athena)
Step 2: Infrastructure deploy              ✅ (CloudFormation, skippable)
Step 3: Metadata scan (40 files)           ✅ (3 seconds)
Step 4: AI enrichment (Bedrock Vision)     ✅ (invoice → 0.95 confidence)
Step 5: Athena query + Time Travel         ✅ (< 2 seconds)
Step 6: Vector similarity search           ✅ (kNN score 0.67)
Step 7: PII detection + anonymization      ✅ (7/7 entities, all redacted)
Step 8: Cost & ROI analysis                ✅ ($0.07 total demo cost)

Total demo cost: $0.07. After the demo, the compute/search components can scale to zero. If you retain S3 Tables metadata, logs, or audit trails, small storage/logging charges may still apply.

AI Classification Results

| File | Classification | Confidence |
|---|---|---|
| invoice_sample.png | Invoice | 0.95 |
| product_inspection.png | Pie Chart | 1.0 |
| sensor_dashboard.png | IoT Sensor Dashboard | 0.9 |

In this demo, Bedrock Claude Vision classified sample images at roughly $0.01/file with sub-10-second latency. Production cost and latency depend on image size, prompt length, model version, and retry behavior.

Vector Similarity Search

Query: "find invoice or payment documents"
→ invoice_sample.png (score: 0.6749)

OpenSearch Serverless with scale-to-zero capability (GA May 2026) provides kNN search — no minimum cost when idle. Cold start is ~10-30 seconds, warm queries are ~54ms.

Verified in this PoC environment on 2026-05-31. Check the latest OpenSearch Serverless documentation and regional availability before deployment.

Governance: Lake Formation Access Control

Step 1: Authorized query    → ✅ SUCCEEDED (3 rows)
Step 2: Revoke SELECT       → 🔒 BLOCKED (access denied)
Step 3: Restore SELECT      → ✅ SUCCEEDED
Step 4: CloudTrail audit    → All queries logged with user identity

Metadata queries are governed and audited. Raw file access remains governed separately by FSx file-system permissions, S3 Access Point policies, and application access paths.

Cost Analysis

This Demo

| Component | Cost |
|---|---|
| Bedrock AI (5 files) | $0.05 |
| OpenSearch (~6 min) | $0.024 |
| Lambda + Athena | $0.001 |
| Total | $0.07 |

Projected Monthly (10TB, 100K files, 1000 changes/day)

| Component | Monthly |
|---|---|
| S3 Tables (metadata) | $5 |
| Lambda (sync + AI) | $36 |
| Bedrock (AI enrichment) | $30 |
| OpenSearch (business hours) | $42 |
| SQS + misc | $1 |
| Total | $114/month |
| S3 copy eliminated | -$230-256/month |

Net effect: The AI-powered catalog costs less than the S3 copy it eliminates.

Without AI enrichment (metadata scan + Athena only): ~$42/month. AI processing is optional and can be enabled per-file-type.

S3 Standard pricing: us-east-1 $0.023/GB-month, ap-northeast-1 $0.025/GB-month. Verified 2026-06-01 via AWS Pricing API.
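The cost comparison can be checked with simple arithmetic. The sketch below sums the projected monthly components and shows one way to arrive at the $230-256 range for the S3 copy. It assumes the lower bound counts 10 TB as 10,000 GB at the us-east-1 price and the upper bound counts 10,240 GB at the ap-northeast-1 price. The article does not state how the range was derived, so treat that as a guess that happens to match.

```python
# Per-GB monthly prices quoted in the article for S3 Standard.
S3_STANDARD_USD_PER_GB_MONTH = {"us-east-1": 0.023, "ap-northeast-1": 0.025}


def s3_copy_cost(gigabytes: float, region: str) -> float:
    """Monthly S3 Standard storage cost for a full copy, rounded to cents."""
    return round(gigabytes * S3_STANDARD_USD_PER_GB_MONTH[region], 2)


# Assumption: the two bounds mix decimal and binary terabytes.
low = s3_copy_cost(10 * 1000, "us-east-1")
high = s3_copy_cost(10 * 1024, "ap-northeast-1")

# Projected monthly components from the table (USD).
catalog_monthly = {
    "S3 Tables (metadata)": 5,
    "Lambda (sync + AI)": 36,
    "Bedrock (AI enrichment)": 30,
    "OpenSearch (business hours)": 42,
    "SQS + misc": 1,
}
catalog_total = sum(catalog_monthly.values())

print(f"S3 copy: ${low:.0f}-{high:.0f}/month, catalog: ${catalog_total}/month")
print(f"Net difference: ${low - catalog_total:.0f} to ${high - catalog_total:.0f}/month")
```

The calculation covers storage price only. Request charges, sync pipeline compute, and data transfer for the copy are not included on either side.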

For reproducibility, see: evidence-record.yaml, cost-assumptions.yaml, comprehensive-test-results.yaml

Known Limitations (Honest Assessment)

| Limitation | Impact | Workaround |
|---|---|---|
| Databricks SQL Warehouse CREATE CONNECTION TYPE iceberg_rest to S3 Tables REST failed in this validation (2026-05-31) | SQL Warehouse direct path unavailable in tested method | Retested 2026-06-09; still blocked in tested UC path. Use curated metadata sync to UC Delta as practical workaround; support case submitted. |
| Databricks Spark cluster: UC blocks external catalog registration (2026-06-01) | Cannot use spark.conf.set or cluster config for external Iceberg catalogs | UC Foreign Catalog tested 2026-06-09: External Location validation fails against S3 Tables internal bucket. Sync metadata to UC Delta table instead. |
| Databricks Delta Sharing: cannot bypass S3 AP session policy (2026-06-01) | Sharing server uses same UC credentials | DataSync → S3 → UC → Delta Sharing works for copied data; validate target table format and catalog support separately |
| Databricks NFS mount: cannot register as UC External Volume (2026-06-01) | NFS/FUSE paths not supported for UC Volumes | DataSync → S3 → UC External Location; internal feature request exists |
| Snowflake External Iceberg Table with S3 Tables REST endpoint was not a supported catalog type in this validation (2026-05-31) | Direct S3 Tables REST path unavailable in tested method | Resolved (2026-06-05): Use Glue REST + explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS. Schema must have no default External Volume. AWS prerequisite: register-resource --with-federation. Lake Formation column-level filtering NOT enforced via this path. |
| LF column exclusion grant failed in tested S3 Tables federated catalog path | Can't hide specific columns via tested grant pattern | Athena Views; track AWS support status |
| At 40 files, ListObjectsV2 is faster than Athena | Architecture value is at scale (100K+) | Expected: Athena has cold start overhead |

Naming note: Use lowercase table, namespace, and column names for S3 Tables integrated with AWS analytics services. Mixed-case names may not be visible to Athena / Glue / Lake Formation. See S3 Tables naming rules.

Performance Boundaries Not Yet Validated

This PoC validates the architecture shape, not production scale limits. The following require separate testing:

  • FSx throughput impact under concurrent NFS/SMB/S3 access
  • S3 Access Point metadata operation impact under large namespace scans
  • S3 API request concurrency vs FSx provisioned throughput capacity
  • Impact of scan jobs on production SMB/NFS latency
  • ListObjectsV2 pagination behavior at 1M+ files
  • Lambda concurrency and S3 AP request throttling
  • Iceberg manifest growth and compaction behavior
  • Athena query latency with high snapshot counts
  • OpenSearch indexing throughput during bulk backfill
  • File size distribution and small-file amplification effects
  • Cold vs warm namespace access behavior (capacity pool reads during backfill)

ONTAP Object Model Mapping

| ONTAP / FSx object | Role in this pattern |
|---|---|
| FSx file system | Performance / HA boundary |
| SVM | Protocol and administrative boundary |
| Volume | Catalog scope and S3 Access Point attachment target |
| Junction path / SMB share | Existing application namespace |
| S3 Access Point | S3 API boundary for AI/analytics (with associated file-system identity) |
| Iceberg table | Metadata catalog, not raw data store |

Each S3 Access Point has an associated OntapFileSystemIdentity (UNIX UID/GID or Windows domain user) that authorizes all file access through that AP. IAM policy is evaluated first, then ONTAP file-system permissions. See security/s3-access-point-identity-matrix.yaml.

Iceberg Table Maintenance Plan

For production, define:

  • Snapshot retention period and table maintenance behavior — verify S3 Tables service-managed policies and any configurable retention settings
  • Manifest rewrite cadence (if metadata table grows large)
  • Orphan file cleanup policy
  • Deduplication view or materialized latest-record table
  • Time travel retention policy
  • Athena engine version and Iceberg version compatibility
  • Append-only dedup query as default named query for analysts

For operational steps, see ops/iceberg-maintenance-runbook.md. For details on Iceberg spec vs S3 Tables service behavior, see docs/standards-vs-service-behavior.md.

Iceberg does not enforce primary-key uniqueness in this PoC. Consumers should query curated latest-record views instead of the append-only base table. See ops/athena-named-queries/latest_records.sql in the repo.
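The latest-record idea can be sketched in a few lines. This is not the `latest_records.sql` query from the repo, only an illustration of the logic such a view would implement. The `scanned_at` column and the choice to drop files whose newest event is a delete are assumptions made for the example.

```python
from typing import Iterable


def latest_records(rows: Iterable[dict]) -> list[dict]:
    """Keep the newest row per file_id; drop files whose newest event is a delete."""
    newest: dict[str, dict] = {}
    for row in rows:
        current = newest.get(row["file_id"])
        # ISO-8601 UTC timestamps in one format sort correctly as strings.
        if current is None or row["scanned_at"] > current["scanned_at"]:
            newest[row["file_id"]] = row
    return [r for r in newest.values() if r["change_type"] != "deleted"]


# Append-only base table: every scan adds rows, nothing is updated in place.
base_table = [
    {"file_id": "a", "change_type": "created", "scanned_at": "2026-05-31T09:00:00Z"},
    {"file_id": "b", "change_type": "created", "scanned_at": "2026-05-31T09:00:00Z"},
    {"file_id": "a", "change_type": "modified", "scanned_at": "2026-06-01T09:00:00Z"},
    {"file_id": "b", "change_type": "deleted", "scanned_at": "2026-06-02T09:00:00Z"},
]

latest = latest_records(base_table)
for row in latest:
    print(row)
```

Querying the base table directly would return four rows for two files, one of which no longer exists. That is the kind of surprise a curated view is meant to prevent.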

Apache Iceberg is the open table format. Amazon S3 Tables is an AWS managed table bucket service that uses Apache Iceberg. Some operational behavior, endpoint support, and governance integration are AWS service-specific and should be validated separately from the Iceberg specification itself.

File Identity Strategy

| file_id method | Best for | Tradeoff |
|---|---|---|
| hash(volume_id + normalized_path) | General purpose | Rename = new file_id |
| hash(volume_id + file_handle/inode) | Rename tracking | Requires inode access |
| Content hash (SHA-256) | Immutable documents | Expensive for large files |
| path + last_modified + size | Lightweight PoC only | Fragile under overwrites |

Production should define how rename, overwrite, delete, and permission changes are represented in the metadata table.

Recommended production columns: source_system_id, volume_id, normalized_path, path_hash, content_hash, scan_run_id, change_type (created / modified / deleted / renamed / permission_changed).
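As an illustration of the first strategy, hash(volume_id + normalized_path), here is a minimal sketch. The normalization rules (collapse duplicate slashes and dot segments, keep case as-is) and the NUL separator are assumptions. A production scheme would need to settle details such as case sensitivity for SMB shares and Unicode normalization.

```python
import hashlib
import posixpath


def normalize_path(path: str) -> str:
    """Collapse duplicate slashes and '.'/'..' segments; case is preserved."""
    return posixpath.normpath("/" + path.strip().lstrip("/"))


def path_file_id(volume_id: str, path: str) -> str:
    """Derive a stable file_id from the volume ID and the normalized path."""
    # The NUL separator keeps ("vol1", "2/a") and ("vol12", "/a") distinct.
    payload = f"{volume_id}\x00{normalize_path(path)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


same_a = path_file_id("vol-1", "/invoices//2026/../invoice.png")
same_b = path_file_id("vol-1", "invoices/./invoice.png")
renamed = path_file_id("vol-1", "invoices/invoice_v2.png")

print(same_a == same_b)   # equivalent spellings map to one file_id
print(same_a == renamed)  # a rename yields a different file_id
```

The second comparison is the "Rename = new file_id" tradeoff from the table: a path-based ID cannot follow a renamed file on its own, so the rename has to be recorded as a change_type event or tracked with an inode-based ID.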

For FlexClone-based dev/test datasets, decide whether cloned files should retain lineage to source files. If lineage matters, store clone_parent_volume_id, clone_parent_snapshot_id, and catalog_environment (prod / dev / test / dr). See dr/snapmirror-catalog-rebinding.md for DR failover considerations.

For manufacturing and engineering workloads, see schema/extensions/manufacturing_metadata.yaml for domain-specific metadata fields such as part number, revision, plant, machine, and inspection lot.

Multi-Tenant Deployment Considerations

If this pattern is provided by a partner or platform team to multiple business units or customers, define the isolation boundary explicitly.

| Isolation model | Recommended when | Tradeoff |
|---|---|---|
| Table bucket per tenant | Strong isolation required | Higher operational overhead |
| Namespace per tenant | Balanced isolation and operations | Shared table bucket governance required |
| tenant_id column in one table | Internal multi-BU catalog | Requires strict LF-Tags / row filters |
| OpenSearch index per tenant | Search isolation required | More index management |
| Shared OpenSearch index + tenant filter | Lower cost | Must enforce filter in every query path |

For partner-led deployments, document tenant onboarding automation, offboarding deletion/retention policy, per-tenant cost allocation tags, and audit evidence location.
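For the shared-index model, one way to enforce the filter in every query path is to route all searches through a single query builder that refuses to run without a tenant. The sketch below is illustrative only. The field names `embedding` and `tenant_id` are hypothetical, and the filter-inside-knn form follows OpenSearch's k-NN filtering syntax as I understand it, so confirm that your collection and engine support it.

```python
def tenant_knn_query(tenant_id: str, query_vector: list[float], k: int = 5) -> dict:
    """Build a kNN search body that is always scoped to one tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # hypothetical vector field name
                    "vector": query_vector,
                    "k": k,
                    # Hypothetical keyword field holding the tenant identifier.
                    "filter": {"term": {"tenant_id": tenant_id}},
                }
            }
        },
    }


query = tenant_knn_query("bu-finance", [0.12, -0.03, 0.88])
print(query)
```

A builder like this covers only application-side queries. Dashboards, ad hoc API access, and any other path to the index still need their own access controls.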

Business KPI Mapping

| Business problem | Baseline metric | Target metric | How this PoC measures it |
|---|---|---|---|
| Employees cannot find documents | Average search time | < 10 sec | Search latency + result relevance |
| Manual classification is slow | Files classified/day/person | 10x improvement | AI enrichment throughput |
| Sensitive files are unknown | % files classified for PII | 95%+ coverage target | PII scan completion rate |
| Duplicate S3 copy is costly | Monthly duplicate storage cost | Reduce by 50%+ | Metadata-only architecture cost |
| AI projects lack data inventory | Dataset discovery lead time | Days → hours | Catalog completeness |
| Business users need governed discovery | % searchable assets in BI/AI tools | 80%+ of approved metadata visible | Expose curated metadata views to Athena, Databricks, Snowflake, or BI tools |

Try It Yourself

FSx for ONTAP prerequisites:

  • SVM and volume selected as catalog scope
  • S3 Access Point attached to the target volume
  • Associated UNIX or Windows identity documented
  • NFS/SMB production workload impact reviewed
  • CloudWatch metrics dashboard enabled

# Clone the repo
git clone https://github.com/Yoshiki0705/fsxn-lakehouse-integrations.git
cd fsxn-lakehouse-integrations/integrations/iceberg-metadata-catalog

# Install dependencies
pip install -r requirements.txt

# Run the demo (requires FSx for ONTAP with S3 Access Point)
cd demo/scripts
./run-demo.sh --ap-alias <your-ap-alias-ext-s3alias>

Don't have FSx for ONTAP? You can still explore the architecture through the code and documentation in the repository.

What's Next

This is Part 1 of a 3-part series:

  • Part 1 (this article): Architecture & PoC Results
  • Part 2: AI Enrichment Pipeline — Bedrock Vision + Titan Embeddings + OpenSearch NextGen
  • Part 3: Governance & Cross-Platform Access — Lake Formation, PII Anonymization, Databricks/Snowflake Integration

Key Takeaways

  1. Don't copy data to make it searchable — catalog the metadata instead. Apache Iceberg + S3 Tables gives you a managed metadata layer with time travel.
  2. Selective AI enrichment plus scale-to-zero search can keep PoC and low-traffic environments cost-efficient — compute/search components idle near $0; persistent metadata and logs may incur small charges.
  3. 42 seconds, $0.07 — that's the barrier to entry for an AI-powered data catalog on your existing NAS storage.
  4. Start small, grow incrementally — from metadata-only scan (Level 1) to full business workflow integration (Level 5). See the Production Maturity Model for the progression path.

All code and documentation are available at github.com/Yoshiki0705/fsxn-lakehouse-integrations. Feedback welcome via GitHub Issues.
