What Works Now vs What Requires Validation
This article separates verified AWS-native capabilities from cross-platform paths that still require validation. The core pattern — keeping raw files on FSx for ONTAP and cataloging only metadata in S3 Tables — is verified. Databricks paths are still evolving. Snowflake Glue REST + VENDED_CREDENTIALS and External Stage paths are verified in this PoC, with governance limitations noted below. Validate all cross-platform paths in your own environment before production use.
| Component | Status | Notes |
|---|---|---|
| AWS Native PoC (Athena + S3 Tables + Bedrock + OpenSearch + Lake Formation) | ✅ Verified | Full end-to-end in 42 seconds |
| Glue Iceberg REST endpoint access | ✅ Verified | Both S3 Tables REST and Glue REST confirmed |
| Lake Formation table-level governance | ✅ Verified | Grant/revoke/audit working |
| Lake Formation column-level exclusion | ⚠️ Observed limitation | Failed on tested federated catalog path |
| Databricks SQL Warehouse direct | ⚠️ Observed limitation | iceberg_rest connection type not supported |
| Databricks Spark + Iceberg REST | ❌ Blocked by UC | spark.conf.set and cluster config both fail; UC Foreign Catalog required |
| Databricks UC Foreign Catalog | ❌ Still blocked | Retested post-Foreign Iceberg GA (2026-06-09): Glue Connection ✅, Credentials ✅, but External Location fails — S3 Tables internal bucket rejects standard S3 API validation. No bypass available. |
| Databricks Delta Sharing via S3 AP | ❌ Confirmed blocked | Sharing server uses same UC credentials; not a workaround for S3 AP session policy |
| Databricks NFS → UC Volume | ❌ Confirmed blocked | Cloud storage URIs only; internal feature request exists |
| Databricks UC audit logging | ✅ Confirmed | External engine access fully logged |
| Snowflake via Glue REST (VENDED_CREDENTIALS) | ✅ Verified | Explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS; CREATE TABLE + SELECT + COUNT + AUTO_REFRESH all working (2026-06-05) |
| Snowflake External Stage (FSx S3 AP) | ✅ Verified | LIST, SELECT/COPY, and TO_FILE + Cortex AI all verified |
Important distinction: This pattern does not use FSx for ONTAP S3 Access Points as an Iceberg warehouse. Raw files stay on FSx for ONTAP, while only the metadata catalog is written to S3 Tables. Direct Iceberg table writes to FSx for ONTAP S3 Access Points are tracked separately as a known limitation because Iceberg commit behavior and S3FileIO compatibility require additional validation.
This is an Iceberg Adoption Pattern, Not a Raw-Data Migration
This pattern does not convert the original unstructured files into Iceberg table data. Instead, it adopts Iceberg for the metadata layer only.
| Scope | What happens |
|---|---|
| Data files | Not migrated. Raw files remain on FSx for ONTAP. |
| Metadata table | Newly created as an Iceberg table on S3 Tables. |
| Processing jobs | Metadata scan and AI enrichment jobs write append-only metadata. |
| Consumers | Athena, EMR, Snowflake, Databricks, and BI/search tools consume curated metadata views. |
Storage Boundary: What Moves and What Doesn't
FSx for ONTAP S3 Access Point:
✅ Raw file READ path only (AI enrichment input)
❌ NOT an Iceberg warehouse
❌ NOT a table commit target
❌ NOT bulk-copied to S3
S3 Tables:
✅ Iceberg METADATA table (file catalog)
✅ Metadata source of truth
✅ Query and governance target
Data movement disclosure (for regulated environments): Raw files are NOT bulk-copied to S3. However, during AI enrichment, selected file content is temporarily read via the S3 Access Point and sent to Amazon Bedrock APIs for classification/embedding. Per AWS Bedrock data protection policy, model providers have no access to customer prompts or completions. Extracted/redacted metadata and embeddings are written to S3 Tables, OpenSearch, and optionally to Snowflake or Databricks depending on the activation path. Define your data flow boundary documentation before regulated-workload deployment.
The Problem: Most Enterprise Unstructured Data is Difficult to Discover and Govern
Most organizations store terabytes of unstructured data — PDFs, images, CAD files, sensor logs — on network-attached storage. This data is:
- Undiscoverable: "Where is that invoice from last quarter?" requires manual searching or asking colleagues
- Governed at the file-system layer, but not classified or searchable from analytics and AI workflows
- Audited in isolation: file-system audit trails may exist, but they are often not unified with analytics and AI query activity
Think of this as unstructured-data modernization: inventory first, classify selectively, govern metadata, and activate only what is needed — without bulk-copying the raw files.
Business Outcomes (Beyond Technical Metrics)
This pattern is not only about faster file search. It is about:
- Reducing dataset discovery lead time for AI projects (days → hours)
- Improving PII visibility across the organization (unknown → 95%+ coverage target)
- Lowering duplicate storage cost ($230-256/month eliminated for 10TB)
- Creating governed metadata products for analytics and AI teams
- Enabling AI-readiness without raw-data copy or migration
- Activating governed metadata in Snowflake AI Data Cloud for Cortex Search, semantic Q&A, executive dashboards, and business-facing file discovery
The traditional solution? Copy everything to S3 and build a catalog. But at 10TB, that's ~$230-256/month just for the copy — plus sync pipelines, duplicate governance, and data drift.
The Solution: Hot Metadata × Cold Data
What if we could catalog every file without moving it?
┌─────────────────────────────────────────────────────────┐
│ HOT: Metadata (Apache Iceberg on S3 Tables) │
│ • File path, type, size, timestamps │
│ • AI classification + confidence score │
│ • Vector embedding (1024-dim, similarity search) │
│ • PII detection flag │
│ • Cost: ~$5-15/month for 100K files │
└────────────────────────┬────────────────────────────────┘
│ file_path reference
┌────────────────────────▼────────────────────────────────┐
│ COLD: Actual Files (FSx for ONTAP) │
│ • PDF, images, CAD, video, audio, logs │
│ • Deduplication (50-70% storage savings typical*) │
│ • NFS/SMB (existing workflows) + S3 AP (AI/analytics) │
│ • No bulk raw-data copy required │
└─────────────────────────────────────────────────────────┘
Key insight: Keep the data where it is. Move only the metadata into a queryable format.
Architecture
FSx for ONTAP ──S3 Access Point──→ AI Enrichment (Bedrock)
│ │
│ ▼
│ S3 Tables (Iceberg)
│ │
│ ▼
│ ┌──────────────────┐
│ │ Query Engines │
│ │ • Athena (SQL) │
│ │ • OpenSearch │
│ │ (vector kNN) │
│ │ • Lake Formation │
│ │ (governance) │
│ └──────────────────┘
│
└──NFS/SMB──→ Existing applications (unchanged)
Observability (production add-on):
┌──────────────────────────────────────┐
│ • CloudWatch Metrics + Alarms │
│ • CloudWatch Logs (Lambda/SQS) │
│ • CloudTrail (governance audit) │
│ • OpenSearch Dashboards (search UX) │
│ • FSx metrics (throughput, IOPS, │
│ latency, capacity pool reads) │
└──────────────────────────────────────┘
Components:
| Component | Role | Cost |
|---|---|---|
| FSx for ONTAP S3 Access Point | Read files for AI processing (no copy) | Included with FSx |
| S3 Tables | AWS managed Apache Iceberg table service (auto-compaction, REST endpoint) | ~$5/month metadata |
| Bedrock Claude Vision | Image classification | ~$0.01/file in this demo |
| Titan Embeddings V2 | 1024-dim vectors for similarity search | $0.00002/1K input tokens |
| OpenSearch Serverless NextGen | kNN vector search (scale-to-zero) | $0 idle compute when inactive |
| Lake Formation | Metadata access governance | No additional Lake Formation charge |
S3 Tables Iceberg REST endpoint:
https://s3tables.<region>.amazonaws.com/iceberg
Check S3 Tables regional availability before deployment.
* The 50-70% deduplication ratio is a general ONTAP range. Actual savings depend on data characteristics and were not measured in this PoC.
PoC Results (Verified 2026-05-31)
We built and verified this end-to-end in a single day. Here's what we measured:
S3 Tables Access Paths: Which Endpoint Should You Use?
| Access path | Best for | Governance path | Verified |
|---|---|---|---|
| S3 Tables Iceberg REST (s3tables.<region>.amazonaws.com/iceberg) | Direct Iceberg client / simple PoC | IAM + S3 Tables permissions | ✅ |
| AWS Glue Iceberg REST (glue.<region>.amazonaws.com/iceberg) | Production analytics integration | IAM + Lake Formation | ✅ |
| Athena via Glue federated catalog | SQL analytics | Lake Formation + Athena | ✅ |
| PyIceberg local client | Lightweight validation | IAM/LF depending on endpoint | ✅ |
For production workloads with centralized governance, the AWS Glue Iceberg REST endpoint is recommended over the S3 Tables direct endpoint. See AWS docs.
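As a rough illustration of how the two REST endpoints differ from a client's point of view, the sketch below builds catalog properties for each. The property names (`rest.sigv4-enabled`, `rest.signing-name`, `rest.signing-region`) and the warehouse formats follow the PyIceberg and AWS documentation as we understand them. The region, account ID, and bucket name are placeholders, and the helper function names are hypothetical. Treat this as a starting point to validate, not as the PoC's exact configuration.

```python
def s3tables_rest_properties(region: str, table_bucket_arn: str) -> dict[str, str]:
    """Properties for the S3 Tables Iceberg REST endpoint (SigV4 service: s3tables)."""
    return {
        "type": "rest",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "warehouse": table_bucket_arn,  # the table bucket ARN acts as the warehouse
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    }


def glue_rest_properties(region: str, account_id: str, table_bucket: str) -> dict[str, str]:
    """Properties for the AWS Glue Iceberg REST endpoint (SigV4 service: glue)."""
    return {
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Assumed federated-catalog warehouse format: <account>:s3tablescatalog/<bucket>
        "warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }


# Placeholder values for illustration only.
direct = s3tables_rest_properties(
    "us-east-1", "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket"
)
governed = glue_rest_properties("us-east-1", "111122223333", "example-table-bucket")

# With PyIceberg installed, either dict can be passed to the catalog loader:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("metadata_catalog", **governed)
```

The only differences are the URI, the warehouse identifier, and the SigV4 signing name, which is why switching a client from the direct path to the Lake Formation governed path is mostly a configuration change.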
Catalog authority rule: S3 Tables + Glue is the authoritative catalog for this metadata table in this PoC. Other engines should consume the table through the authoritative catalog or a controlled metadata activation path. Do not configure multiple writable catalogs for the same Iceberg table — dual-write causes split-brain and potential data corruption.
Athena Iceberg behavior depends on Athena engine version, Iceberg version, Glue/Lake Formation integration, and table maintenance state. Validate DDL/DML requirements separately before using this as a write-heavy production catalog.
Verification details are recorded in evidence-record.yaml and cross-platform-compatibility.yaml.
Before vs After
| Metric | Before | After | Improvement |
|---|---|---|---|
| File discovery time | Minutes-hours | < 2 seconds | 100x+ at scale |
| AI classification | Manual | Automatic (6 sec/file) | Fully automated |
| Storage cost (10TB) | ~$250/month (S3 copy) | $5-15/month (metadata only) | 95% reduction |
| Metadata query governance | Not applicable | 100% in this PoC | Complete for metadata queries |
| Idle compute/search cost | N/A | Near $0 when inactive | Persistent metadata/logs may still incur small charges |
Search Time Scaling (Measured + Projected)
| Files | ListObjectsV2 | Athena SQL | Speedup |
|---|---|---|---|
| 40 | 892 ms | 3.0 sec | 0.3x |
| 1,000 | 22.3 sec | 1.8 sec | 12x |
| 10,000 | 3.7 min | 1.8 sec | 124x |
| 100,000 | 37.2 min | 1.8 sec | 1,239x |
| 1,000,000 | 371.7 min | 1.8 sec | 12,389x |
At 40 files, ListObjectsV2 is faster because Athena has cold start overhead. Athena query time does not scale linearly with the number of files on FSx because it queries the Iceberg metadata table instead of listing the raw file namespace. In this controlled demo, the query stayed at about 1.8 seconds, and the table applies that figure to the projected file counts. Production latency depends on Iceberg metadata size, manifest count, predicate selectivity, Athena cold start, and table maintenance state.
Projection method: ListObjectsV2 latency was extrapolated linearly from the measured 40-file scan. This is intentionally conservative for demonstrating namespace-scan behavior, but it is not a service benchmark.
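The projection can be reproduced in a few lines. This sketch uses only the two measured inputs from the table (892 ms for 40 files, about 1.8 seconds per Athena query) and assumes, as the projection method states, that listing time grows linearly and Athena latency stays constant. The function names are ours.

```python
MEASURED_FILES = 40
MEASURED_LIST_MS = 892.0  # measured ListObjectsV2 scan time for 40 files
ATHENA_SEC = 1.8          # Athena query latency observed in the demo


def projected_list_seconds(files: int) -> float:
    """Scale the measured 40-file listing time linearly to `files`."""
    return MEASURED_LIST_MS / MEASURED_FILES * files / 1000.0


def speedup(files: int, athena_sec: float = ATHENA_SEC) -> float:
    """Ratio of projected listing time to a constant Athena query time."""
    return projected_list_seconds(files) / athena_sec


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} files: list ~{projected_list_seconds(n):,.1f} s, "
          f"speedup ~{speedup(n):,.0f}x")
```

Changing `ATHENA_SEC` to your own measured latency is the quickest way to see how sensitive the speedup column is to Athena cold starts.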
The 42-Second Demo
Our complete demo runs all 8 steps in 42 seconds:
Step 1: Before/After search comparison ✅ (ListObjectsV2 vs Athena)
Step 2: Infrastructure deploy ✅ (CloudFormation, skippable)
Step 3: Metadata scan (40 files) ✅ (3 seconds)
Step 4: AI enrichment (Bedrock Vision) ✅ (invoice → 0.95 confidence)
Step 5: Athena query + Time Travel ✅ (< 2 seconds)
Step 6: Vector similarity search ✅ (kNN score 0.67)
Step 7: PII detection + anonymization ✅ (7/7 entities, all redacted)
Step 8: Cost & ROI analysis ✅ ($0.07 total demo cost)
Total demo cost: $0.07. After the demo, the compute/search components can scale to zero. If you retain S3 Tables metadata, logs, or audit trails, small storage/logging charges may still apply.
AI Classification Results
| File | Classification | Confidence |
|---|---|---|
| invoice_sample.png | Invoice | 0.95 |
| product_inspection.png | Pie Chart | 1.0 |
| sensor_dashboard.png | IoT Sensor Dashboard | 0.9 |
In this demo, Bedrock Claude Vision classified sample images at roughly $0.01/file with sub-10-second latency. Production cost and latency depend on image size, prompt length, model version, and retry behavior.
Vector Similarity Search
Query: "find invoice or payment documents"
→ invoice_sample.png (score: 0.6749)
OpenSearch Serverless with scale-to-zero capability (GA May 2026) provides kNN search — no minimum cost when idle. Cold start is ~10-30 seconds, warm queries are ~54ms.
Verified in this PoC environment on 2026-05-31. Check the latest OpenSearch Serverless documentation and regional availability before deployment.
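To make the kNN step concrete, here is a minimal brute-force sketch of similarity ranking. The 3-dimensional vectors are made up for illustration; the PoC uses 1024-dimensional Titan embeddings, and OpenSearch uses approximate search with its own score normalization, so this will not reproduce the 0.6749 score above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query: list[float], index: dict[str, list[float]], k: int = 1):
    """Return the k most similar entries as (score, file_name), best first."""
    scored = [(cosine(query, vec), name) for name, vec in index.items()]
    return sorted(scored, reverse=True)[:k]


# Toy embeddings (hypothetical values, not produced by Titan).
toy_index = {
    "invoice_sample.png": [0.9, 0.1, 0.0],
    "sensor_dashboard.png": [0.1, 0.9, 0.2],
    "product_inspection.png": [0.2, 0.3, 0.9],
}
toy_query = [1.0, 0.2, 0.1]  # stands in for the embedded search phrase

top = knn(toy_query, toy_index, k=1)
print(top[0][1])
```

The metadata table stores the embedding next to the file path, so the search result is a pointer back to the file on FSx for ONTAP rather than a copy of it.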
Governance: Lake Formation Access Control
Step 1: Authorized query → ✅ SUCCEEDED (3 rows)
Step 2: Revoke SELECT → 🔒 BLOCKED (access denied)
Step 3: Restore SELECT → ✅ SUCCEEDED
Step 4: CloudTrail audit → All queries logged with user identity
Metadata queries are governed and audited. Raw file access remains governed separately by FSx file-system permissions, S3 Access Point policies, and application access paths.
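The revoke and restore steps can be sketched as a single request payload that is passed to either call. The parameter shape follows the boto3 Lake Formation `grant_permissions` / `revoke_permissions` API. The catalog ID format for the S3 Tables federated catalog is our understanding of the AWS documentation, and the account, bucket, namespace, table, and role names are hypothetical.

```python
def table_select_request(account_id: str, table_bucket: str, namespace: str,
                         table: str, principal_arn: str) -> dict:
    """Build kwargs shared by grant_permissions and revoke_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                # Assumed federated catalog ID: <account>:s3tablescatalog/<bucket>
                "CatalogId": f"{account_id}:s3tablescatalog/{table_bucket}",
                "DatabaseName": namespace,
                "Name": table,
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical identifiers for illustration only.
request = table_select_request(
    "111122223333",
    "example-table-bucket",
    "file_catalog",
    "file_metadata",
    "arn:aws:iam::111122223333:role/ExampleAnalystRole",
)

# With boto3 installed and credentials configured, the demo steps map to:
#   import boto3
#   lf = boto3.client("lakeformation")
#   lf.revoke_permissions(**request)  # Step 2: subsequent queries are denied
#   lf.grant_permissions(**request)   # Step 3: access restored
```

Keeping the payload in one place makes it easier to record exactly which principal and table were affected when you collect audit evidence.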
Cost Analysis
This Demo
| Component | Cost |
|---|---|
| Bedrock AI (5 files) | $0.05 |
| OpenSearch (~6 min) | $0.024 |
| Lambda + Athena | $0.001 |
| Total | $0.07 |
Projected Monthly (10TB, 100K files, 1000 changes/day)
| Component | Monthly |
|---|---|
| S3 Tables (metadata) | $5 |
| Lambda (sync + AI) | $36 |
| Bedrock (AI enrichment) | $30 |
| OpenSearch (business hours) | $42 |
| SQS + misc | $1 |
| Total | $114/month |
| S3 copy eliminated | -$230-256/month |
Net effect: The AI-powered catalog costs less than the S3 copy it eliminates.
Without AI enrichment (metadata scan + Athena only): ~$42/month. AI processing is optional and can be enabled per-file-type.
S3 Standard pricing: us-east-1 $0.023/GB, ap-northeast-1 $0.025/GB. Verified 2026-06-01 via AWS Pricing API.
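The arithmetic behind the comparison is simple enough to check directly. The component costs and per-GB prices come from the tables above. How the $230-256 range is derived is our assumption: the low end uses 10 TB as 10,000 GB at the us-east-1 price, and the high end uses 10,240 GB at the ap-northeast-1 price.

```python
MONTHLY_COMPONENTS = {
    "S3 Tables (metadata)": 5,
    "Lambda (sync + AI)": 36,
    "Bedrock (AI enrichment)": 30,
    "OpenSearch (business hours)": 42,
    "SQS + misc": 1,
}


def s3_copy_cost(terabytes: float, price_per_gb: float, gb_per_tb: int) -> float:
    """Monthly S3 Standard storage cost for a full copy of the raw data."""
    return terabytes * gb_per_tb * price_per_gb


catalog_total = sum(MONTHLY_COMPONENTS.values())
copy_low = s3_copy_cost(10, 0.023, 1000)   # us-east-1, decimal terabytes (assumed)
copy_high = s3_copy_cost(10, 0.025, 1024)  # ap-northeast-1, binary terabytes (assumed)

print(f"Catalog: ${catalog_total}/month")
print(f"S3 copy: ${copy_low:.0f}-{copy_high:.0f}/month")
print(f"Net saving: ${copy_low - catalog_total:.0f}-{copy_high - catalog_total:.0f}/month")
```

The copy figure covers storage only. Sync pipelines and request charges for the copy are excluded, as in the article's comparison.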
For reproducibility, see: evidence-record.yaml, cost-assumptions.yaml, comprehensive-test-results.yaml
Known Limitations (Honest Assessment)
| Limitation | Impact | Workaround |
|---|---|---|
| Databricks SQL Warehouse CREATE CONNECTION TYPE iceberg_rest to S3 Tables REST failed in this validation (2026-05-31) | SQL Warehouse direct path unavailable in tested method | Retested 2026-06-09; still blocked in tested UC path. Use curated metadata sync to UC Delta as practical workaround; support case submitted. |
| Databricks Spark cluster: UC blocks external catalog registration (2026-06-01) | Cannot use spark.conf.set or cluster config for external Iceberg catalogs | UC Foreign Catalog tested 2026-06-09 — External Location validation fails against S3 Tables internal bucket. Sync metadata to UC Delta table instead. |
| Databricks Delta Sharing: cannot bypass S3 AP session policy (2026-06-01) | Sharing server uses same UC credentials | DataSync → S3 → UC → Delta Sharing works for copied data; validate target table format and catalog support separately |
| Databricks NFS mount: cannot register as UC External Volume (2026-06-01) | NFS/FUSE paths not supported for UC Volumes | DataSync → S3 → UC External Location; internal feature request exists |
| Snowflake External Iceberg Table with S3 Tables REST endpoint was not a supported catalog type in this validation (2026-05-31) | Direct S3 Tables REST path unavailable in tested method | ✅ Resolved (2026-06-05): Use Glue REST + explicit ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS. Schema must have no default External Volume. AWS prerequisite: register-resource --with-federation. Lake Formation column-level filtering NOT enforced via this path. |
| LF column exclusion grant failed in tested S3 Tables federated catalog path | Can't hide specific columns via tested grant pattern | Athena Views; track AWS support status |
| At 40 files, ListObjectsV2 is faster than Athena | Architecture value is at scale (100K+) | Expected — Athena has cold start overhead |
Naming note: Use lowercase table, namespace, and column names for S3 Tables integrated with AWS analytics services. Mixed-case names may not be visible to Athena / Glue / Lake Formation. See S3 Tables naming rules.
Performance Boundaries Not Yet Validated
This PoC validates the architecture shape, not production scale limits. The following require separate testing:
- FSx throughput impact under concurrent NFS/SMB/S3 access
- S3 Access Point metadata operation impact under large namespace scans
- S3 API request concurrency vs FSx provisioned throughput capacity
- Impact of scan jobs on production SMB/NFS latency
- ListObjectsV2 pagination behavior at 1M+ files
- Lambda concurrency and S3 AP request throttling
- Iceberg manifest growth and compaction behavior
- Athena query latency with high snapshot counts
- OpenSearch indexing throughput during bulk backfill
- File size distribution and small-file amplification effects
- Cold vs warm namespace access behavior (capacity pool reads during backfill)
ONTAP Object Model Mapping
| ONTAP / FSx object | Role in this pattern |
|---|---|
| FSx file system | Performance / HA boundary |
| SVM | Protocol and administrative boundary |
| Volume | Catalog scope and S3 Access Point attachment target |
| Junction path / SMB share | Existing application namespace |
| S3 Access Point | S3 API boundary for AI/analytics (with associated file-system identity) |
| Iceberg table | Metadata catalog, not raw data store |
Each S3 Access Point has an associated OntapFileSystemIdentity (UNIX UID/GID or Windows domain user) that authorizes all file access through that AP. IAM policy is evaluated first, then ONTAP file-system permissions. See security/s3-access-point-identity-matrix.yaml.
Iceberg Table Maintenance Plan
For production, define:
- Snapshot retention period and table maintenance behavior — verify S3 Tables service-managed policies and any configurable retention settings
- Manifest rewrite cadence (if metadata table grows large)
- Orphan file cleanup policy
- Deduplication view or materialized latest-record table
- Time travel retention policy
- Athena engine version and Iceberg version compatibility
- Append-only dedup query as default named query for analysts
For operational steps, see ops/iceberg-maintenance-runbook.md. For details on Iceberg spec vs S3 Tables service behavior, see docs/standards-vs-service-behavior.md.
Iceberg does not enforce primary-key uniqueness in this PoC. Consumers should query curated latest-record views instead of the append-only base table. See ops/athena-named-queries/latest_records.sql in the repo.
Apache Iceberg is the open table format. Amazon S3 Tables is an AWS managed table bucket service that uses Apache Iceberg. Some operational behavior, endpoint support, and governance integration are AWS service-specific and should be validated separately from the Iceberg specification itself.
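The latest-record logic that such a view applies can be sketched in plain Python. This is not the repo's SQL: it only illustrates the rule "keep the newest row per file_id, then drop files whose newest event is a delete". The `scanned_at` column name is an assumption; `file_id` and `change_type` follow the column names used elsewhere in this article.

```python
def latest_records(rows: list[dict]) -> list[dict]:
    """Collapse append-only metadata rows to one current row per file_id."""
    latest: dict[str, dict] = {}
    for row in rows:
        current = latest.get(row["file_id"])
        # ISO-8601 UTC timestamps in one format compare correctly as strings.
        if current is None or row["scanned_at"] > current["scanned_at"]:
            latest[row["file_id"]] = row
    # A file whose most recent event is a delete is no longer current.
    return [r for r in latest.values() if r.get("change_type") != "deleted"]


# Toy append-only history (values are illustrative).
history = [
    {"file_id": "a", "scanned_at": "2026-05-31T01:00:00Z", "change_type": "created",
     "classification": None},
    {"file_id": "a", "scanned_at": "2026-05-31T02:00:00Z", "change_type": "modified",
     "classification": "Invoice"},
    {"file_id": "b", "scanned_at": "2026-05-31T01:00:00Z", "change_type": "created",
     "classification": None},
    {"file_id": "b", "scanned_at": "2026-05-31T03:00:00Z", "change_type": "deleted",
     "classification": None},
]

current_rows = latest_records(history)
print(current_rows)
```

In SQL, the same rule is usually written with a window function that ranks rows per file_id by scan time and keeps the first one.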
File Identity Strategy
| file_id method | Best for | Tradeoff |
|---|---|---|
| hash(volume_id + normalized_path) | General purpose | Rename = new file_id |
| hash(volume_id + file_handle/inode) | Rename tracking | Requires inode access |
| Content hash (SHA-256) | Immutable documents | Expensive for large files |
| path + last_modified + size | Lightweight PoC only | Fragile under overwrites |
Production should define how rename, overwrite, delete, and permission changes are represented in the metadata table.
Recommended production columns: source_system_id, volume_id, normalized_path, path_hash, content_hash, scan_run_id, change_type (created / modified / deleted / renamed / permission_changed).
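A minimal sketch of the general-purpose option, hash(volume_id + normalized_path), is shown below. The normalization rules (forward slashes, collapsed duplicate separators, a leading slash, no case folding) and the choice of SHA-256 are our assumptions; the article does not specify them.

```python
import hashlib
import posixpath


def normalize_path(path: str) -> str:
    """Normalize to a single canonical form: forward slashes, one leading slash."""
    unified = path.replace("\\", "/").lstrip("/")
    return posixpath.normpath("/" + unified)


def path_file_id(volume_id: str, path: str) -> str:
    """file_id = SHA-256 over volume_id and the normalized path."""
    # A NUL separator avoids collisions between ("vol1", "2/a") and ("vol12", "/a").
    key = f"{volume_id}\x00{normalize_path(path)}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Equivalent spellings of the same path map to the same file_id.
same_a = path_file_id("vol-01", "/docs//invoices/./a.pdf")
same_b = path_file_id("vol-01", "docs/invoices/a.pdf")

# A rename (or move) produces a new file_id, which is the tradeoff in the table.
renamed = path_file_id("vol-01", "/archive/invoices/a.pdf")
print(same_a == same_b, same_a == renamed)
```

If rename tracking matters, store this value as path_hash and use an inode-based or content-based identifier as the primary file_id instead.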
For FlexClone-based dev/test datasets, decide whether cloned files should retain lineage to source files. If lineage matters, store clone_parent_volume_id, clone_parent_snapshot_id, and catalog_environment (prod / dev / test / dr). See dr/snapmirror-catalog-rebinding.md for DR failover considerations.
For manufacturing and engineering workloads, see schema/extensions/manufacturing_metadata.yaml for domain-specific metadata fields such as part number, revision, plant, machine, and inspection lot.
Multi-Tenant Deployment Considerations
If this pattern is provided by a partner or platform team to multiple business units or customers, define the isolation boundary explicitly.
| Isolation model | Recommended when | Tradeoff |
|---|---|---|
| Table bucket per tenant | Strong isolation required | Higher operational overhead |
| Namespace per tenant | Balanced isolation and operations | Shared table bucket governance required |
| tenant_id column in one table | Internal multi-BU catalog | Requires strict LF-Tags / row filters |
| OpenSearch index per tenant | Search isolation required | More index management |
| Shared OpenSearch index + tenant filter | Lower cost | Must enforce filter in every query path |
For partner-led deployments, document tenant onboarding automation, offboarding deletion/retention policy, per-tenant cost allocation tags, and audit evidence location.
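For the shared-index model, one way to "enforce the filter in every query path" is to make the tenant filter part of the only function that builds search requests. The sketch below produces an OpenSearch query body with a kNN clause and a term filter. The field names `embedding` and `tenant_id` are assumptions, and this form filters inside a bool query; whether you should instead use the k-NN plugin's own filter option depends on your engine and version.

```python
def tenant_scoped_knn(tenant_id: str, vector: list[float], k: int = 5,
                      field: str = "embedding") -> dict:
    """Build a kNN search body that always carries the tenant filter."""
    if not tenant_id:
        # Refuse to build an unscoped query rather than search all tenants.
        raise ValueError("tenant_id is required")
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [{"knn": {field: {"vector": vector, "k": k}}}],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
    }


# Hypothetical tenant and a short toy vector.
body = tenant_scoped_knn("bu-finance", [0.12, 0.48, 0.33], k=3)
print(body["query"]["bool"]["filter"])
```

Because the filter is applied after the nearest neighbors are found in this form, a tenant with few documents may get fewer than k results, which is one reason the per-tenant index model trades higher management overhead for simpler isolation.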
Business KPI Mapping
| Business problem | Baseline metric | Target metric | How this PoC measures it |
|---|---|---|---|
| Employees cannot find documents | Average search time | < 10 sec | Search latency + result relevance |
| Manual classification is slow | Files classified/day/person | 10x improvement | AI enrichment throughput |
| Sensitive files are unknown | % files classified for PII | 95%+ coverage target | PII scan completion rate |
| Duplicate S3 copy is costly | Monthly duplicate storage cost | Reduce by 50%+ | Metadata-only architecture cost |
| AI projects lack data inventory | Dataset discovery lead time | Days → hours | Catalog completeness |
| Business users need governed discovery | % searchable assets in BI/AI tools | 80%+ of approved metadata visible | Expose curated metadata views to Athena, Databricks, Snowflake, or BI tools |
Try It Yourself
FSx for ONTAP prerequisites:
- SVM and volume selected as catalog scope
- S3 Access Point attached to the target volume
- Associated UNIX or Windows identity documented
- NFS/SMB production workload impact reviewed
- CloudWatch metrics dashboard enabled
# Clone the repo
git clone https://github.com/Yoshiki0705/fsxn-lakehouse-integrations.git
cd fsxn-lakehouse-integrations/integrations/iceberg-metadata-catalog
# Install dependencies
pip install -r requirements.txt
# Run the demo (requires FSx for ONTAP with S3 Access Point)
cd demo/scripts
./run-demo.sh --ap-alias <your-ap-alias-ext-s3alias>
Don't have FSx for ONTAP? You can still explore the architecture:
What's Next
This is Part 1 of a 3-part series:
- Part 1 (this article): Architecture & PoC Results
- Part 2: AI Enrichment Pipeline — Bedrock Vision + Titan Embeddings + OpenSearch NextGen
- Part 3: Governance & Cross-Platform Access — Lake Formation, PII Anonymization, Databricks/Snowflake Integration
Key Takeaways
- Don't copy data to make it searchable — catalog the metadata instead. Apache Iceberg + S3 Tables gives you a managed metadata layer with time travel.
- Selective AI enrichment plus scale-to-zero search can keep PoC and low-traffic environments cost-efficient — compute/search components idle near $0; persistent metadata and logs may incur small charges.
- 42 seconds, $0.07 — that's the barrier to entry for an AI-powered data catalog on your existing NAS storage.
- Start small, grow incrementally — from metadata-only scan (Level 1) to full business workflow integration (Level 5). See the Production Maturity Model for the progression path.
All code and documentation is available at github.com/Yoshiki0705/fsxn-lakehouse-integrations. Feedback welcome via GitHub Issues.