Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 22

Query NAS Data In Place with Athena and FSx for ONTAP S3 Access Points

#aws #amazonfsxfornetappontap #athena #lakehouse

TL;DR

You can query files stored on Amazon FSx for NetApp ONTAP directly from Amazon Athena through an FSx-attached S3 Access Point — without copying the source data to an S3 bucket. The source files remain on the FSx for ONTAP volume and are accessed through S3 object APIs.

I verified this end-to-end: Parquet files written via NFS are immediately queryable from Athena using the official AWS tutorial pattern.

This is Part 1 of a series exploring how FSx for ONTAP S3 Access Points integrate with various Lakehouse platforms. Part 2 covers Databricks — where platform security boundaries make things significantly more complex.

GitHub Repository: fsxn-lakehouse-integrations

If you want to reproduce this validation, start from the repository's integrations/athena/ directory, which contains CloudFormation templates, sample data generators, and query scripts.

What Is Verified in This Article

Verified:

NFS-written Parquet file is visible via FSx S3 AP (ListObjectsV2, StorageClass: FSX_ONTAP)
Athena can query the file through Glue Data Catalog
Standard S3 bucket result location works as the documented pattern
Experimental FSx S3 AP result output worked in my environment

Not verified:

Delta / Hudi / Iceberg writes
CTAS production pattern to FSx S3 AP
S3 bucket event notification semantics
Large-scale performance limits
CloudTrail data event coverage (audit evidence approach should be validated per environment)

Why This Matters

Enterprise file servers hold massive amounts of data — design files, inspection images, research documents, log archives. Traditionally, to analyze this data with cloud-native tools like Athena, you had to:

Copy data from NFS/SMB to S3 (DataSync, scripts, etc.)
Maintain sync pipelines
Pay for duplicate storage
Deal with stale data

FSx for ONTAP S3 Access Points (launched December 2025) change this. The same volume that serves NFS/SMB clients now exposes an S3-compatible API. Athena queries hit the same bytes that your NFS clients read — no copy required for the source dataset.

Users (NFS/SMB)                    Athena (S3 API)
      │                                  │
      ▼                                  ▼
┌─────────────────────────────────────────────┐
│         FSx for ONTAP Volume                │
│         /analytics/sensor_data.parquet      │
│         /analytics/logs/*.json              │
└─────────────────────────────────────────────┘

Use Cases This Unlocks

This pattern is useful when enterprise data already lives on NFS/SMB file shares and analytics teams want to query it without building a copy pipeline to S3.

Examples:

Manufacturing: Sensor logs, inspection results, quality reports produced by factory systems
SAP / ERP: Batch export files, operational reports, reconciliation extracts, and analytics copies (not a direct replacement for application-native persistence or HA design)
Financial services: Reconciliation files, transaction logs, regulatory extracts
Healthcare research: De-identified datasets, imaging metadata, study outputs
EDA / Semiconductor: Design artifacts, simulation outputs, verification logs
Enterprise file services: Archives for compliance analysis, audit evidence

Mission-critical workload note
This pattern provides an analytics read-access layer for existing file data. It does not replace workload-specific HA, backup, Snapshot, SnapMirror, or DR designs. For SAP, databases, VDI, and enterprise file services, treat Athena-on-FSx as an analytics and evidence layer, not as the primary resilience architecture.

Workload Isolation Guidance

For mission-critical workloads, do not point exploratory analytics directly at the same directory used by latency-sensitive application writes unless the operational impact has been tested.

Recommended pattern:

Application-owned path: /prod/app-output/
Analytics landing path: /analytics/curated/
Athena query result path: Standard S3 bucket (conservative), or a separately validated output path
Snapshot / backup policy: Owned by the workload team
Glue/Athena access: Owned by the analytics platform team

For SAP, database exports, or ERP file drops, treat this pattern as a read-access analytics layer. Do not change application HA, backup, restore, or DR design just because the files are queryable through S3 APIs.

In this context, an analytics copy means an application-produced or batch-exported file that is safe for downstream analytics, not the primary application persistence path.

Operational Impact Validation

Before production use, validate operational impact:

Baseline NFS/SMB workload latency and throughput before enabling analytics queries
Athena query behavior during normal application write activity
FSx provisioned throughput utilization during scans (analytics and application workloads share the same backend throughput)
Query concurrency limits for the analytics team
Rollback plan if analytics workload affects application workload

Recommended metrics include FSx throughput utilization, client-side NFS/SMB latency, Athena query runtime, bytes scanned, and application-side error or timeout rates during query execution.

Rollback plan examples include disabling the Athena workgroup, revoking the S3 Access Point policy for analytics roles, reducing analytics query concurrency, or moving analytics to an isolated curated path.

What This Means for Production

For production, treat this as a shared-storage analytics access pattern. The value is eliminating source data copy; the responsibility is validating workload isolation, throughput impact, governance, and rollback.

This article is not a production certification. It is intended to start a production readiness discussion around workload isolation, governance, and rollback.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  AWS Account                                                    │
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌────────────────┐   │
│  │ FSx for ONTAP│     │ S3 Access    │     │ Athena         │   │
│  │ Volume       │◄────│ Point        │◄────│ (Serverless)   │   │
│  │              │     │ (Internet    │     │                │   │
│  │ /analytics/  │     │  origin)     │     │ SELECT ...     │   │
│  └──────────────┘     └──────────────┘     │ FROM table     │   │
│        ▲                     ▲             └────────────────┘   │
│        │                     │                      │           │
│   NFS/SMB clients      Glue Crawler          Query results      │
│   (write data)         (schema discovery)    (→ S3 bucket)      │
└─────────────────────────────────────────────────────────────────┘

Key points:

The access point must use the Internet network origin. Athena reaches S3 from AWS-managed infrastructure outside your VPC, so the AWS tutorial requires this origin for the Athena path; VPC-origin access points deny requests from Athena.
Glue Data Catalog provides the schema layer between Athena and the S3 AP.
Query results are written to an S3 bucket (the standard Athena pattern), not back to the FSx volume. See Observed Behavior for an experimental alternative.

Prerequisites

FSx for ONTAP file system (ONTAP 9.17.1+)
A volume with data (Parquet, CSV, JSON, etc.)
S3 Access Point created with Internet network origin
An Athena workgroup with a query results location (standard S3 bucket)
IAM permissions for Athena, Glue, and S3 AP access

Step 1: Create the S3 Access Point

aws fsx create-and-attach-s3-access-point \
  --name my-analytics-ap \
  --type ONTAP \
  --ontap-configuration '{
    "VolumeId": "<YOUR_VOLUME_ID>",
    "FileSystemIdentity": {
      "Type": "UNIX",
      "UnixUser": {"Name": "fsxn_athena_reader"}
    }
  }' \
  --region <YOUR_REGION>

Wait for the lifecycle to become AVAILABLE:

aws fsx describe-s3-access-point-attachments \
  --filters Name=volume-id,Values=<YOUR_VOLUME_ID> \
  --region <YOUR_REGION> \
  --query 'S3AccessPointAttachments[].{Name:Name,Lifecycle:Lifecycle,Alias:S3AccessPoint.Alias}'

Output:

[{
  "Name": "my-analytics-ap",
  "Lifecycle": "AVAILABLE",
  "Alias": "my-analytics-ap-xxxxxxxxxxxxxxxxxxxxxxxxxxxx-ext-s3alias"
}]

Note: The alias ending in -ext-s3alias identifies this as an FSx for ONTAP S3 Access Point (as opposed to regular S3 Access Points which end in -s3alias).
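If you script this step rather than re-running the CLI by hand, the wait can be sketched in Python. This is a minimal sketch under assumptions: the client is expected to behave like boto3's FSx client, the method and response field names mirror the CLI call and --query expression above, and wait_for_attachment and is_fsx_ontap_alias are hypothetical helper names, not part of any SDK.

```python
import time


def is_fsx_ontap_alias(alias):
    """True if the alias has the -ext-s3alias suffix described in the note above."""
    return alias.endswith("-ext-s3alias")


def wait_for_attachment(fsx_client, volume_id, name,
                        timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll until the named attachment is AVAILABLE, then return its alias.

    fsx_client is assumed to behave like boto3.client("fsx").
    """
    waited = 0
    while True:
        resp = fsx_client.describe_s3_access_point_attachments(
            Filters=[{"Name": "volume-id", "Values": [volume_id]}]
        )
        for att in resp.get("S3AccessPointAttachments", []):
            if att.get("Name") == name and att.get("Lifecycle") == "AVAILABLE":
                return att["S3AccessPoint"]["Alias"]
        if waited >= timeout_s:
            raise TimeoutError(f"{name} not AVAILABLE after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   alias = wait_for_attachment(boto3.client("fsx"), "<YOUR_VOLUME_ID>", "my-analytics-ap")
```

The client and the sleep function are passed in so the loop can be exercised with a stub before it is pointed at a real account.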

Security note for file-system identity
This walkthrough uses a dedicated read-only identity (fsxn_athena_reader). Make sure the corresponding UNIX/Windows permissions allow read access to the analytics path. Avoid using root in production — scope the identity to the minimum permissions required.

Step 2: Set the Access Point Policy

This walkthrough uses role-based principals for Athena and Glue. Replace the placeholder role ARNs with the IAM roles used by your Athena workgroup and Glue crawler. Avoid account-wide principals in production.

aws s3control put-access-point-policy \
  --account-id <YOUR_ACCOUNT_ID> \
  --name my-analytics-ap \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowAnalyticsRead",
      "Effect": "Allow",
      "Principal": {"AWS": [
        "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<ATHENA_QUERY_ROLE>",
        "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<GLUE_CRAWLER_ROLE>"
      ]},
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap",
        "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap/object/*"
      ]
    }]
  }' \
  --region <YOUR_REGION>

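The policy above is the conservative read-only analytics policy. If you intentionally test query result output to the FSx S3 Access Point (see Observed Behavior), the file-system identity attached to the access point also needs write permission on the output directory. On the IAM side, add s3:PutObject scoped to the experimental output prefix only: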

{
  "Sid": "AllowExperimentalResultWrite",
  "Effect": "Allow",
  "Principal": {"AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<ATHENA_QUERY_ROLE>"},
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap/object/athena-results/*"
}

Security note: FSx for ONTAP S3 Access Points enforce S3 Block Public Access, and this cannot be disabled. All requests require valid IAM credentials. Additionally, the file system user associated with the access point must have read permission on the files being queried.

Policy note: The policy above is the minimum that worked in my validation. If your Glue crawler or Athena workgroup reports location-related access errors, compare the policy with the official tutorial and CloudTrail events, and add only the required actions.
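To keep the read-only policy consistent across environments, the JSON document can be generated instead of hand-edited. The following is a sketch only: build_read_policy is a hypothetical helper, and the check that rejects account-wide principals is an illustrative guard based on the advice in this step, not an AWS requirement.

```python
import json


def build_read_policy(region, account_id, ap_name, role_arns):
    """Return the read-only access point policy from Step 2 as a dict."""
    for arn in role_arns:
        # Illustrative guard: refuse wildcard or account-root principals.
        if arn == "*" or arn.endswith(":root"):
            raise ValueError(f"account-wide principal not allowed: {arn}")
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowAnalyticsRead",
            "Effect": "Allow",
            "Principal": {"AWS": list(role_arns)},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [ap_arn, f"{ap_arn}/object/*"],
        }],
    }


if __name__ == "__main__":
    policy = build_read_policy(
        "ap-northeast-1", "111122223333", "my-analytics-ap",
        ["arn:aws:iam::111122223333:role/athena-query-role"],  # placeholder role
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be passed to the --policy argument of the put-access-point-policy command shown above.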

Step 3: Upload Test Data via NFS

On a machine with NFS access to the FSx volume:

import pandas as pd
import numpy as np

# Generate 10,000 rows of sensor data
np.random.seed(42)
n_rows = 10000
df = pd.DataFrame({
    'timestamp': pd.date_range('2026-01-01', periods=n_rows, freq='1min'),
    'sensor_id': np.random.choice(['sensor_A', 'sensor_B', 'sensor_C',
                                    'sensor_D', 'sensor_E'], n_rows),
    'temperature': np.round(np.random.normal(25, 5, n_rows), 2),
    'humidity': np.round(np.random.uniform(30, 90, n_rows), 2),
    'pressure': np.round(np.random.normal(1013, 10, n_rows), 2),
    'status': np.random.choice(['normal', 'warning', 'critical'], n_rows,
                                p=[0.85, 0.12, 0.03])
})

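# Write as Parquet to the NFS-mounted volume.
# Requires pyarrow or fastparquet; the sensor-data/ directory must already exist.
df.to_parquet('/mnt/fsxn/analytics/sensor-data/sensor_data.parquet', index=False)
# memory_usage() reports the in-memory DataFrame size, not the Parquet file size on disk
print(f"Written {len(df)} rows, {df.memory_usage(deep=True).sum()/1024:.0f} KB in memory")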

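The same file is now accessible via both NFS (/mnt/fsxn/analytics/sensor-data/sensor_data.parquet) and the S3 API (s3://<AP_ALIAS>/sensor-data/sensor_data.parquet). The S3 key is the file's path relative to the volume root, which in this setup corresponds to the NFS mount point /mnt/fsxn/analytics.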
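The path-to-key relationship can be written down as a small helper. This sketch assumes the volume root is mounted at /mnt/fsxn/analytics, which is inferred from the two paths above rather than stated by the tutorial; nfs_path_to_s3_uri is a hypothetical name.

```python
from pathlib import PurePosixPath


def nfs_path_to_s3_uri(nfs_path, volume_mount, ap_alias):
    """Map a file path under the NFS mount to its S3 URI on the access point.

    Raises ValueError if nfs_path is not under volume_mount.
    """
    key = PurePosixPath(nfs_path).relative_to(PurePosixPath(volume_mount))
    return f"s3://{ap_alias}/{key.as_posix()}"


if __name__ == "__main__":
    print(nfs_path_to_s3_uri(
        "/mnt/fsxn/analytics/sensor-data/sensor_data.parquet",
        "/mnt/fsxn/analytics",   # assumed mount point of the volume root
        "<AP_ALIAS>",
    ))
```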

Step 4: Verify S3 AP Access

# Set AP_ALIAS to the alias returned in Step 1
AP_ALIAS="<AP_ALIAS>"

aws s3api list-objects-v2 \
  --bucket "$AP_ALIAS" \
  --prefix "sensor-data/" \
  --region <YOUR_REGION>

Output:

{
  "Contents": [{
    "Key": "sensor-data/sensor_data.parquet",
    "Size": 252858,
    "StorageClass": "FSX_ONTAP"
  }]
}

Note the StorageClass: FSX_ONTAP — this confirms the data lives on FSx, not S3.
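The same check can be automated for prefixes that hold more than one file. A minimal sketch, assuming a client that behaves like boto3's S3 client and accepts the access point alias as the Bucket argument, as the CLI call above does; list_storage_classes is a hypothetical helper name.

```python
def list_storage_classes(s3_client, ap_alias, prefix):
    """Return {key: storage_class} for every object under the prefix.

    s3_client is assumed to behave like boto3.client("s3").
    """
    result = {}
    kwargs = {"Bucket": ap_alias, "Prefix": prefix}
    while True:
        resp = s3_client.list_objects_v2(**kwargs)
        for obj in resp.get("Contents", []):
            result[obj["Key"]] = obj.get("StorageClass")
        if not resp.get("IsTruncated"):
            return result
        # Follow pagination for prefixes with many objects.
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]


def all_on_fsx(storage_classes):
    """True if at least one object exists and every object reports FSX_ONTAP."""
    return bool(storage_classes) and all(
        sc == "FSX_ONTAP" for sc in storage_classes.values()
    )
```

Passing the client in keeps the helper testable with a stub; with real credentials you would call it with boto3.client("s3") and the alias from Step 1.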

Step 5: Create Glue Database and Table

aws glue create-database \
  --database-input '{"Name": "fsxn_analytics"}' \
  --region <YOUR_REGION>

You can either run a Glue Crawler for automatic schema discovery (recommended by the AWS tutorial), or create the table manually via Athena:

CREATE EXTERNAL TABLE fsxn_analytics.sensor_data (
  `timestamp` TIMESTAMP,
  sensor_id STRING,
  temperature DOUBLE,
  humidity DOUBLE,
  pressure DOUBLE,
  status STRING
)
STORED AS PARQUET
LOCATION 's3://<AP_ALIAS>/sensor-data/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
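A third option is to register the table through the Glue API. The sketch below builds a TableInput equivalent to the DDL above; the column list is copied from that DDL, the Parquet input/output format and SerDe class names are the standard Hive ones, and sensor_table_input is a hypothetical helper. Treat it as an illustration rather than the tutorial's method.

```python
def sensor_table_input(ap_alias):
    """Build a Glue TableInput for the sensor_data Parquet table."""
    columns = [
        ("timestamp", "timestamp"),
        ("sensor_id", "string"),
        ("temperature", "double"),
        ("humidity", "double"),
        ("pressure", "double"),
        ("status", "string"),
    ]
    return {
        "Name": "sensor_data",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": f"s3://{ap_alias}/sensor-data/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_table(
#       DatabaseName="fsxn_analytics", TableInput=sensor_table_input("<AP_ALIAS>"))
```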

Step 6: Query with Athena

Basic aggregation

SELECT
  sensor_id,
  COUNT(*) AS readings,
  ROUND(AVG(temperature), 2) AS avg_temp,
  ROUND(AVG(humidity), 2) AS avg_humidity,
  SUM(CASE WHEN status = 'critical' THEN 1 ELSE 0 END) AS critical_count
FROM fsxn_analytics.sensor_data
GROUP BY sensor_id
ORDER BY critical_count DESC;

Verified result

sensor_id | readings | avg_temp | avg_humidity | critical_count
----------|----------|----------|--------------|---------------
sensor_A  |    2027  |   24.89  |    59.84     |      68
sensor_B  |    1986  |   25.11  |    60.23     |      62
sensor_C  |    2013  |   24.95  |    59.91     |      59
sensor_E  |    2000  |   24.98  |    60.02     |      56
sensor_D  |    1974  |   25.03  |    60.15     |      55

Query time: 1.46 seconds | Data scanned: 67 KB | Engine: Athena v3
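To capture the same runtime and scanned-bytes figures programmatically, a query can be submitted and polled through the Athena API. This is a sketch under assumptions: the client is expected to behave like boto3's Athena client, run_query is a hypothetical helper, and the output location should be your standard S3 results bucket.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_query(athena_client, sql, output_location,
              workgroup="primary", poll_s=1.0, sleep=time.sleep):
    """Start a query, wait for a terminal state, and return its key metrics."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        WorkGroup=workgroup,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(
            QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        sleep(poll_s)
    stats = execution.get("Statistics", {})
    return {
        "query_execution_id": query_id,
        "state": state,
        "reason": execution["Status"].get("StateChangeReason"),
        "engine_ms": stats.get("EngineExecutionTimeInMillis"),
        "scanned_bytes": stats.get("DataScannedInBytes"),
    }
```

The returned dictionary holds the query execution ID, runtime, and scanned bytes, which are useful to keep as PoC evidence.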

Observed Behavior: Query Results Written to the FSx S3 Access Point

The AWS tutorial states:

"Athena reads data from your FSx for ONTAP volume through the access point. Athena query results are written to the Amazon S3 results bucket, not back to the FSx for ONTAP volume."

In my validation, however, setting OutputLocation to the FSx for ONTAP S3 Access Point alias succeeded and wrote the .csv and .metadata files back to the FSx volume:

aws athena start-query-execution \
  --query-string "SELECT 1 AS test" \
  --result-configuration \
    "OutputLocation=s3://<AP_ALIAS>/athena-results/" \
  --work-group primary \
  --region <YOUR_REGION>

Result: SUCCEEDED in 584ms

The result files appeared on the FSx volume and were immediately accessible via NFS.

Treat this as observed behavior from my environment, not a general production recommendation. The conservative production pattern is:

Source data: FSx for ONTAP S3 Access Point
Athena query results: Standard S3 bucket (as documented)

The experimental pattern validated in this post:

Source data: FSx for ONTAP S3 Access Point
Athena query results: FSx for ONTAP S3 Access Point (observed to work, not documented)

Validate this in your own environment before relying on it.

Governance warning: Do not enable experimental query result output to FSx S3 AP for sensitive datasets unless query result retention, encryption, audit evidence, and file-system permissions are reviewed. Query results may contain derived sensitive information. For sensitive datasets, experimental result output should require approval from the data owner, security owner, and workload owner.

Performance Characteristics

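Metric                                | Observed                | Notes
--------------------------------------|-------------------------|------
Simple SELECT query                   | 584 ms                  | SELECT 1 AS test; includes result write
Aggregation (10K rows, 67 KB scanned) | 1.46 s                  | GROUP BY over 5 sensors with 4 aggregate expressions
Data scan cost                        | Standard Athena pricing | $5 per TB scanned
Storage class                         | FSX_ONTAP               | Confirmed in ListObjectsV2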

Performance note
These numbers validate functional compatibility, not performance limits. The dataset is intentionally small: a single 252,858-byte Parquet file with 10K rows, of which the aggregation query scanned 67 KB. For real analytics workloads, test with realistic file sizes, object counts, partition layouts, concurrent queries, and FSx provisioned throughput. The throughput available through the S3 API depends on the FSx file system's provisioned throughput capacity (AWS documentation).
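To put the pricing row in context, here is a rough sketch of the scan-cost arithmetic. It assumes the $5 per TB rate from the table, a binary terabyte (1024^4 bytes), and a 10 MB minimum charge per query; check the current Athena pricing page for your Region before relying on any of these. estimate_scan_cost_usd is a hypothetical helper.

```python
TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * 1024 ** 2  # assumed 10 MB per-query minimum


def estimate_scan_cost_usd(scanned_bytes, price_per_tb=5.0):
    """Estimate the scan charge for one query under the stated assumptions."""
    billed_bytes = max(scanned_bytes, MIN_BILLED_BYTES)
    return billed_bytes / TB * price_per_tb


if __name__ == "__main__":
    # The 67 KB aggregation above falls below the assumed minimum,
    # so it would be billed as 10 MB.
    print(f"{estimate_scan_cost_usd(67 * 1024):.8f} USD")
    print(f"{estimate_scan_cost_usd(100 * 1024 ** 3):.4f} USD for a 100 GB scan")
```

Under these assumptions, scan cost is negligible at this test size, so the shared FSx throughput is the more relevant quantity to watch as datasets grow.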

S3 API Compatibility Boundary

FSx for ONTAP S3 Access Points expose file data through S3 object APIs, but they should not be treated as standard S3 buckets.

The safe mental model is:

Use S3 APIs for object read/write access to files on FSx
Use Glue and Athena for read-oriented analytics
Do not assume S3 bucket-level features exist (event notifications, versioning, lifecycle policies)
Do not assume lakehouse commit semantics (rename, conditional writes)
Validate every platform integration separately

In this article, the verified pattern is read-oriented analytics over Parquet/CSV/JSON files. Transactional table formats and commit protocols are outside the safe default boundary.

Compatibility Matrix

Legend for the "Validated by" column:

This validation: Actually executed commands or queries in this environment and confirmed the result
Supported operations review: Confirmed based on the supported operations documentation or official tutorial
Supported operations review required: Not yet confirmed; additional validation needed before use

Capability | Status | Validated by | Notes
-----------|--------|--------------|------
ListObjectsV2 | ✅ Verified | This validation | S3 AP alias worked
GetObject (Parquet scan) | ✅ Verified | This validation | Athena v3
PutObject (small result file) | ⚠️ Observed | This validation | Not documented as Athena result pattern
Glue table over S3 AP | ✅ Verified | This validation | Manual DDL and Crawler
CTAS to S3 AP | ❌ Failed in validation | This validation | Not part of the documented tutorial pattern; use standard S3 output
Delta Lake writes | ❌ Not recommended | Supported operations review | Commit protocol depends on rename/atomic semantics not available
Hudi/Iceberg writes | ❌ Not recommended | Supported operations review | Requires commit semantics beyond simple object read
S3 bucket event notifications | ❌ Not part of verified pattern | Supported operations review required | Do not assume bucket-level eventing; validate against supported operations

CTAS is a write-path pattern, not just a read query. Treat CTAS separately from read-oriented SELECT validation because it writes new table data to a target S3 location and may leave partial/orphaned files on failure. CTAS should not be included in the initial read-oriented validation scope.

Transactional lakehouse formats may require semantics beyond simple object read/write, such as:

Atomic commit behavior
Rename or move-like commit operations
Conditional writes (If-None-Match)
Manifest consistency
Concurrent writer coordination
Cleanup of partial/orphaned files

This article does not validate those semantics. It validates read-oriented analytics over existing files.

Governance and Compliance Considerations

This pattern keeps the source files on FSx for ONTAP, but it does not remove the need for data governance.

Before using this pattern with regulated or sensitive datasets, review:

Data classification of source files
IAM and S3 Access Point policy scope (least privilege)
File system identity mapped to the access point (UNIX/Windows user permissions apply)
Glue Data Catalog permissions (who can see the table metadata)
Athena workgroup controls (query limits, result encryption)
Query result location and retention (results may contain derived sensitive data)
CloudTrail / audit evidence requirements
Snapshot, backup, retention, and deletion policy

Query results can be more sensitive than the original dataset because they may aggregate, filter, or derive new information. Apply encryption, retention, and access controls to the Athena result location as carefully as the source dataset.

This article is a technical validation, not a compliance attestation.

Production Controls Checklist

For regulated or sensitive datasets, define the following before production use:

[ ] Athena workgroup result location (standard S3 bucket)
[ ] Whether workgroup settings override client-side result settings
[ ] Query result encryption mode and KMS key ownership
[ ] Query result retention and deletion policy
[ ] IAM principals allowed to query the Glue table
[ ] File-system identity mapped to the S3 Access Point (dedicated, not root)
[ ] Audit evidence approach defined and validated (e.g., CloudTrail coverage for the S3 Access Point where applicable, with sample events captured as PoC evidence)
[ ] Approval process for enabling experimental result output to FSx S3 AP

For regulated workloads, consider enabling Athena workgroup override so that query result location and encryption cannot be changed by client-side settings. This prevents individual clients from changing where query results are written or how they are encrypted.
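As an illustration of the override setting, the sketch below builds the arguments for an Athena workgroup that enforces its own result location and encryption. Field names follow the Athena CreateWorkGroup API as I understand it; the bucket URI, KMS key ARN, and the helper name enforced_workgroup_kwargs are placeholders and assumptions.

```python
def enforced_workgroup_kwargs(name, results_uri, kms_key_arn, scan_cutoff_bytes=None):
    """Build create_work_group arguments with client-side overrides disabled."""
    configuration = {
        "ResultConfiguration": {
            "OutputLocation": results_uri,
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": kms_key_arn,
            },
        },
        # Workgroup settings take precedence over client-side settings.
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
    }
    if scan_cutoff_bytes is not None:
        # Optional per-query scan limit for the analytics team.
        configuration["BytesScannedCutoffPerQuery"] = scan_cutoff_bytes
    return {"Name": name, "Configuration": configuration}


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("athena").create_work_group(**enforced_workgroup_kwargs(
#       "fsxn-analytics", "s3://<RESULTS_BUCKET>/athena/", "<KMS_KEY_ARN>"))
```

With enforcement on, an OutputLocation supplied by a client should be ignored in favor of the workgroup's location, which is the property the checklist asks you to confirm.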

For regulated workloads, experimental writeback should be disabled by default and enabled only after explicit approval from the data owner, security owner, and workload owner.

Experimental writeback may be enabled only when:

Approval scope is documented
Output path is isolated from source data
Encryption and retention are defined for the output path
Cleanup and rollback procedures are documented
Review expiration date is set

Minimum audit evidence artifacts for PoC completion:

Scope statement: what the audit evidence demonstrates and what it does not (e.g., "validates access path and query result control for PoC scope; does not demonstrate full production compliance")
Access path description (IAM → AP policy → file-system identity)
Sample successful read event
Sample denied access event (if applicable)
Query result location configuration
Encryption configuration
Workgroup override setting (if used)
Reviewer sign-off (name, role, date, decision)

30-Minute Validation Flow

Create or verify the FSx S3 Access Point (AVAILABLE lifecycle)
Write one Parquet file through NFS to the analytics path
Confirm StorageClass: FSX_ONTAP with list-objects-v2
Create the Glue table (manual DDL or crawler)
Run one Athena query
Capture the validation artifacts (see below)
Decide Go / No-Go using the PoC Success Criteria

First Success Path

If you are validating this for the first time, keep the scope small.

Expected outcome:

One Parquet file written through NFS is visible through the S3 Access Point
Glue table creation or crawler schema discovery succeeds
Athena can query the file in place
Query result location behavior is validated and documented
NFS/SMB clients can still access the original file
IAM and file-system identity boundaries are understood

Do not start with Delta Lake, Hudi, Iceberg writes, large scans, or concurrent workloads. Prove the read path first.

PoC Success Criteria

Minimum success:

S3 Access Point attachment is AVAILABLE
ListObjectsV2 returns the expected test file
Glue table points to the S3 AP alias
Athena query succeeds and returns correct results
Results are reproducible from a clean workgroup/session

Operational success:

IAM role and S3 AP policy are scoped to the analytics roles
Athena workgroup controls are defined
Query result location and retention are documented
Dataset size and scan cost are measured
FSx throughput impact is measured during query
Existing NFS/SMB application workload impact is measured during Athena queries

Go / No-Go criteria:

Go: Read-only analytics on Parquet/CSV/JSON works with acceptable latency and cost
No-Go: Workload requires Delta/Hudi/Iceberg write commits through the S3 AP
No-Go: Platform governance requires Unity Catalog external locations and the platform cannot yet authorize the S3 AP (see Part 2)

Performance Test Plan

Note: This section defines the performance test plan and metrics to collect. It does not present benchmark results. Actual benchmark outputs will be added under verification-pack/ after validation runs are completed.

The next validation should include:

1 GB / 10 GB / 100 GB datasets
Many small files vs fewer large Parquet files
Partitioned layout (date=YYYY-MM-DD/sensor_id=...)
Concurrent Athena queries
Different FSx throughput capacity settings (128 / 256 / 512+ MBps)
NFS writer activity during Athena scans
Standard S3 result bucket vs observed FSx S3 AP result output

The goal is to separate Athena scan behavior, Glue metadata behavior, and FSx provisioned-throughput impact.

Additional request pattern considerations:

Sequential vs parallel S3 API reads
Prefix layout impact on listing performance
Small object listing overhead
Repeated query behavior with warm Glue/Athena metadata

Metrics collection sources:

FSx metrics: CloudWatch (FSx namespace)
Athena query metrics: get-query-execution API (EngineExecutionTimeInMillis, DataScannedInBytes)
Client-side latency: CLI timing or SDK instrumentation
Error/timeout sources: Athena query execution status and failure reason, client-side logs, application-side timeout logs, CloudTrail events where applicable

Record results separately for each run type: cold run (at least 1), warm-metadata run (at least 1), and repeated runs (at least 3 executions). Report the average, minimum, maximum, and any notable outliers.
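The reporting step can be sketched as a small summary function. The outlier rule used here (more than twice the median) is an arbitrary illustrative threshold, not something defined by the test plan, and summarize_runs is a hypothetical helper.

```python
import statistics


def summarize_runs(durations_ms, outlier_factor=2.0):
    """Summarize one run type: average, min, max, and simple outliers."""
    median = statistics.median(durations_ms)
    return {
        "runs": len(durations_ms),
        "average": statistics.fmean(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
        # Illustrative rule: flag runs slower than outlier_factor x median.
        "outliers": [d for d in durations_ms if d > outlier_factor * median],
    }


if __name__ == "__main__":
    # Placeholder timings in milliseconds, not measured results.
    print(summarize_runs([100, 110, 90, 400]))
```

Keeping cold, warm-metadata, and repeated runs in separate lists and summarizing each one avoids averaging a cold start into the steady-state numbers.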

Validation Artifacts

For reproducibility, capture the following artifacts in your PoC:

S3 Access Point attachment lifecycle output (describe-s3-access-point-attachments)
list-objects-v2 output showing StorageClass: FSX_ONTAP
Glue table DDL or crawler output
Athena query execution ID
Athena query runtime and scanned bytes
Query result location and file listing
NFS listing showing the original source file is unchanged
IAM policy and access point policy used for the test

What's Next

In Part 2, I'll cover what happens when you try to connect Databricks to FSx for ONTAP S3 Access Points — where Unity Catalog's session policy, seccomp filters, and platform security boundaries create a significantly more complex picture.

References

This article is part of the "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive" series. All tests were performed in a real AWS environment with FSx for ONTAP (ONTAP 9.17.1, ap-northeast-1) in May 2026.

Scope reminder: This article verifies a limited read-oriented scenario. It does not validate production readiness, write-path behavior, distributed executor-scale processing, or all third-party analytics engines.

Article update plan: v1.0 (current) — Scope, observed behavior, validation plan. Future updates: v1.1 — Benchmark results with realistic datasets. v1.2 — Security Verified candidate review. v1.3 — Production workload isolation test results.

DEV Community