Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on Jun 7

Evidence Expansion, Presigned URL Discovery, and Operational Surprises — FSx for ONTAP S3 Access Points, Phase 14

#aws #amazonfsxfornetappontap #s3accesspoints #serverless

TL;DR

Phase 14 shifts from building patterns to hardening the evidence base. After publishing Phase 13's field-ready reference architecture, I focused on post-publication refinement: Partner/SI delivery assets, benchmark methodology standardization, S3 AP compatibility clarification (presigned URLs work even though the documentation lists them as "Not supported"), and an unexpected operational discovery: S3 Access Points can become unavailable during FSx throughput capacity changes.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Why Phase 14?

Phase 13 delivered the field-ready baseline. Phase 14 answers the question: "Now that the patterns exist, how do we make them easier to evaluate, adopt, and operate?"

The work falls into four categories:

Partner/SI delivery acceleration — one-pager, improved PoC templates, FC1-FC6 conversation starters
Benchmark methodology — standardized run IDs, hypothesis-driven testing, Range GET plans
Compatibility clarification — Presigned URL behavior confirmed with AWS Support
Operational discovery — S3 AP unavailability during throughput capacity changes

1. Partner/SI One-Pager: What / When / How / Where

Partners and SIs told us the existing 7-step delivery checklist was comprehensive but too long for a first conversation. Phase 14 adds a single-page overview that answers four questions:

Section	Content
What	28 UCs + 6 FC patterns, CloudFormation templates, 4-level maturity model
When	Customer has FSx for ONTAP + needs serverless file processing + permission-aware access
How	Identify UC → Deploy template → Measure baseline → Evaluate Go/No-Go
Where	Links to Success Metrics, Governance, Production Readiness, Benchmarks

Available in both Japanese and English.

FC1-FC6 Recommended First Questions

Each FlexCache/FlexClone pattern now has a recommended first conversation question — the question a Partner/SI should ask to determine if the pattern is relevant:

Pattern	First Question
FC1 (Anycast/DR)	"What is your current read latency from remote sites, and what target would justify a caching layer?"
FC2 (Render)	"How many concurrent render jobs share the same source data, and what is the job lifecycle?"
FC3 (RAG)	"Which file shares contain the knowledge base, and do access permissions need to be preserved in RAG results?"
FC4 (CAE)	"What is the typical solver output size and how quickly must results be available for post-processing?"
FC5 (Life Sciences)	"How do you currently share research datasets between teams while maintaining data governance?"
FC6 (Gaming)	"What is your current build pipeline duration and which asset validation steps are bottlenecks?"

2. Presigned URLs: "Not Supported" but Working

⚠️ Production Warning: AWS Support explicitly states that operations marked "Not supported" should NOT be relied upon for production workloads, even when they return success today. The behavior may change without deprecation notice, return inconsistent results across regions, or stop working after service updates. Design alternatives for any workflow that requires presigned URL access to FSx for ONTAP S3 Access Points.

The Discovery

The FSx for ONTAP S3 AP compatibility table lists Presign as "Not supported". However, in my testing, presigned URLs for GetObject worked successfully.

AWS Support Clarification

After raising this with AWS Support, the explanation was clear:

Presigning is client-side only — aws s3 presign computes a SigV4 signature locally. No network request is made.
The presigned URL executes a standard GetObject — signature is in query parameters instead of the Authorization header.
Since GetObject is Supported, presigned URLs cannot be blocked without breaking GetObject itself.
The documentation likely intended to indicate that presigned URL workflows are not officially tested.
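
To make the "client-side only" point concrete, here is a minimal sketch of SigV4 query-string signing using only the Python standard library. It is not the AWS CLI's implementation, and the host name, access key, and secret in the usage example are hypothetical placeholders. It assumes long-term credentials; temporary credentials would also need an X-Amz-Security-Token parameter, which is omitted here.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_get(host, key, region, access_key, secret_key, expires=900, now=None):
    """Build a SigV4 query-signed GET URL. Pure computation, no network I/O."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    day = now.strftime("%Y%m%d")
    scope = f"{day}/{region}/s3/aws4_request"
    uri = "/" + quote(key, safe="/-_.~")
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: URI-encoded and sorted by parameter name.
    query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(params.items())
    )
    canonical_request = "\n".join(
        ["GET", uri, query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    signing_key = ("AWS4" + secret_key).encode("utf-8")
    for part in (day, region, "s3", "aws4_request"):
        signing_key = _hmac(signing_key, part)
    signature = hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"https://{host}{uri}?{query}&X-Amz-Signature={signature}"


# Hypothetical host and credentials, for illustration only.
url = presign_get(
    "example-alias.s3.ap-northeast-1.amazonaws.com",
    "reports/summary.json",
    "ap-northeast-1",
    "AKIDEXAMPLE",
    "example-secret",
)
```

Nothing in this function opens a socket. The service first sees the signature when someone issues the GET, at which point it is handled as a GetObject request.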

Production Guidance

Feature	Status	Guidance
GetObject, PutObject, ListObjectsV2	Supported	Build on freely
Conditional writes (If-None-Match)	Blocked	Returns NotImplemented
Presigned URLs	Not supported (doc)	Works but do not rely on for production

AWS Support has escalated documentation clarification to the FSx for ONTAP service team. The distinction between "Not supported + hard-blocked" (returns error) and "Not supported + may incidentally work" (no guarantees) is being reviewed.

Readers should verify the latest AWS documentation before relying on this behavior, as the status may change.

Alternatives to Presigned URLs

If you need time-limited or delegated file access without relying on unsupported behavior:

API Gateway + Lambda proxy with IAM or JWT authorization
CloudFront signed URLs backed by a controlled Lambda@Edge origin
Temporary STS credentials with scoped IAM permissions (time-limited, per-object or prefix)
Application-level download broker with audit logging and access revocation
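
As one illustration of the STS option above, the sketch below builds a session policy that limits temporary credentials to GetObject under a single prefix. The ARN layout follows the standard S3 access point object ARN format; that FSx-attached access points accept the same resource format is an assumption to verify. The account ID, access point name, and prefix are hypothetical.

```python
import json


def scoped_read_policy(access_point_arn: str, prefix: str) -> str:
    """Return a session policy (JSON string) allowing GetObject under one prefix."""
    prefix = prefix.strip("/")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Assumed format: <access point ARN>/object/<key prefix>/*
                "Resource": f"{access_point_arn}/object/{prefix}/*",
            }
        ],
    }
    return json.dumps(policy)


# Hypothetical ARN and prefix.
policy_json = scoped_read_policy(
    "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap",
    "reports/2026",
)

# Usage sketch (needs boto3 and a real role, so it is not executed here):
#   creds = boto3.client("sts").assume_role(
#       RoleArn=role_arn,
#       RoleSessionName="download-broker",
#       Policy=policy_json,
#       DurationSeconds=900,  # 900 seconds is the minimum STS allows
#   )["Credentials"]
```

The effective permissions are the intersection of the role's policy and this session policy, so the broker can hand out credentials that expire on their own and never exceed the chosen prefix.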

For a broader comparison with standard S3 bucket semantics, see the S3 Bucket User Guide.

3. Benchmark Methodology: Hypothesis-Driven Testing

Why 1769 MB Lambda Memory?

Lambda allocates CPU in proportion to configured memory, and network bandwidth tends to scale with it as well. At 1769 MB, a function has the equivalent of 1 vCPU, which gives consistent and reproducible network throughput for benchmark measurements. Lower memory settings would introduce variable network bandwidth as a confounding factor.

Benchmark Run ID Convention

Every benchmark run now follows a standardized format:

s3ap-bench-{YYYY-MM-DD}-{seq}

With mandatory fixed conditions:

Region: ap-northeast-1
Lambda memory: 1769 MB (1 vCPU)
Lambda architecture: arm64
FSx Throughput Capacity: [128 / 256 / 512] MBps
Iterations per data point: 50
Statistics: p50, p90, p95, p99, min, max
Concurrent NFS/SMB workload: [None / Light / Production-level]
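
A minimal sketch of how the run ID convention and the required statistics could be generated consistently. The two-digit zero padding of `{seq}` and the nearest-rank percentile definition are assumptions; the article does not specify either.

```python
import re
from datetime import date

# Matches s3ap-bench-{YYYY-MM-DD}-{seq}; the seq width is not fixed here.
RUN_ID_RE = re.compile(r"^s3ap-bench-\d{4}-\d{2}-\d{2}-\d+$")


def make_run_id(day: date, seq: int) -> str:
    # Assumption: seq is zero-padded to two digits.
    return f"s3ap-bench-{day.isoformat()}-{seq:02d}"


def percentile(samples, p: int):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the statistics listed in the fixed conditions."""
    stats = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 95, 99)}
    stats["min"] = min(latencies_ms)
    stats["max"] = max(latencies_ms)
    return stats


run_id = make_run_id(date(2026, 5, 25), 1)
print(run_id, summarize([float(v) for v in range(1, 101)]))
```

With only 50 iterations per data point, p99 under nearest-rank is simply the largest sample, so tail figures from small runs are best read as worst-case observations.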

Hypothesis: Throughput Capacity vs Practical Concurrency Point

At 128 MBps, concurrency=10 was the practical upper limit in this specific test environment (1 MB objects, single Lambda invocation pattern, no concurrent NFS/SMB workload). Based on that observation, I hypothesize:

The practical concurrency point may shift upward as FSx throughput capacity increases.

FSx Capacity	Predicted Practical Concurrency	Rationale
128 MBps	10 (observed)	Baseline — P99 exceeded 420 ms
256 MBps	~15-25	Sub-linear scaling is plausible due to ONTAP WAFL overhead and TCP connection management
512 MBps	~25-45	Step-function behavior possible if a different bottleneck emerges

Note: Linear scaling (2x capacity = 2x concurrency) is one possible outcome, but sub-linear or step-function behavior is equally plausible. The actual relationship depends on ONTAP data plane queuing, TCP connection overhead, and whether the bottleneck shifts from throughput to IOPS or latency at higher capacities.

Initial verification was blocked by the S3 AP issue described in Section 4.

Update (2026-05-25): S3 AP access recovered and the benchmarks were completed. See Section 6 for results and hypothesis verification.

4. Operational Discovery: S3 AP Unavailability During Throughput Changes

What Happened

While preparing to run 256 MBps benchmarks, I changed the FSx throughput capacity from 128 to 256 MBps. After the change completed successfully:

All S3 Access Points on the file system returned ServiceUnavailable
All SVMs were affected (not just one)
Reverting to 128 MBps did not immediately restore S3 AP access
The file system itself remained AVAILABLE throughout

Timeline

Time	Event
T+0	`update-file-system` ThroughputCapacity 128 → 256
T+25 min	Change completed (256 MBps confirmed)
T+25+ min	All S3 APs return ServiceUnavailable
T+40 min	Revert initiated (256 → 128)
T+65 min	Revert completed, S3 APs still unavailable

Impact and Recommendation

This is now tracked as an AWS Support case. Key takeaways:

Plan throughput capacity changes during maintenance windows. S3 AP workloads may be disrupted for an extended period.

Unlike standard S3 buckets, FSx for ONTAP S3 AP availability can be affected by FSx file system operational changes such as throughput capacity updates.

Important context: AWS documentation states that NFS/SMB access typically remains available during throughput capacity changes. The S3 AP disruption I observed appears to be specific to the S3 Access Point data plane — not the file system's NFS/SMB data LIFs. This distinction matters for environments that use both protocols.

For regulated environments (FISC, healthcare, government): Throughput capacity changes must be included in change management procedures. If S3 AP-based workloads have SLA requirements, the change should be approved through the organization's change advisory board with documented rollback procedures.

This finding has been added to the Production Readiness document as a Level 3+ operational consideration.

5. DEV.to Series Cleanup

Phase 14 also cleaned up the article series:

Update Notes added to Phase 1, 9, 10, 12 articles linking to Phase 13
Permission-Aware RAG articles moved to a separate series (was incorrectly mixed into FSx S3AP series)
Series now has 14 articles in "FSx for ONTAP S3 Access Points" (down from 16 after RAG separation)

Blockers and Current Status

Item	Blocker	Resolution
~~256/512 MBps benchmark~~	~~S3 AP ServiceUnavailable~~	✅ Resolved 2026-05-25 — Results below
FC1 Recovery Metrics	FlexCache × S3 AP integration	Pending AWS feature availability
~~Hypothesis verification~~	~~Depends on benchmark~~	✅ Partially confirmed — see results

6. Benchmark Results: 128 / 256 / 512 MBps Concurrency Comparison

S3 AP ServiceUnavailable was resolved on 2026-05-25. We immediately executed the planned benchmark across all three throughput tiers.

Test Environment

Note on methodology divergence: The benchmark methodology (Section 3) defines 1769 MB Lambda + 50 iterations as the standard. The Internet tests below used a macOS client + 10 iterations per concurrency level due to the initial exploratory nature of these measurements. The Lambda egress test (Section 7) follows the 1769 MB / 50 iteration standard. Treat Internet results as directional sizing guidance, not as statistically rigorous benchmarks.

Parameter	Value
Region	ap-northeast-1 (Tokyo)
FSx for ONTAP	Single-AZ, First-generation
S3 AP	NetworkOrigin=Internet
Client	macOS, boto3, Python 3.9 (public Internet)
Object sizes	1 KB, 100 KB, 1 MB
Concurrency	1, 5, 10, 20, 50
Iterations	10 per concurrency level (exploratory)

Key Results: 1 MB GetObject P99

Concurrency	128 MBps	256 MBps	512 MBps
1	76 ms	93 ms	96 ms
5	160 ms	175 ms	308 ms
10	239 ms	236 ms	229 ms
20	981 ms	481 ms	738 ms
50	—	850 ms	4,495 ms

Analysis

P50 (median) is largely independent of throughput capacity — Internet baseline latency (connection + TLS) dominates
P99 (tail latency) shows the difference — 128→256 MBps improved P99 by 51% at concurrency=20
512 MBps shows no improvement over 256 MBps via Internet — client-side bandwidth (~100 Mbps) becomes the bottleneck
Hypothesis partially confirmed: Practical concurrency point does shift with throughput capacity, but the relationship is non-linear and bounded by client bandwidth in Internet-origin tests
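
The percentage in the analysis can be reproduced from the P99 table with a few lines of arithmetic. This sketch copies the 1 MB P99 values from the table above; the concurrency=50 row is left out because 128 MBps has no measurement there.

```python
# concurrency -> {throughput capacity in MBps: 1 MB GetObject P99 in ms}
P99_1MB_MS = {
    1: {128: 76, 256: 93, 512: 96},
    5: {128: 160, 256: 175, 512: 308},
    10: {128: 239, 256: 236, 512: 229},
    20: {128: 981, 256: 481, 512: 738},
}


def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Positive values mean latency went down; negative means it got worse."""
    return (before_ms - after_ms) / before_ms * 100


for concurrency, row in P99_1MB_MS.items():
    step_1 = improvement_pct(row[128], row[256])
    step_2 = improvement_pct(row[256], row[512])
    print(f"c={concurrency:>2}  128->256: {step_1:+.0f}%  256->512: {step_2:+.0f}%")

# At concurrency=20, 128 -> 256 MBps: (981 - 481) / 981 rounds to 51%.
```

Running it also shows negative values at low concurrency, which fits the reading that 10-iteration Internet measurements are noisy and only the concurrency=20 gap is large enough to interpret.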

Sizing Guidance

Workload	128 MBps	256 MBps	512 MBps
Small files (< 10 KB)	MaxConcurrency=20	50	50
Medium files (100 KB)	10	20	50
Large files (1 MB+)	5	10	20

These are sizing references from a specific test environment, not service limits. A VPC-internal Lambda + VPC-origin S3 AP path is expected to reduce public Internet overhead, but remains untested and must be validated separately. Always validate with your own workload profile.
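
If you want to apply the table programmatically, for example when generating deployment parameters, a lookup along these lines is enough. The size-class boundaries are an assumption: the table only names < 10 KB, 100 KB, and 1 MB+, so anything above 100 KB is treated here as "large", the more conservative choice.

```python
# size class -> {throughput capacity in MBps: starting MaxConcurrency}
SIZING = {
    "small": {128: 20, 256: 50, 512: 50},
    "medium": {128: 10, 256: 20, 512: 50},
    "large": {128: 5, 256: 10, 512: 20},
}


def size_class(object_bytes: int) -> str:
    # Assumed boundaries; the table leaves gaps between its three classes.
    if object_bytes < 10 * 1024:
        return "small"
    if object_bytes <= 100 * 1024:
        return "medium"
    return "large"


def starting_max_concurrency(object_bytes: int, throughput_mbps: int) -> int:
    """Return a starting point to validate, not a service limit."""
    return SIZING[size_class(object_bytes)][throughput_mbps]


print(starting_max_concurrency(1024 * 1024, 128))
```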

What This Means for Production

For PoC (128 MBps): Keep Step Functions Map state MaxConcurrency ≤ 5 for 1 MB+ files
For Production (256+ MBps): MaxConcurrency=10-20 is a reasonable starting range for most workloads, subject to validation against your own profile
For VPC-internal Lambda (untested): Expected to further reduce latency by eliminating public Internet path, but requires VPC-origin S3 AP (not yet measured)
Throughput capacity changes: Plan during maintenance windows (S3 AP disruption risk confirmed)
Small files (1 KB): Increasing throughput capacity showed no latency benefit in these tests because the bottleneck is connection overhead, not bandwidth. Metadata-heavy workloads can save cost by staying at 128 MBps

7. Lambda Egress Path Benchmark: Reducing Connection Overhead

Terminology clarification: This test uses a VPC-external Lambda (no VpcConfig) accessing an Internet-origin S3 AP. The Lambda egress path goes through AWS-managed networking, which is faster than public Internet but is NOT a VPC-internal path. A true VPC-internal test would require a VPC-origin S3 AP + VPC-internal Lambda — that remains untested.

Path	Status	What it measures
Public Internet client → Internet-origin S3 AP	✅ Measured (Section 6)	End-user/CI baseline
VPC-external Lambda → Internet-origin S3 AP	✅ Measured (this section)	AWS-managed Lambda egress, NOT VPC-private
VPC-internal Lambda → VPC-origin S3 AP	❌ Not yet measured	True private path (requires new AP)

We deployed a benchmark Lambda (1769 MB, ARM64, no VpcConfig) to measure GetObject latency via AWS-managed Lambda egress.
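
For readers who want to run a comparable measurement, here is a minimal sketch of a concurrency harness. It is not the benchmark Lambda from the repository: `fetch` is an injected callable (hypothetical name) so the same loop can wrap a boto3 `get_object` call inside Lambda or a stub in a unit test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fetch) -> float:
    """Run fetch() once and return elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    fetch()
    return (time.perf_counter() - start) * 1000.0


def run_level(fetch, concurrency: int, iterations: int):
    """Issue `iterations` rounds of `concurrency` parallel calls; return all latencies."""
    latencies = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(iterations):
            futures = [pool.submit(timed_call, fetch) for _ in range(concurrency)]
            latencies.extend(f.result() for f in futures)
    return latencies


# Stub workload so the sketch runs anywhere. Inside Lambda, fetch could be:
#   lambda: s3.get_object(Bucket=ap_alias, Key=key)["Body"].read()
samples = run_level(lambda: time.sleep(0.001), concurrency=5, iterations=3)
print(len(samples), "samples, max", round(max(samples), 1), "ms")
```

Reading the response body inside `fetch` matters: `get_object` returns as soon as headers arrive, so timing it alone would understate latency for 1 MB objects.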

Lambda Egress vs Internet: 1 MB GetObject P50

Concurrency	Internet P50	Lambda P50	Improvement
1	68 ms	62 ms	9%
5	117 ms	61 ms	48%
10	175 ms	73 ms	58%
20	256 ms	122 ms	52%
50	N/A	128 ms	—

Key Findings

P50 improved by 48-58% at concurrency 5-20: Lambda egress avoids the public Internet path and its per-connection overhead
P99 remains high (~1 s): even from Lambda, concurrency=20 shows a P99 of 1,318 ms, which points to queuing inside the S3 AP data plane
P50 at concurrency=50 is only 128 ms: median latency stays nearly flat between concurrency 20 (122 ms) and 50
The bottleneck appears to be the FSx for ONTAP S3 AP data plane, not Lambda network bandwidth

Production Sizing (Lambda)

Workload	Recommended MaxConcurrency	Expected P50	Expected P99
Small files (1 KB)	50	~63 ms	~994 ms
Medium files (100 KB)	20	~79 ms	~1,044 ms
Large files (1 MB)	10	~73 ms	~928 ms

Set Lambda timeout to 30s+ and use Step Functions Retry to handle P99 spikes. These results are from VPC-external Lambda (AWS-managed egress), not true VPC-internal path.
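
A sketch of what the timeout and retry guidance could look like as a Step Functions Map state, built as a Python dict so it can be serialized into a state machine definition. The state name `ProcessObject`, the function ARN, the 60-second task timeout, and the retry values are all illustrative assumptions, not settings taken from the repository's templates.

```python
import json


def map_state(max_concurrency: int, function_arn: str) -> dict:
    """Build an inline Map state that retries slow or failed object reads."""
    return {
        "Type": "Map",
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "ProcessObject",
            "States": {
                "ProcessObject": {
                    "Type": "Task",
                    "Resource": function_arn,
                    # Assumed value; keep it above the Lambda timeout (30 s or more).
                    "TimeoutSeconds": 60,
                    "Retry": [
                        {
                            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                            "IntervalSeconds": 2,
                            "MaxAttempts": 3,
                            "BackoffRate": 2.0,
                        }
                    ],
                    "End": True,
                }
            },
        },
        "End": True,
    }


# Hypothetical ARN; MaxConcurrency=10 is the 1 MB row of the table above.
state = map_state(10, "arn:aws:lambda:ap-northeast-1:111122223333:function:example")
print(json.dumps(state, indent=2))
```

Exponential backoff spreads retries out, which helps if the P99 spikes come from queuing in the S3 AP data plane, since immediate retries would add to the same queue.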

8. SQS Replay Storm Simulation: Zero Message Loss Under Load

Scope clarification: This test validates the downstream SQS ingestion and Lambda consumer drain path under replay-like burst conditions. It does NOT validate ONTAP Persistent Store buffering or FPolicy TCP-level server reconnection replay. Those require a live FPolicy server environment (future work).

We simulated FPolicy server reconnection by injecting 1,000 and 10,000 events directly into SQS, mimicking the burst that occurs when a Persistent Store replays buffered events after server reconnection.
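
The injection step can be sketched as follows. This is not the repository's simulator: the event payload shape is hypothetical, and the SQS client is passed in so the batching logic can be exercised with a stub. The one hard constraint it encodes is that SendMessageBatch accepts at most 10 entries per call.

```python
import json


def to_batches(events, batch_size: int = 10):
    """Split events into SendMessageBatch-sized groups (at most 10 entries each)."""
    for start in range(0, len(events), batch_size):
        chunk = events[start:start + batch_size]
        yield [
            {"Id": str(start + i), "MessageBody": json.dumps(event)}
            for i, event in enumerate(chunk)
        ]


def inject(sqs_client, queue_url: str, events):
    """Send all events and return (sent, failed) counts from the API responses."""
    sent = failed = 0
    for entries in to_batches(events):
        resp = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        sent += len(resp.get("Successful", []))
        failed += len(resp.get("Failed", []))
    return sent, failed


class StubSQS:
    """Stand-in for boto3.client("sqs") that reports every entry as successful."""

    def __init__(self):
        self.calls = 0

    def send_message_batch(self, QueueUrl, Entries):
        self.calls += 1
        return {"Successful": [{"Id": e["Id"]} for e in Entries]}


stub = StubSQS()
# Hypothetical payload shape for a replayed file event.
events = [{"seq": n, "path": f"/vol1/file-{n}.dat"} for n in range(1000)]
print(inject(stub, "https://example.invalid/queue", events), stub.calls)
```

Counting the `Failed` list is what makes a 0% loss figure meaningful: a batch call can return HTTP 200 while rejecting individual entries.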

Results

Scenario	Events	Loss Rate	Throughput	Batch P99
5 min downtime	1,000	0%	188 eps	177 ms
30 min downtime	10,000	0%	464 eps	79 ms
Consumer drain	1,000	0%	341 msgs/sec	85 ms

SLO Validation

SLO Metric	Threshold	Observed	Status
Event loss rate	< 0.1%	0%	✅
Injection throughput	> 100 eps	464 eps	✅
Consumer drain rate	> injection rate	341 > 188	✅
Batch latency P99	< 200 ms	79 ms	✅
DLQ messages	0	0	✅

Implications

A 30-minute outage at a sustained 464 eps would accumulate ~835K events (464 × 1,800 s). With Lambda auto-scaling to 10 consumers, and assuming each drains at the measured 341 msgs/sec, the backlog clears in under 5 minutes.
Persistent Store sizing estimate: Based on simulated event payload size, 10K events ≈ 5 MB. Real ONTAP Persistent Store sizing must be validated with live FPolicy replay.
No backpressure issues: SQS Standard queue handles burst without message loss.
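
The back-of-envelope numbers above can be checked with a short calculation. It assumes that each consumer sustains the measured 341 msgs/sec, that drain rate scales linearly with consumer count, and that payloads average about 500 bytes (10K events ≈ 5 MB); none of these has been validated against a live FPolicy replay.

```python
INJECTION_EPS = 464           # measured injection throughput
DRAIN_PER_CONSUMER = 341      # measured drain rate, assumed to hold per consumer
MB_PER_10K_EVENTS = 5.0       # from the simulated payload size


def backlog_events(downtime_minutes: float, eps: float = INJECTION_EPS) -> float:
    return downtime_minutes * 60 * eps


def drain_seconds(events: float, consumers: int) -> float:
    return events / (consumers * DRAIN_PER_CONSUMER)


def store_size_mb(events: float) -> float:
    return events / 10_000 * MB_PER_10K_EVENTS


backlog = backlog_events(30)
print(f"backlog: {backlog:,.0f} events")
print(f"drain with 10 consumers: {drain_seconds(backlog, 10) / 60:.1f} min")
print(f"estimated buffered payload: {store_size_mb(backlog):.0f} MB")
```

Under the same assumptions, a single consumer would need roughly 41 minutes for the same backlog, which is why consumer concurrency is the setting to confirm before relying on the under-5-minute figure.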

9. Operational Runbook: S3 AP Disruption Response

When S3 AP becomes unavailable (e.g., during throughput capacity changes):

Check S3 AP health: ListObjectsV2 / GetObject against S3 AP alias
Check NFS/SMB separately: mount + read test (may still be functional)
Check FSx file system status: describe-file-systems → Lifecycle
Check CloudWatch alarms: Lambda errors, Step Functions failures
Pause ingestion: Disable EventBridge Schedules for affected UC pipelines
Wait and retry: S3 AP recovery may take 15-60 min after throughput changes
Escalate: If unavailable > 60 min, contact AWS Support with file system ID
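
The first checks in the runbook lend themselves to a small script. The sketch below takes boto3-style clients as arguments and uses stubs in place of real AWS calls; the function names and the wording of the returned messages are illustrative. The NFS/SMB and CloudWatch checks are not covered.

```python
def probe_access_point(s3_client, alias: str) -> str:
    """Return "ok", or the error code from a minimal ListObjectsV2 call."""
    try:
        s3_client.list_objects_v2(Bucket=alias, MaxKeys=1)
        return "ok"
    except Exception as exc:  # botocore's ClientError exposes .response
        error = getattr(exc, "response", {}).get("Error", {})
        return error.get("Code", type(exc).__name__)


def file_system_lifecycle(fsx_client, file_system_id: str) -> str:
    resp = fsx_client.describe_file_systems(FileSystemIds=[file_system_id])
    return resp["FileSystems"][0]["Lifecycle"]


def triage(s3_client, fsx_client, alias: str, file_system_id: str) -> str:
    ap_status = probe_access_point(s3_client, alias)
    lifecycle = file_system_lifecycle(fsx_client, file_system_id)
    if ap_status == "ok":
        return "healthy"
    if lifecycle == "AVAILABLE":
        # The pattern described in Section 4: file system fine, S3 AP failing.
        return f"S3 AP disruption ({ap_status}): pause ingestion, retry, escalate after 60 min"
    return f"file system is {lifecycle}: wait for the operation to finish, then re-probe"


class StubError(Exception):
    response = {"Error": {"Code": "ServiceUnavailable"}}


class StubS3:
    def list_objects_v2(self, Bucket, MaxKeys):
        raise StubError()


class StubFSx:
    def describe_file_systems(self, FileSystemIds):
        return {"FileSystems": [{"Lifecycle": "AVAILABLE"}]}


# Hypothetical alias and file system ID.
print(triage(StubS3(), StubFSx(), "example-alias", "fs-0123456789abcdef0"))
```

Probing through the same alias your pipelines use keeps the check honest, since a file system reporting AVAILABLE did not imply working S3 AP access in the incident above.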

See Incident Response Playbook for full procedures.

What's Next (Phase 15 candidates)

Update: Phase 15 expanded the pattern library from 17 to 28 industry-specific use cases. Items 1-2 below remain pending AWS feature availability. Items 3-4 are carried forward in Phase 15's What's Next.

1. FlexCache × S3 AP integration: pending AWS feature availability (not yet supported)
2. FC1 Recovery Metrics: route decision latency, cache health detection, failover timing (depends on #1)
3. Replay Storm with real FPolicy server: TCP-level replay characteristics (requires ECS re-deploy)
4. VPC-internal Lambda with VPC Origin S3 AP: true VPC-internal path (requires new AP with NetworkOrigin=VPC)

Multi-Account OAM validation completed 2026-05-25 — cross-account CloudWatch Metrics, Logs, and X-Ray Traces confirmed working.

Field feedback tracked: S3 AP disruption during throughput change, presigned URL documentation gap, VPC-origin benchmark gap, and FlexCache × S3 AP feature dependency. Details in docs/ontap-integration-notes.md.

Stats

Files changed: 200+ (documentation, translations, shared modules, templates)
New documents: Partner/SI one-pager (JP/EN/KO/ZH-CN), cost calculator, customization guide, incident response playbook, demo mode guide, comparison alternatives, PoC Go/No-Go template
New shared modules: data_classification.py, human_review.py, schemas/events.py
Benchmark runs: 7 (128/256/512 MBps Internet × 2 file sizes + Lambda egress + SQS replay storm simulation)
Templates fixed: 5 (cfn-lint errors: RecursiveDeleteOption, SNSPublishMessagePolicy, Handler path)
Translations added: 20 files (FC1-FC6 ko/zh-CN + FC1/FC3 full 8-lang)
samconfig.toml.example: 24 patterns
Output JSON samples: 24 patterns
DEV.to articles updated: 6 (4 Update Notes + 2 Series changes)
AWS Support cases: 1 resolved (S3 AP ServiceUnavailable — throughput change related)
Operational discoveries: 1 (throughput change → S3 AP disruption, now resolved)
Cost savings: ~$346/month (v4-test-demo + FPolicy server + VPC Endpoints + EC2 shutdown)
SQS Replay Storm Simulation: 10,000 events, 0% loss in downstream SQS/consumer path

Who Should Care About Phase 14?

Partners and SIs get a one-pager for first conversations and recommended questions for each FC pattern
Operations teams learn that throughput capacity changes can disrupt S3 AP access
Architects get standardized benchmark methodology with hypothesis-driven testing
Developers get Presigned URL clarification — works but don't depend on it
Standard S3 bucket users learn where FSx for ONTAP S3 AP differs from S3 bucket semantics, especially presigned URLs, availability, and operational dependencies
Serverless-first teams learn where the serverless processing plane ends and FSx for ONTAP operational considerations begin

What You Can Do Today

Phase 14 delivers immediately usable assets:

Use the Partner/SI one-pager for your next customer conversation about FSx for ONTAP + serverless
Check the S3AP Compatibility Notes for the latest Presigned URL and troubleshooting guidance
Plan throughput changes carefully — add S3 AP health checks to your maintenance runbook
Use the Sizing Guidance tables (Sections 6-7) to set MaxConcurrency for your workload
Review the S3 Bucket User Guide before porting existing S3 applications to FSx for ONTAP S3 AP
Review the ONTAP Integration Notes before attaching S3 AP workflows to production SVMs and volumes

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Full series: FSx for ONTAP S3 Access Points on DEV.to