Evidence Expansion, Presigned URL Discovery, and Operational Surprises — FSx for ONTAP S3 Access Points, Phase 14

TL;DR

Phase 14 shifts from building patterns to hardening the evidence base. After publishing Phase 13's field-ready reference architecture, I focused on post-publication refinement: Partner/SI delivery assets, benchmark methodology standardization, S3 AP compatibility clarification (presigned URLs work even though the documentation lists them as not supported), and an unexpected operational discovery: S3 Access Points can become unavailable when the FSx throughput capacity is changed.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


Why Phase 14?

Phase 13 delivered the field-ready baseline. Phase 14 answers the question: "Now that the patterns exist, how do we make them easier to evaluate, adopt, and operate?"

The work falls into four categories:

  1. Partner/SI delivery acceleration — one-pager, improved PoC templates, FC1-FC6 conversation starters
  2. Benchmark methodology — standardized run IDs, hypothesis-driven testing, Range GET plans
  3. Compatibility clarification — Presigned URL behavior confirmed with AWS Support
  4. Operational discovery — S3 AP unavailability during throughput capacity changes

1. Partner/SI One-Pager: What / When / How / Where

Partners and SIs told us the existing 7-step delivery checklist was comprehensive but too long for a first conversation. Phase 14 adds a single-page overview that answers four questions:

Section | Content
What | 28 UCs + 6 FC patterns, CloudFormation templates, 4-level maturity model
When | Customer has FSx for ONTAP + needs serverless file processing + permission-aware access
How | Identify UC → Deploy template → Measure baseline → Evaluate Go/No-Go
Where | Links to Success Metrics, Governance, Production Readiness, Benchmarks

Available in both Japanese and English.

FC1-FC6 Recommended First Questions

Each FlexCache/FlexClone pattern now has a recommended first conversation question — the question a Partner/SI should ask to determine if the pattern is relevant:

Pattern | First Question
FC1 (Anycast/DR) | "What is your current read latency from remote sites, and what target would justify a caching layer?"
FC2 (Render) | "How many concurrent render jobs share the same source data, and what is the job lifecycle?"
FC3 (RAG) | "Which file shares contain the knowledge base, and do access permissions need to be preserved in RAG results?"
FC4 (CAE) | "What is the typical solver output size and how quickly must results be available for post-processing?"
FC5 (Life Sciences) | "How do you currently share research datasets between teams while maintaining data governance?"
FC6 (Gaming) | "What is your current build pipeline duration and which asset validation steps are bottlenecks?"

2. Presigned URLs: "Not Supported" but Working

⚠️ Production Warning: AWS Support explicitly states that operations marked "Not supported" should NOT be relied upon for production workloads, even when they return success today. The behavior may change without deprecation notice, return inconsistent results across regions, or stop working after service updates. Design alternatives for any workflow that requires presigned URL access to FSx for ONTAP S3 Access Points.

The Discovery

The FSx for ONTAP S3 AP compatibility table lists Presign — Not supported. However, testing showed presigned URLs for GetObject work successfully.

AWS Support Clarification

After raising this with AWS Support, the explanation was clear:

  1. Presigning is client-side only: aws s3 presign computes a SigV4 signature locally. No network request is made.
  2. The presigned URL executes a standard GetObject — signature is in query parameters instead of the Authorization header.
  3. Since GetObject is Supported, presigned URLs cannot be blocked without breaking GetObject itself.
  4. The documentation likely intended to indicate that presigned URL workflows are not officially tested.
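
To make points 1 and 2 concrete, here is a minimal sketch of SigV4 query-string presigning using only the Python standard library. It is not how the AWS CLI or boto3 is implemented; it only illustrates that the URL is computed locally, with no network I/O. The host name, object key, and credentials are hypothetical, and the sketch assumes long-term credentials (no session token). In practice you would call boto3's `generate_presigned_url` instead.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_get(host, object_key, region, access_key, secret_key,
                expires=900, now=None):
    """Build a SigV4 presigned GET URL. Pure computation: no request is sent."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"

    # The signature travels in query parameters, not in an Authorization header.
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    path = "/" + quote(object_key, safe="/")

    canonical_request = "\n".join(
        ["GET", path, query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _sign(("AWS4" + secret_key).encode("utf-8"), datestamp)
    for part in (region, "s3", "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    return f"https://{host}{path}?{query}&X-Amz-Signature={signature}"
```

Whoever later issues a GET against the returned URL is performing an ordinary GetObject, which is why the service cannot reject presigned requests without also rejecting GetObject.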

Production Guidance

Feature | Status | Guidance
GetObject, PutObject, ListObjectsV2 | Supported | Build on freely
Conditional writes (If-None-Match) | Blocked | Returns NotImplemented
Presigned URLs | Not supported (doc) | Works but do not rely on for production

AWS Support has escalated documentation clarification to the FSx for ONTAP service team. The distinction between "Not supported + hard-blocked" (returns error) and "Not supported + may incidentally work" (no guarantees) is being reviewed.

Readers should verify the latest AWS documentation before relying on this behavior, as the status may change.

Alternatives to Presigned URLs

If you need time-limited or delegated file access without relying on unsupported behavior:

  • API Gateway + Lambda proxy with IAM or JWT authorization
  • CloudFront signed URLs backed by a controlled Lambda@Edge origin
  • Temporary STS credentials with scoped IAM permissions (time-limited, per-object or prefix)
  • Application-level download broker with audit logging and access revocation
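
As one illustration of the temporary-credentials option, the sketch below builds a session policy limited to GetObject under a single prefix and passes it to STS `assume_role`. The role ARN, access point ARN, and prefix are hypothetical, and the resource format assumes the standard S3 access point object ARN (`.../accesspoint/<name>/object/<key>`) also applies to FSx for ONTAP S3 Access Points, which I have not verified. The STS client is passed in (for example `boto3.client("sts")`) so the policy logic can be exercised without AWS access.

```python
import json


def scoped_read_policy(access_point_arn: str, prefix: str) -> dict:
    """Session policy allowing GetObject only under one prefix (assumed ARN format)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"{access_point_arn}/object/{prefix}*",
        }],
    }


def issue_temporary_credentials(sts_client, role_arn, access_point_arn,
                                prefix, duration_seconds=900):
    """Return short-lived credentials narrowed by the session policy."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3ap-delegated-read",  # hypothetical session name
        Policy=json.dumps(scoped_read_policy(access_point_arn, prefix)),
        DurationSeconds=duration_seconds,  # 900 s is the STS minimum
    )
    return response["Credentials"]
```

The effective permissions are the intersection of the role's own policy and the session policy, so the role itself still needs GetObject on the access point.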

For a broader comparison with standard S3 bucket semantics, see the S3 Bucket User Guide.


3. Benchmark Methodology: Hypothesis-Driven Testing

Why 1769 MB Lambda Memory?

Lambda allocates CPU in proportion to configured memory, and network bandwidth appears to scale with it as well. At 1,769 MB a function has the equivalent of one vCPU, which gives consistent and reproducible network throughput for benchmark measurements. Lower memory settings can introduce variable network bandwidth as a confounding factor.

Benchmark Run ID Convention

Every benchmark run now follows a standardized format:

s3ap-bench-{YYYY-MM-DD}-{seq}

With mandatory fixed conditions:

Region: ap-northeast-1
Lambda memory: 1769 MB (1 vCPU)
Lambda architecture: arm64
FSx Throughput Capacity: [128 / 256 / 512] MBps
Iterations per data point: 50
Statistics: p50, p90, p95, p99, min, max
Concurrent NFS/SMB workload: [None / Light / Production-level]
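
A small sketch of how the run ID and fixed conditions could be captured in code so that every run records them the same way. The class and function names are hypothetical, and the three-digit zero padding of `{seq}` is an assumption; the convention above does not specify a width.

```python
from dataclasses import asdict, dataclass
from datetime import date


def make_run_id(run_date: date, seq: int) -> str:
    """Format s3ap-bench-{YYYY-MM-DD}-{seq}; zero padding is an assumption."""
    return f"s3ap-bench-{run_date.isoformat()}-{seq:03d}"


@dataclass(frozen=True)
class FixedConditions:
    fsx_throughput_mbps: int                  # 128, 256, or 512
    concurrent_workload: str = "None"         # None / Light / Production-level
    region: str = "ap-northeast-1"
    lambda_memory_mb: int = 1769
    lambda_architecture: str = "arm64"
    iterations_per_data_point: int = 50
    statistics: tuple = ("p50", "p90", "p95", "p99", "min", "max")

    def __post_init__(self):
        if self.fsx_throughput_mbps not in (128, 256, 512):
            raise ValueError("throughput capacity must be 128, 256, or 512 MBps")
        if self.concurrent_workload not in ("None", "Light", "Production-level"):
            raise ValueError("unknown concurrent workload level")


# Example: metadata record for the first run of the day at 256 MBps.
record = {"run_id": make_run_id(date(2026, 5, 25), 1),
          **asdict(FixedConditions(fsx_throughput_mbps=256))}
```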

Hypothesis: Throughput Capacity vs Practical Concurrency Point

At 128 MBps I observed concurrency=10 as the practical upper limit in this specific test environment (1 MB objects, single Lambda invocation pattern, no concurrent NFS/SMB workload). Based on that observation, I hypothesize:

Practical concurrency point may shift with FSx throughput capacity increase.

FSx Capacity | Predicted Practical Concurrency | Rationale
128 MBps | 10 (observed) | Baseline; P99 exceeded 420 ms
256 MBps | ~15-25 | Sub-linear scaling is plausible due to ONTAP WAFL overhead and TCP connection management
512 MBps | ~25-45 | Step-function behavior possible if a different bottleneck emerges

Note: Linear scaling (2x capacity = 2x concurrency) is one possible outcome, but sub-linear or step-function behavior is equally plausible. The actual relationship depends on ONTAP data plane queuing, TCP connection overhead, and whether the bottleneck shifts from throughput to IOPS or latency at higher capacities.

Initial verification was blocked by the S3 AP issue described in Section 4. Update (2026-05-25): S3 AP access recovered and the benchmarks were completed. See Section 6 for results and hypothesis verification.


4. Operational Discovery: S3 AP Unavailability During Throughput Changes

What Happened

While preparing to run 256 MBps benchmarks, I changed the FSx throughput capacity from 128 to 256 MBps. After the change completed successfully:

  • All S3 Access Points on the file system returned ServiceUnavailable
  • All SVMs were affected (not just one)
  • Reverting to 128 MBps did not immediately restore S3 AP access
  • The file system itself remained AVAILABLE throughout

Timeline

Time | Event
T+0 | update-file-system ThroughputCapacity 128 → 256
T+25 min | Change completed (256 MBps confirmed)
T+25+ min | All S3 APs return ServiceUnavailable
T+40 min | Revert initiated (256 → 128)
T+65 min | Revert completed, S3 APs still unavailable

Impact and Recommendation

This is now tracked as an AWS Support case. Key takeaways:

Plan throughput capacity changes during maintenance windows. S3 AP workloads may be disrupted for an extended period.

Unlike standard S3 buckets, FSx for ONTAP S3 AP availability can be affected by FSx file system operational changes such as throughput capacity updates.

Important context: AWS documentation states that NFS/SMB access typically remains available during throughput capacity changes. The S3 AP disruption I observed appears to be specific to the S3 Access Point data plane — not the file system's NFS/SMB data LIFs. This distinction matters for environments that use both protocols.

For regulated environments (FISC, healthcare, government): Throughput capacity changes must be included in change management procedures. If S3 AP-based workloads have SLA requirements, the change should be approved through the organization's change advisory board with documented rollback procedures.

This finding has been added to the Production Readiness document as a Level 3+ operational consideration.


5. DEV.to Series Cleanup

Phase 14 also cleaned up the article series:

  • Update Notes added to Phase 1, 9, 10, 12 articles linking to Phase 13
  • Permission-Aware RAG articles moved to a separate series (was incorrectly mixed into FSx S3AP series)
  • Series now has 14 articles in "FSx for ONTAP S3 Access Points" (down from 16 after RAG separation)

Blockers and Current Status

Item | Blocker | Resolution
256/512 MBps benchmark | S3 AP ServiceUnavailable | Resolved 2026-05-25; results below
FC1 Recovery Metrics | FlexCache × S3 AP integration | Pending AWS feature availability
Hypothesis verification | Depends on benchmark | Partially confirmed; see results

6. Benchmark Results: 128 / 256 / 512 MBps Concurrency Comparison

S3 AP ServiceUnavailable was resolved on 2026-05-25. We immediately executed the planned benchmark across all three throughput tiers.

Test Environment

Note on methodology divergence: The benchmark methodology (Section 3) defines 1769 MB Lambda + 50 iterations as the standard. The Internet tests below used a macOS client + 10 iterations per concurrency level because these measurements were initially exploratory. The Lambda egress test (Section 7) follows the 1769 MB / 50 iteration standard. Treat Internet results as directional sizing guidance, not as statistically rigorous benchmarks.

Parameter | Value
Region | ap-northeast-1 (Tokyo)
FSx for ONTAP | Single-AZ, First-generation
S3 AP | NetworkOrigin=Internet
Client | macOS, boto3, Python 3.9 (public Internet)
Object sizes | 1 KB, 100 KB, 1 MB
Concurrency | 1, 5, 10, 20, 50
Iterations | 10 per concurrency level (exploratory)
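
For readers who want to reproduce a comparable measurement, the following is a minimal sketch of a concurrency benchmark loop, not the script used for the results below. It assumes one "iteration" means one round of N parallel requests, uses a nearest-rank percentile, and takes the request as an injected callable so the harness itself has no AWS dependency. All function names are hypothetical.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_ms, p):
    """Nearest-rank percentile of an ascending list of latencies."""
    rank = max(1, math.ceil(p * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def summarize(latencies_ms):
    """Reduce raw samples to the statistics listed in the methodology."""
    data = sorted(latencies_ms)
    stats = {f"p{p}": percentile(data, p) for p in (50, 90, 95, 99)}
    stats["min"], stats["max"] = data[0], data[-1]
    return stats


def timed_call(fetch):
    """Run one request and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    fetch()
    return (time.perf_counter() - start) * 1000.0


def run_level(fetch, concurrency, iterations):
    """Issue `iterations` rounds of `concurrency` parallel requests."""
    latencies = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(iterations):
            futures = [pool.submit(timed_call, fetch) for _ in range(concurrency)]
            latencies.extend(f.result() for f in futures)
    return summarize(latencies)


# With boto3 installed, `fetch` could be (alias and key are placeholders):
#   s3 = boto3.client("s3")
#   fetch = lambda: s3.get_object(Bucket=ALIAS, Key=KEY)["Body"].read()
```

Reading the body inside `fetch` matters: `get_object` returns once headers arrive, so timing without `.read()` would understate latency for 1 MB objects.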

Key Results: 1 MB GetObject P99

Concurrency 128 MBps 256 MBps 512 MBps
1 76 ms 93 ms 96 ms
5 160 ms 175 ms 308 ms
10 239 ms 236 ms 229 ms
20 981 ms 481 ms 738 ms
50 850 ms 4,495 ms

Analysis

  1. P50 (median) is largely independent of throughput capacity — Internet baseline latency (connection + TLS) dominates
  2. P99 (tail latency) shows the difference — 128→256 MBps improved P99 by 51% at concurrency=20
  3. 512 MBps shows no improvement over 256 MBps via Internet — client-side bandwidth (~100 Mbps) becomes the bottleneck
  4. Hypothesis partially confirmed: Practical concurrency point does shift with throughput capacity, but the relationship is non-linear and bounded by client bandwidth in Internet-origin tests
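
The percentages in the analysis are relative reductions in latency. As a quick check, using the concurrency=20 P99 values from the table above:

```python
def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction in percent; negative means it got slower."""
    return (before_ms - after_ms) / before_ms * 100.0


# 1 MB GetObject P99 at concurrency=20, taken from the results table.
p99_128, p99_256, p99_512 = 981, 481, 738

print(round(improvement_pct(p99_128, p99_256)))  # 51
print(round(improvement_pct(p99_256, p99_512)))  # -53
```

The second value is negative: at this concurrency the 512 MBps run was slower than the 256 MBps run, which fits the client-bandwidth explanation in point 3.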

Sizing Guidance

Workload | 128 MBps | 256 MBps | 512 MBps
Small files (< 10 KB) | MaxConcurrency=20 | 50 | 50
Medium files (100 KB) | 10 | 20 | 50
Large files (1 MB+) | 5 | 10 | 20

These are sizing references from a specific test environment, not service limits. A VPC-internal Lambda + VPC-origin S3 AP path is expected to reduce public Internet overhead, but remains untested and must be validated separately. Always validate with your own workload profile.

What This Means for Production

  • For PoC (128 MBps): Keep Step Functions Map state MaxConcurrency ≤ 5 for 1 MB+ files
  • For Production (256+ MBps): MaxConcurrency=10-20 is safe for most workloads
  • For VPC-internal Lambda (untested): Expected to further reduce latency by eliminating public Internet path, but requires VPC-origin S3 AP (not yet measured)
  • Throughput capacity changes: Plan during maintenance windows (S3 AP disruption risk confirmed)
  • Small files (< 1 KB): Throughput capacity increase has no effect — bottleneck is connection overhead, not bandwidth. Save costs by staying at 128 MBps for metadata-heavy workloads

7. Lambda Egress Path Benchmark: Reducing Connection Overhead

Terminology clarification: This test uses a VPC-external Lambda (no VpcConfig) accessing an Internet-origin S3 AP. The Lambda egress path goes through AWS-managed networking, which is faster than public Internet but is NOT a VPC-internal path. A true VPC-internal test would require a VPC-origin S3 AP + VPC-internal Lambda — that remains untested.

Path | Status | What it measures
Public Internet client → Internet-origin S3 AP | ✅ Measured (Section 6) | End-user/CI baseline
VPC-external Lambda → Internet-origin S3 AP | ✅ Measured (this section) | AWS-managed Lambda egress, NOT VPC-private
VPC-internal Lambda → VPC-origin S3 AP | ❌ Not yet measured | True private path (requires new AP)

We deployed a benchmark Lambda (1769 MB, ARM64, no VpcConfig) to measure GetObject latency via AWS-managed Lambda egress.

Lambda Egress vs Internet: 1 MB GetObject P50

Concurrency | Internet P50 | Lambda P50 | Improvement
1 | 68 ms | 62 ms | 9%
5 | 117 ms | 61 ms | 48%
10 | 175 ms | 73 ms | 58%
20 | 256 ms | 122 ms | 52%
50 | N/A | 128 ms | N/A

Key Findings

  1. P50 dramatically improved at concurrency > 1: Lambda egress eliminates public Internet TCP connection overhead
  2. P99 remains high (~1s): Even from Lambda, concurrency=20 shows P99 of 1,318 ms — this is the S3 AP data plane's internal queuing
  3. concurrency=50 P50 is only 128 ms: Lambda threads are efficient against S3 AP
  4. The bottleneck is the FSx for ONTAP S3 AP data plane, not Lambda network bandwidth

Production Sizing (Lambda)

Workload | Recommended MaxConcurrency | Expected P50 | Expected P99
Small files (1 KB) | 50 | ~63 ms | ~994 ms
Medium files (100 KB) | 20 | ~79 ms | ~1,044 ms
Large files (1 MB) | 10 | ~73 ms | ~928 ms

Set Lambda timeout to 30s+ and use Step Functions Retry to handle P99 spikes. These results are from VPC-external Lambda (AWS-managed egress), not true VPC-internal path.
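
A sketch of how that guidance might look in a Step Functions definition, written as a Python dict that serializes to Amazon States Language. The state names, `$.objects` input path, Lambda ARN, and retry values are illustrative assumptions, not taken from the repository templates. The Lambda timeout itself is set on the function, not in this definition.

```python
import json


def build_map_state(lambda_arn: str, max_concurrency: int) -> dict:
    """Map state that caps fan-out and retries slow or failed invocations."""
    return {
        "Type": "Map",
        "ItemsPath": "$.objects",           # assumed input shape
        "MaxConcurrency": max_concurrency,  # e.g. 10 for 1 MB objects
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "ProcessObject",
            "States": {
                "ProcessObject": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                    # Illustrative retry values to absorb P99 spikes.
                    "Retry": [{
                        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                        "IntervalSeconds": 2,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }],
                    "End": True,
                }
            },
        },
        "End": True,
    }


# Hypothetical ARN; print the fragment for pasting into a state machine.
state = build_map_state(
    "arn:aws:lambda:ap-northeast-1:111122223333:function:example", 10)
print(json.dumps(state, indent=2))
```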


8. SQS Replay Storm Simulation: Zero Message Loss Under Load

Scope clarification: This test validates the downstream SQS ingestion and Lambda consumer drain path under replay-like burst conditions. It does NOT validate ONTAP Persistent Store buffering or FPolicy TCP-level server reconnection replay. Those require a live FPolicy server environment (future work).

We simulated FPolicy server reconnection by injecting 1,000 and 10,000 events directly into SQS, mimicking the burst that occurs when a Persistent Store replays buffered events after server reconnection.

Results

Scenario | Events | Loss Rate | Throughput | Batch P99
5 min downtime | 1,000 | 0% | 188 eps | 177 ms
30 min downtime | 10,000 | 0% | 464 eps | 79 ms
Consumer drain | 1,000 | 0% | 341 msgs/sec | 85 ms

SLO Validation

SLO Metric | Threshold | Observed | Status
Event loss rate | < 0.1% | 0% | Met
Injection throughput | > 100 eps | 464 eps | Met
Consumer drain rate | > injection rate | 341 > 188 | Met
Batch latency P99 | < 200 ms | 79 ms | Met
DLQ messages | 0 | 0 | Met

Implications

  • 30-min downtime accumulates ~835K events at 464 eps. With Lambda auto-scaling (10 consumers), drain completes in < 5 minutes.
  • Persistent Store sizing estimate: Based on simulated event payload size, 10K events ≈ 5 MB. Real ONTAP Persistent Store sizing must be validated with live FPolicy replay.
  • No backpressure issues: SQS Standard queue handles burst without message loss.
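
The backlog and drain estimate in the first bullet can be reproduced with simple arithmetic. This sketch assumes that each of the 10 consumers sustains the measured 341 msgs/sec and that no new events arrive while draining; both are simplifications.

```python
def backlog_events(rate_eps: float, downtime_s: float) -> float:
    """Events accumulated while the consumer side is unavailable."""
    return rate_eps * downtime_s


def drain_seconds(backlog: float, per_consumer_rate: float, consumers: int) -> float:
    """Time to clear the backlog at a constant aggregate drain rate."""
    return backlog / (per_consumer_rate * consumers)


backlog = backlog_events(464, 30 * 60)        # 835,200 events
seconds = drain_seconds(backlog, 341, 10)     # about 245 s
print(f"{backlog:,.0f} events, drained in {seconds / 60:.1f} min")
```

At roughly four minutes this stays inside the "< 5 minutes" estimate, but with little margin if per-consumer throughput drops under real load.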

9. Operational Runbook: S3 AP Disruption Response

When S3 AP becomes unavailable (e.g., during throughput capacity changes):

  1. Check S3 AP health: ListObjectsV2 / GetObject against S3 AP alias
  2. Check NFS/SMB separately: mount + read test (may still be functional)
  3. Check FSx file system status: describe-file-systems → Lifecycle
  4. Check CloudWatch alarms: Lambda errors, Step Functions failures
  5. Pause ingestion: Disable EventBridge Schedules for affected UC pipelines
  6. Wait and retry: S3 AP recovery may take 15-60 min after throughput changes
  7. Escalate: If unavailable > 60 min, contact AWS Support with file system ID
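
Steps 1, 3, and 6 can be scripted. The sketch below is one possible shape, assuming boto3-style clients are passed in (`boto3.client("s3")` and `boto3.client("fsx")`); the function names, polling interval, and 60-minute limit are illustrative. Errors are caught generically so the sketch runs without botocore installed.

```python
import time


def s3ap_is_healthy(s3_client, access_point_alias: str) -> bool:
    """Step 1: cheap ListObjectsV2 probe against the S3 AP alias."""
    try:
        s3_client.list_objects_v2(Bucket=access_point_alias, MaxKeys=1)
        return True
    except Exception as exc:  # botocore raises ClientError in practice
        code = getattr(exc, "response", {}).get("Error", {}).get(
            "Code", type(exc).__name__)
        print(f"S3 AP probe failed: {code}")
        return False


def file_system_lifecycle(fsx_client, file_system_id: str) -> str:
    """Step 3: read the Lifecycle field from describe-file-systems."""
    resp = fsx_client.describe_file_systems(FileSystemIds=[file_system_id])
    return resp["FileSystems"][0]["Lifecycle"]


def wait_for_recovery(s3_client, access_point_alias: str,
                      interval_s: int = 60, max_wait_s: int = 3600,
                      sleep=time.sleep) -> bool:
    """Step 6: poll until the probe succeeds; False means escalate (step 7)."""
    waited = 0
    while waited <= max_wait_s:
        if s3ap_is_healthy(s3_client, access_point_alias):
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

Because the file system stayed AVAILABLE throughout the incident described in Section 4, the Lifecycle check alone is not a sufficient health signal; the S3 AP probe is the one that matters.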

See Incident Response Playbook for full procedures.


What's Next (Phase 15 candidates)

Update: Phase 15 expanded the pattern library from 17 to 28 industry-specific use cases. Items 1-2 below remain pending AWS feature availability. Items 3-4 are carried forward in Phase 15's What's Next.

  1. FlexCache × S3 AP integration — pending AWS feature availability (not yet supported)
  2. FC1 Recovery Metrics — route decision latency, cache health detection, failover timing (depends on #1)
  3. Replay Storm with real FPolicy server — TCP-level replay characteristics (requires ECS re-deploy)
  4. VPC-internal Lambda with VPC Origin S3 AP — true VPC-internal path (requires new AP with NetworkOrigin=VPC)

Multi-Account OAM validation completed 2026-05-25 — cross-account CloudWatch Metrics, Logs, and X-Ray Traces confirmed working.

Field feedback tracked: S3 AP disruption during throughput change, presigned URL documentation gap, VPC-origin benchmark gap, and FlexCache × S3 AP feature dependency. Details in docs/ontap-integration-notes.md.


Stats

  • Files changed: 200+ (documentation, translations, shared modules, templates)
  • New documents: Partner/SI one-pager (JP/EN/KO/ZH-CN), cost calculator, customization guide, incident response playbook, demo mode guide, comparison alternatives, PoC Go/No-Go template
  • New shared modules: data_classification.py, human_review.py, schemas/events.py
  • Benchmark runs: 7 (128/256/512 MBps Internet × 2 file sizes + Lambda egress + SQS replay storm simulation)
  • Templates fixed: 5 (cfn-lint errors: RecursiveDeleteOption, SNSPublishMessagePolicy, Handler path)
  • Translations added: 20 files (FC1-FC6 ko/zh-CN + FC1/FC3 full 8-lang)
  • samconfig.toml.example: 24 patterns
  • Output JSON samples: 24 patterns
  • DEV.to articles updated: 6 (4 Update Notes + 2 Series changes)
  • AWS Support cases: 1 resolved (S3 AP ServiceUnavailable — throughput change related)
  • Operational discoveries: 1 (throughput change → S3 AP disruption, now resolved)
  • Cost savings: ~$346/month (v4-test-demo + FPolicy server + VPC Endpoints + stopped EC2 instances)
  • SQS Replay Storm Simulation: 10,000 events, 0% loss in downstream SQS/consumer path

Who Should Care About Phase 14?

  • Partners and SIs get a one-pager for first conversations and recommended questions for each FC pattern
  • Operations teams learn that throughput capacity changes can disrupt S3 AP access
  • Architects get standardized benchmark methodology with hypothesis-driven testing
  • Developers get Presigned URL clarification — works but don't depend on it
  • Standard S3 bucket users learn where FSx for ONTAP S3 AP differs from S3 bucket semantics, especially presigned URLs, availability, and operational dependencies
  • Serverless-first teams learn where the serverless processing plane ends and FSx for ONTAP operational considerations begin

What You Can Do Today

Phase 14 delivers immediately usable assets:

  1. Use the Partner/SI one-pager for your next customer conversation about FSx for ONTAP + serverless
  2. Check the S3AP Compatibility Notes for the latest Presigned URL and troubleshooting guidance
  3. Plan throughput changes carefully — add S3 AP health checks to your maintenance runbook
  4. Use the Sizing Guidance tables (Sections 6-7) to set MaxConcurrency for your workload
  5. Review the S3 Bucket User Guide before porting existing S3 applications to FSx for ONTAP S3 AP
  6. Review the ONTAP Integration Notes before attaching S3 AP workflows to production SVMs and volumes

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Full series: FSx for ONTAP S3 Access Points on DEV.to
