TL;DR
Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.
Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.
This is Phase 12 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10 and Phase 11, Phase 12 delivers:
- Capacity Guardrails: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics
- Secrets Rotation: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on 90-day interval
- Synthetic Monitoring: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)
- Capacity Forecasting: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule
- Data Lineage Tracking: DynamoDB table with GSI for processing history and opt-in integration
- Protobuf TCP Framing: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader
- SLO Definition: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection
- FPolicy Pipeline E2E: NFS file creation → FPolicy → SQS delivery confirmed
- Persistent Store Replay: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios
- Property-Based Testing: 16 Hypothesis properties, 53 tests, 3 bugs discovered
- S3 Access Point Deep Dive: Multi-layer authorization, IAM ARN format, VPC network constraints
Key metrics: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.
Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS
The problem
FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.
The solution
A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:
graph LR
A[Auto-Expand Request] --> B{GuardrailMode?}
B -->|DRY_RUN| C[Log + Allow<br/>fail-open on DDB error]
B -->|ENFORCE| D[Check + Block<br/>fail-closed on DDB error]
B -->|BREAK_GLASS| E[Bypass All Checks<br/>SNS Alert + Audit Log]
C --> F[DynamoDB Tracking]
D --> F
E --> F
F --> G[CloudWatch EMF Metrics]
| Mode | Behavior on Check Failure | Behavior on DynamoDB Error |
|---|---|---|
| `DRY_RUN` | Log warning, allow action | Fail-open (allow) |
| `ENFORCE` | Block action, emit metric | Fail-closed (deny) |
| `BREAK_GLASS` | Skip all checks | SNS alert + audit log |
Core implementation
from shared.guardrails import CapacityGuardrail, GuardrailMode
guardrail = CapacityGuardrail() # Mode from GUARDRAIL_MODE env var
result = guardrail.check_and_execute(
    action_type="volume_grow",
    requested_gb=50.0,
    execute_fn=my_grow_function,
    volume_id="vol-abc123",
)

if result.allowed:
    print(f"Action executed: {result.action_id}")
else:
    print(f"Action denied: {result.reason}")
    # Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active
Three safety checks (ENFORCE mode)
- Rate limit: Max 10 actions per day per action type
- Daily cap: Max 500 GB cumulative expansion per day
- Cooldown: 300-second minimum interval between actions
All thresholds are configurable via environment variables (GUARDRAIL_RATE_LIMIT, GUARDRAIL_DAILY_CAP_GB, GUARDRAIL_COOLDOWN_SECONDS).
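For reference, here is a minimal sketch of how those three checks could be evaluated against the DynamoDB tracking item described below. The helper name and tracking-item shape are illustrative assumptions, not the actual shared.guardrails implementation; only the environment variable names and default thresholds come from the project.

import os
from datetime import datetime, timezone

RATE_LIMIT = int(os.environ.get("GUARDRAIL_RATE_LIMIT", "10"))
DAILY_CAP_GB = float(os.environ.get("GUARDRAIL_DAILY_CAP_GB", "500"))
COOLDOWN_SECONDS = int(os.environ.get("GUARDRAIL_COOLDOWN_SECONDS", "300"))

def evaluate_checks(tracking_item: dict, requested_gb: float) -> str | None:
    """Return a denial reason, or None if the action is allowed (illustrative helper)."""
    if tracking_item.get("action_count", 0) >= RATE_LIMIT:
        return "rate_limit_exceeded"
    if tracking_item.get("daily_total_gb", 0.0) + requested_gb > DAILY_CAP_GB:
        return "daily_cap_exceeded"
    last_ts = tracking_item.get("last_action_ts")
    if last_ts:
        last = datetime.fromisoformat(last_ts.replace("Z", "+00:00"))
        if (datetime.now(timezone.utc) - last).total_seconds() < COOLDOWN_SECONDS:
            return "cooldown_active"
    return None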
DynamoDB tracking schema
| Attribute | Type | Description |
|---|---|---|
| `pk` | String | Action type (e.g., volume_grow) |
| `sk` | String | Date (YYYY-MM-DD) |
| `daily_total_gb` | Number | Cumulative GB expanded today |
| `action_count` | Number | Number of actions today |
| `last_action_ts` | String | ISO timestamp of last action |
| `actions` | List | Audit trail of all actions |
| `ttl` | Number | 30-day auto-expiry |
BREAK_GLASS production considerations
In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.
2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation
The problem
ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.
The solution
A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:
sequenceDiagram
participant SM as Secrets Manager
participant Lambda as Rotation Lambda (VPC)
participant ONTAP as FSx ONTAP REST API
SM->>Lambda: Step 1: createSecret
Lambda->>SM: Generate new password, store as AWSPENDING
SM->>Lambda: Step 2: setSecret
Lambda->>ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
ONTAP-->>Lambda: 200 OK
SM->>Lambda: Step 3: testSecret
Lambda->>ONTAP: GET /api/cluster (using new password)
ONTAP-->>Lambda: 200 OK (cluster UUID returned)
SM->>Lambda: Step 4: finishSecret
Lambda->>SM: Promote AWSPENDING → AWSCURRENT
Key design decisions
- VPC deployment: Lambda must be in the same VPC as the ONTAP management LIF
- 90-day interval: Configurable via CloudFormation parameter
- Validation: Step 3 (testSecret) verifies the new password works by calling the ONTAP cluster API
- Rollback safety: If testSecret fails, the old password remains as AWSCURRENT
Bugs discovered during live testing
Three bugs were found and fixed during the actual rotation execution:
- AWSPENDING empty check: createSecret must handle the case where get_secret_value(VersionStage='AWSPENDING') raises ResourceNotFoundException (see the sketch after this list)
- management_ip fallback: The Lambda must support both management_ip (new) and ontap_mgmt_ip (legacy) keys in the secret JSON
- Cluster UUID validation: testSecret now validates the response contains a valid uuid field, not just HTTP 200
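As an illustration of the first fix, here is a minimal sketch of the createSecret step handling a missing AWSPENDING version. It assumes the standard Secrets Manager rotation event fields (SecretId, ClientRequestToken) and a password key in the secret JSON; it is not the exact Lambda code from the repository.

import json
import boto3

sm = boto3.client("secretsmanager")

def create_secret(secret_arn: str, token: str) -> None:
    """createSecret step: generate a new password only if AWSPENDING does not exist yet."""
    try:
        # An AWSPENDING version for this token means a retried invocation; nothing to do.
        sm.get_secret_value(SecretId=secret_arn, VersionId=token, VersionStage="AWSPENDING")
        return
    except sm.exceptions.ResourceNotFoundException:
        pass  # Normal first-invocation path: no pending version yet.

    current = sm.get_secret_value(SecretId=secret_arn, VersionStage="AWSCURRENT")
    secret = json.loads(current["SecretString"])
    # Exclude characters that are awkward in ONTAP CLI/REST contexts (illustrative choice).
    secret["password"] = sm.get_random_password(
        PasswordLength=32, ExcludeCharacters="\"'\\/@",
    )["RandomPassword"]

    sm.put_secret_value(
        SecretId=secret_arn,
        ClientRequestToken=token,
        SecretString=json.dumps(secret),
        VersionStages=["AWSPENDING"],
    )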
Verification result
Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret): ✅ ONTAP password changed via REST API
Step 3 (testSecret): ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT
Operational note
Rotating fsxadmin affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default — ensure the rotation Lambda's urllib3 or requests configuration handles certificate verification appropriately (see shared/ontap_client.py for the pattern used in this project).
For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing fsxadmin across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.
3. Synthetic Monitoring — CloudWatch Synthetics Canary
The problem
The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.
The solution
A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:
- ONTAP Health Check: REST API call to the management endpoint (VPC-internal)
- S3 Access Point Check: ListObjectsV2 against the S3AP alias
Critical finding: network-origin and endpoint configuration matter
During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.
This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS documents support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.
In this Phase 12 environment (Internet-origin S3 AP), the operational fix was to split monitoring into two paths:
| Check | Observed requirement in this environment | Result |
|---|---|---|
| ONTAP REST API | VPC-internal access to management LIF | ✅ Works |
| S3AP health check | Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy | ⚠️ Timed out from the initial VPC Canary configuration |
Solution: Split into two monitoring paths:
- ONTAP health: VPC-internal Canary (confirmed working, 88ms response)
- S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)
This is documented as a critical constraint in docs/guides/s3ap-fsxn-specification.md.
Canary runtime version lesson
The template initially specified syn-python-selenium-3.0, which was deprecated on 2026-02-03. Updated to syn-python-selenium-11.0. CloudWatch Synthetics runtimes are deprecated frequently — parameterize the version or keep defaults current.
AWS builder lesson: VPC placement is a design choice
A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is connected to a VPC, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.
4. Capacity Forecasting — Linear Regression with stdlib Only
The problem
Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.
The solution
A Lambda function running on a daily EventBridge schedule:
- Fetches 30 days of FSx StorageUsed metrics from CloudWatch
- Performs linear regression using only Python's math module (zero external dependencies)
- Publishes DaysUntilFull as a CloudWatch custom metric
- Sends SNS alert when forecast drops below threshold (default: 30 days)
Linear regression implementation (stdlib only)
def linear_regression(data_points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares linear regression using only math module."""
    n = len(data_points)
    if n < 2:
        raise ValueError("Need at least 2 data points for regression")
    sum_x = sum_y = sum_xy = sum_x2 = 0.0
    for x, y in data_points:
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x
    denominator = n * sum_x2 - sum_x * sum_x
    if abs(denominator) < 1e-10:
        return (0.0, sum_y / n)
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return (slope, intercept)
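To connect the regression output to the published metric, here is a hedged sketch of how DaysUntilFull could be derived from the fitted slope and intercept. The function name and the -1/0 encodings follow the edge-case table below; the actual forecast Lambda may structure this differently.

def days_until_full(slope: float, intercept: float,
                    current_day: float, capacity_gb: float) -> int:
    """Project days until StorageUsed reaches capacity (illustrative helper)."""
    current_usage_gb = slope * current_day + intercept
    if current_usage_gb >= capacity_gb:
        return 0      # Already at or over capacity: alert immediately
    if slope <= 0:
        return -1     # Flat or shrinking usage never fills up
    full_day = (capacity_gb - intercept) / slope  # Solve capacity = slope * x + intercept
    return int(full_day - current_day)

# Roughly reproduces the live-verification order of magnitude:
# ~0.006 GB/day growth on a 1,024 GB volume → ~170,000 days
print(days_until_full(slope=0.006, intercept=0.3, current_day=30.0, capacity_gb=1024.0))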
Edge cases handled
| Scenario | DaysUntilFull | Behavior |
|---|---|---|
| < 2 data points | -1 | Insufficient data, no prediction |
| slope ≤ 0 (shrinking/flat) | -1 | Never fills up |
| Already over capacity | 0 | Immediate alert |
| Very low usage (0.03%) | 169,374 | Normal — far future prediction |
Live verification
{
  "days_until_full": 169374,
  "current_usage_pct": 0.03,
  "total_capacity_gb": 1024.0,
  "growth_rate_gb_per_day": 0.006,
  "forecast_date": "2490-02-06T06:26:42Z"
}
The test environment has 0.03% usage — the prediction of 169,374 days is correct behavior. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.
This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat DaysUntilFull as an early-warning signal, not an exact prediction.
5. Data Lineage Tracking — DynamoDB with GSI
The problem
When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.
The solution
A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:
graph TD
subgraph "DynamoDB: fsxn-s3ap-data-lineage"
PK[PK: source_file_key<br/>SK: processing_timestamp]
GSI[GSI: uc_id-timestamp-index<br/>PK: uc_id, SK: processing_timestamp]
end
Q1[Query by file] -->|PK lookup| PK
Q2[Query by UC + time range] -->|GSI query| GSI
Q3[Query by execution ARN] -->|Scan + filter| PK
For high-volume environments, consider adding a dedicated GSI on step_functions_execution_arn. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.
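For the second query pattern (by UC and time range), a minimal boto3 sketch. The table and index names come from the diagram above, and attribute names follow the LineageRecord fields shown below; the exact key schema is an assumption.

import boto3
from boto3.dynamodb.conditions import Key

lineage = boto3.resource("dynamodb").Table("fsxn-s3ap-data-lineage")

# Query pattern 2: all records for one UC inside a time window (GSI query, no scan).
resp = lineage.query(
    IndexName="uc_id-timestamp-index",
    KeyConditionExpression=(
        Key("uc_id").eq("legal-compliance")
        & Key("processing_timestamp").between(
            "2026-05-01T00:00:00Z", "2026-05-31T23:59:59Z"
        )
    ),
)
for item in resp["Items"]:
    print(item["source_file_key"], item["status"], item.get("duration_ms"))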
Integration helper (opt-in)
from shared.lineage import LineageTracker, LineageRecord

tracker = LineageTracker()
record = LineageRecord(
    source_file_key="/vol1/legal/contracts/deal-001.pdf",
    processing_timestamp="2026-05-16T14:30:45.123Z",
    step_functions_execution_arn="arn:aws:states:...:execution:...",
    uc_id="legal-compliance",
    output_keys=["s3://output-bucket/legal/reports/deal-001-analysis.json"],
    status="success",
    duration_ms=4523,
)
lineage_id = tracker.record(record)
Design principles
- Non-blocking: Write failures emit a warning log but never interrupt the main processing pipeline (see the sketch after this list)
- TTL: 365-day auto-expiry via DynamoDB TTL (configurable via LINEAGE_TTL_DAYS environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)
- Opt-in: UCs integrate by importing the helper — no mandatory coupling
- PAY_PER_REQUEST: No capacity planning needed for variable workloads
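A minimal sketch of the non-blocking principle, wrapping the LineageTracker.record() call from the helper above; the exception handling shown here is illustrative, not the exact shared.lineage implementation.

import logging

logger = logging.getLogger(__name__)

def record_lineage_safely(tracker, record) -> str | None:
    """Write a lineage record without ever failing the main pipeline."""
    try:
        return tracker.record(record)
    except Exception:  # deliberately broad: lineage is best-effort, never blocking
        logger.warning("Lineage write failed; continuing pipeline", exc_info=True)
        return None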
Future: compliance-grade lineage (v2)
For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future LineageRecord v2:
| Field | Purpose |
|---|---|
| `input_checksum` | SHA-256 of source file for integrity verification |
| `output_checksum` | SHA-256 of generated output |
| `fpolicy_sequence_number` | ONTAP-assigned sequence for ordering |
| `policy_version` | FPolicy policy configuration version |
| `uc_template_version` | UC CloudFormation template version |
| `guardrail_mode` | Active guardrail mode at processing time |
| `retention_profile` | Retention class for compliance tiering |
For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.
6. Protobuf TCP Framing — Adaptive Reader
The problem
Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing read_fpolicy_message() assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.
The solution
An adaptive ProtobufFrameReader that supports three framing modes:
graph TD
A[Incoming TCP Stream] --> B{FramingMode}
B -->|AUTO_DETECT| C[Probe first 4 bytes]
C -->|Valid uint32 length| D[LENGTH_PREFIXED]
C -->|Otherwise| E[FRAMELESS]
B -->|LENGTH_PREFIXED| D
B -->|FRAMELESS| E
D --> F[4-byte big-endian header → payload]
E --> G[varint-delimited → payload]
F --> H[Decoded Message]
G --> H
Three modes
| Mode | Wire Format | Use Case |
|---|---|---|
| `LENGTH_PREFIXED` | 4-byte big-endian length + payload | XML mode (legacy) |
| `FRAMELESS` | varint-delimited protobuf | Protobuf mode (ONTAP 9.15.1+) |
| `AUTO_DETECT` | Probe first bytes, then lock mode | Unknown/mixed environments |
Auto-detection heuristic
async def _auto_detect_and_read(self) -> bytes | None:
    """Probe first 4 bytes to determine framing mode."""
    peek = await self._reader.readexactly(4)
    candidate_length = struct.unpack("!I", peek)[0]
    if 0 < candidate_length <= self._max_message_size:
        # Valid length header → LENGTH_PREFIXED
        self._detected_mode = FramingMode.LENGTH_PREFIXED
        payload = await self._reader.readexactly(candidate_length)
        return payload
    else:
        # Not a valid length → FRAMELESS (varint-delimited)
        self._detected_mode = FramingMode.FRAMELESS
        self._buffer = peek
        return await self._read_varint_delimited()
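For reference, reading one varint-delimited frame looks roughly like the following. This is a standalone sketch of the standard protobuf varint length prefix, not the project's _read_varint_delimited() implementation, and it ignores the four already-buffered probe bytes for simplicity.

import asyncio

async def read_varint_delimited(reader: asyncio.StreamReader,
                                max_size: int = 1 << 20) -> bytes | None:
    """Read one varint-length-prefixed message from the stream (illustrative)."""
    length, shift = 0, 0
    while True:
        try:
            byte = (await reader.readexactly(1))[0]
        except asyncio.IncompleteReadError:
            return None                 # Graceful EOF before or inside the length prefix
        length |= (byte & 0x7F) << shift
        if not byte & 0x80:             # High bit clear: last byte of the varint
            break
        shift += 7
        if shift > 35:                  # A uint32 length never needs more than 5 varint bytes
            raise ValueError("Malformed varint length prefix")
    if length > max_size:
        raise ValueError(f"Message of {length} bytes exceeds max size {max_size}")
    return await reader.readexactly(length)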
Safety features
- Max message size enforcement (default 1 MB): Prevents DoS via malformed length headers
- FramingError exception: Structured error with offset and raw data for debugging
- Graceful EOF handling: Returns None on connection close without raising
Integration with existing FPolicy server
from shared.integrations.protobuf_integration import create_fpolicy_reader, read_fpolicy_message_v2
# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
reader = create_fpolicy_reader(stream)
message = await read_fpolicy_message_v2(reader or stream)
Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.
Phase 13 protobuf validation scope
The following questions will be confirmed with NetApp support during live wire validation:
- Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)
- Message boundary behavior under high throughput
- Keep-alive behavior in protobuf mode vs XML mode
- Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?
- Mixed-mode migration path (XML → protobuf transition without event loss)
- Maximum message size guidance from ONTAP side
7. SLO Definition — 4 Targets with CloudWatch Dashboard
The problem
Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.
The solution
Four SLO targets covering the critical path of the event-driven pipeline:
| SLO | Metric | Target | SLO met when |
|---|---|---|---|
| Event Ingestion Latency | `EventIngestionLatency_ms` | P99 < 5,000 ms | LessThanThreshold |
| Processing Success Rate | `ProcessingSuccessRate_pct` | > 99.5% | GreaterThanThreshold |
| Reconnect Time | `FPolicyReconnectTime_sec` | < 30 sec | LessThanThreshold |
| Replay Completion Time | `ReplayCompletionTime_sec` | < 300 sec (5 min) | LessThanThreshold |
For success rate, the CloudWatch Alarm fires when the metric drops below 99.5% (ComparisonOperator: LessThanThreshold), even though the SLO target is expressed as "> 99.5%".
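A hedged boto3 sketch of that success-rate alarm; the namespace and SNS topic ARN are placeholders, not the deployed stack's values:

import boto3

cloudwatch = boto3.client("cloudwatch")

# SLO target is "> 99.5%", so the alarm fires when the metric drops BELOW 99.5.
cloudwatch.put_metric_alarm(
    AlarmName="fsxn-s3ap-slo-success-rate",
    Namespace="FSxN/S3AP",                      # placeholder namespace
    MetricName="ProcessingSuccessRate_pct",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,                        # 3 consecutive periods, per the alarm table below
    Threshold=99.5,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",            # illustrative choice for no-data handling
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:fsxn-alerts"],  # placeholder ARN
)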
CloudWatch Dashboard
The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):
from shared.slo import SLO_TARGETS, evaluate_slos, generate_dashboard_widgets

# Evaluate all SLOs programmatically
results = evaluate_slos(cloudwatch_client)
for r in results:
    status = "MET" if r.met else "VIOLATED"
    print(f"{r.slo_name}: {status} (value={r.value}, threshold={r.threshold})")

# Generate dashboard widget JSON for CloudFormation
widgets = generate_dashboard_widgets(region="ap-northeast-1")
Alarm-based violation detection
Each SLO has a corresponding CloudWatch Alarm:
| Alarm Name | State | Evaluation |
|---|---|---|
| `fsxn-s3ap-slo-ingestion-latency` | OK | 3 consecutive periods |
| `fsxn-s3ap-slo-success-rate` | OK | 3 consecutive periods |
| `fsxn-s3ap-slo-reconnect-time` | OK | 3 consecutive periods |
| `fsxn-s3ap-slo-replay-completion` | OK | 3 consecutive periods |
All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.
8. FPolicy Pipeline E2E Verification
The problem
Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.
The verification
sequenceDiagram
participant NFS as NFS Client (Bastion)
participant ONTAP as FSx for ONTAP
participant FP as FPolicy Server (Fargate)
participant SQS as SQS Queue
NFS->>ONTAP: echo "test" > /mnt/fpolicy_vol/test.txt
ONTAP->>FP: NOTI_REQ (FILE_CREATE event)
FP->>FP: Parse event, extract metadata
FP->>SQS: SendMessage (JSON payload)
SQS-->>SQS: Message available for consumers
Timeline (actual observed)
| Time | Event | Detail |
|---|---|---|
| T+0s | TCP connection test | ONTAP → Fargate IP (10.0.128.98:9898) |
| T+10s | Session established | NEGO_REQ → NEGO_RESP handshake |
| T+12s | KEEP_ALIVE starts | 2-minute interval |
| T+30s | NFS file created | echo "test" > /mnt/fpolicy_vol/test_fpolicy_event.txt |
| T+31s | NOTI_REQ received | FPolicy server receives file creation event |
| T+32s | SQS delivery | Event sent to SQS queue (FPolicy_Q) |
SQS message format
{
  "event_type": "FILE_CREATE",
  "svm_name": "FSxN_OnPre",
  "volume_name": "vol1",
  "file_path": "/vol1/test_fpolicy_event.txt",
  "client_ip": "10.0.128.98",
  "timestamp": "2026-05-16T08:45:32Z",
  "session_id": 1,
  "sequence_number": 1
}
IAM issue discovered and fixed
The ECS task role's SQS policy used a Resource ARN pattern arn:aws:sqs:...:fsxn-fpolicy-* that didn't match the actual queue name FPolicy_Q. Fix: use explicit ARN or * wildcard in the template.
Lesson: an IAM resource pattern that doesn't match the actual queue name fails silently; nothing breaks at deploy time, and the gap only surfaces when message delivery is denied at runtime. Either parameterize the queue ARN or use a broader resource pattern.
Event contract assumptions
The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:
- Duplicate events can occur (especially during Persistent Store replay)
- Delivery order is not guaranteed (confirmed in Section 9)
- Consumers must be idempotent
- file_path + timestamp + sequence_number serves as an idempotency key candidate (see the sketch after this list)
- Replay events may arrive after newer events
- Schema versioning should be introduced before multi-UC production rollout
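A minimal sketch of consumer-side deduplication using that idempotency key. The DynamoDB conditional-write pattern and the fsxn-event-dedup table are illustrative; they are not part of the Phase 12 stacks.

import time
import boto3
from botocore.exceptions import ClientError

dedup = boto3.resource("dynamodb").Table("fsxn-event-dedup")  # hypothetical table

def seen_before(event: dict) -> bool:
    """Return True if this FPolicy event was already processed (idempotency guard)."""
    key = f'{event["file_path"]}#{event["timestamp"]}#{event["sequence_number"]}'
    try:
        dedup.put_item(
            Item={"pk": key, "ttl": int(time.time()) + 7 * 86400},  # keep markers for 7 days
            ConditionExpression="attribute_not_exists(pk)",
        )
        return False      # First time this event is seen
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True   # Duplicate (e.g., Persistent Store replay): safe to skip
        raise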
9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios
The problem
Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.
Important prerequisite: FPolicy Persistent Store is available for asynchronous non-mandatory policies only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have only one Persistent Store, and the same store can be used by multiple policies within that SVM.
The test procedure
- Stop Fargate task (ECS stop-task)
- Create 5 files via NFS during downtime (replay-test-1.txt through replay-test-5.txt)
- Wait for ECS service auto-recovery (new task launch)
- Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)
- Verify all 5 events arrive in SQS
Results
| Metric | Value |
|---|---|
| Events generated during downtime | 5 |
| Events replayed to SQS | 5 |
| Lost events | 0 |
| Replay delivery order | 3, 1, 2, 5, 4 (non-sequential) |
| Replay completion time | ~30 seconds |
Key observation: Out-of-order replay
Persistent Store replays events in a non-sequential order — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:
- Idempotency: Deduplicate by file path + timestamp
- Timestamp-based ordering: Sort by event timestamp, not arrival order
20-file burst validation
Additionally, a 20-file burst test confirmed zero event loss under higher load:
| Test | Files Created | Events Delivered | Loss |
|---|---|---|---|
| Replay (5 files) | 5 | 5 | 0 |
| Burst (20 files) | 20 | 20 | 0 |
Phase 13 replay storm metrics
The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:
| Metric | Purpose |
|---|---|
| Persistent Store volume usage before/after replay | Capacity planning for the store volume |
| Events queued vs events replayed | Completeness verification |
| Replay throughput (events/sec) | Performance baseline |
| Replay duration | SLO calibration |
| Out-of-order distance | Downstream buffer sizing |
| Duplicate events | Idempotency requirement validation |
| ONTAP EMS logs around disconnect/reconnect | Root cause correlation |
Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.
Operational framing: event durability as RPO/RTO
Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while ReplayCompletionTime_sec provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.
Phase 12 validation scope
| Scope | Phase 12 Assumption | Production Consideration |
|---|---|---|
| SVM | Single SVM validation | Multi-SVM needs per-SVM policy and Persistent Store planning |
| Volume | Test volume | Production volumes should be grouped by UC/event profile |
| Protocol | NFS-based E2E test | NFSv3/NFSv4.1/SMB replay validation remains Phase 13 |
| Event types | File create | Modify/delete/rename validation remains Phase 13 |
| FPolicy mode | Async non-mandatory | Required for Persistent Store (NetApp docs) |
10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests
The problem
Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need exhaustive input space exploration.
The approach
Using Python's Hypothesis library, we defined 16 properties across the Phase 12 modules:
| Property Group | Properties | Tests | Bugs Found |
|---|---|---|---|
| Protobuf Frame Reader | 5 (round-trip, max size, EOF, multi-message, auto-detect) | 18 | 1 |
| Capacity Guardrails | 4 (mode behavior, rate limit, daily cap, cooldown) | 14 | 1 |
| Data Lineage | 3 (record/query round-trip, GSI consistency, TTL) | 9 | 0 |
| SLO Evaluation | 2 (threshold comparison, no-data handling) | 6 | 1 |
| Capacity Forecast | 2 (regression accuracy, edge cases) | 6 | 0 |
| Total | 16 | 53 | 3 |
Bugs discovered
- Protobuf reader: AUTO_DETECT mode failed when the first 4 bytes happened to form a valid-looking length that exceeded max_message_size. Fix: treat oversized candidate lengths as a FRAMELESS indicator.
- Guardrails: BREAK_GLASS mode didn't emit the GuardrailBypass metric when the DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.
- SLO evaluation: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), max(datapoints, key=lambda dp: dp["Timestamp"]) was non-deterministic. Fix: add a secondary sort by value (see the snippet after this list).
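The deterministic fix is a one-line change to the sort key. A sketch, assuming the datapoint dicts carry the requested statistic under "Average" (the exact value field depends on the CloudWatch call used):

# Non-deterministic when two datapoints share a timestamp:
latest = max(datapoints, key=lambda dp: dp["Timestamp"])

# Deterministic: tie-break identical timestamps by the value itself.
latest = max(datapoints, key=lambda dp: (dp["Timestamp"], dp["Average"]))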
Example property test
@given(messages=st.lists(
    st.binary(min_size=1, max_size=1000),
    min_size=1, max_size=10,
))
@settings(max_examples=200)
def test_length_prefixed_round_trip(self, messages: list[bytes]):
    """Property: LENGTH_PREFIXED encode → decode preserves all messages."""
    stream_data = _make_length_prefixed_stream(messages)
    reader = _make_stream_reader(stream_data)
    frame_reader = ProtobufFrameReader(
        reader=reader,
        mode=FramingMode.LENGTH_PREFIXED,
        max_message_size=max(len(m) for m in messages) + 1,
    )
    decoded = []
    for _ in range(len(messages)):
        msg = asyncio.run(frame_reader.read_message())
        assert msg is not None
        decoded.append(msg)
    assert decoded == messages  # Round-trip property
11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints
The critical finding
FSx for ONTAP S3 Access Points are not standard S3 endpoints. They use the FSx data plane, which has different network routing characteristics than standard S3.
In this pattern library, FSx for ONTAP S3 Access Points serve as an AWS service integration boundary: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.
Multi-layer authorization model
graph TD
Client[S3 API Client] --> IAM{Layer 1: IAM Policy}
IAM -->|identity-based policy| AP{Layer 2: AP Resource Policy}
AP -->|resource policy| FS{Layer 3: File System Identity}
FS -->|UNIX UID or AD user| Volume[ONTAP Volume]
IAM -.->|❌ Denied| Block1[Access Denied]
AP -.->|❌ Denied| Block2[Access Denied]
FS -.->|❌ No permission| Block3[Access Denied]
AWS documents this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.
Correct IAM ARN format
{
  "Effect": "Allow",
  "Action": ["s3:ListBucket"],
  "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap"
}
{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap/object/*"
}
Common mistake: Using the S3AP alias (xxx-ext-s3alias) as a bucket ARN. The alias is only valid as the Bucket parameter in boto3 calls — IAM policies require the full access point ARN.
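The distinction in code, as a hedged sketch; the alias value below is a placeholder in the documented xxx-ext-s3alias shape, not a real access point alias:

import boto3

s3 = boto3.client("s3")

# Data-plane calls: pass the S3 Access Point ALIAS as the Bucket parameter.
resp = s3.list_objects_v2(
    Bucket="fsxn-eda-s3ap-0123456789abcdef-ext-s3alias",  # placeholder alias
    Prefix="legal/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# IAM policies are different: they must reference the full access point ARN
# (arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap), never the alias.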
VPC network constraint (environment-specific observation)
| Access Pattern | Observed Result | Notes |
|---|---|---|
| VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint) | ⚠️ Timeout in this config | Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment |
| Internet → S3 AP (NetworkOrigin=Internet) | ✅ | Routes correctly with valid IAM credentials |
| VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC) | Supported per AWS docs; not verified in Phase 12 | Requires VPC-origin AP and matching endpoint policy |
| VPC Lambda → ONTAP REST API | ✅ | Direct management LIF access |
Important: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS documents that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.
Architectural implication for this pattern: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:
- Run outside VPC (with Internet access)
- Use NAT Gateway for outbound routing
- Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions
Write support and practical constraints
FSx ONTAP S3 Access Points support PutObject, DeleteObject, multipart uploads (CreateMultipartUpload, UploadPart, CompleteMultipartUpload), and other write operations — they are not read-only. The access point compatibility table documents the full list of supported S3 API operations.
However, S3 Access Points are not full S3 buckets. Key constraints include:
- Maximum upload size: 5 GB
- Only
FSX_ONTAPstorage class - Only SSE-FSX encryption
- No ACLs (except
bucket-owner-full-control), no Object Versioning, no Object Lock, no presigned URLs
All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.
12. Cross-Project Feedback — Template Hardening
During Phase 12, the companion project fsxn-observability-integrations reviewed our CloudFormation templates and provided actionable feedback. All items were applied:
Security Group: SourceSecurityGroupId over CIDR
Before (broad):
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 9898
    ToPort: 9898
    CidrIp: "10.0.0.0/8"
After (precise):
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: !Ref FPolicyPort
    ToPort: !Ref FPolicyPort
    SourceSecurityGroupId: !Ref FsxnSvmSecurityGroupId
    Description: FPolicy TCP from FSxN SVM Security Group
This limits inbound traffic to only the FSxN SVM's security group rather than the entire VPC CIDR — a significant security improvement for production deployments.
ONTAP CLI: Deprecated vserver prefix
ONTAP 9.11+ deprecates the vserver prefix on FPolicy commands. Updated all templates and documentation (8 languages) to use the recommended format:
# Deprecated (still works for backward compatibility)
vserver fpolicy policy external-engine create -vserver FSxN_OnPre ...
# Recommended (ONTAP 9.11+)
fpolicy policy external-engine create -vserver FSxN_OnPre ...
KMS Decrypt: When it's needed (and when it's not)
Added documentation clarifying SQS encryption behavior:
- SqsManagedSseEnabled: true → kms:Decrypt is NOT needed (transparent)
- KmsMasterKeyId: alias/aws/sqs → kms:Decrypt IS needed
Our templates use SqsManagedSseEnabled: true, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.
EC2 AMI: Removed redundant Docker install
ECS-optimized AMIs ({{resolve:ssm:/aws/service/ecs/optimized-ami/...}}) already include Docker. Removed the unnecessary yum install -y docker from UserData scripts.
Cpu/Memory: String type is intentional
Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with AllowedValues provides better validation than Number type for this constrained parameter space.
13. What's Next — Phase 13 Outlook
Phase 12 completes the operational hardening layer. The pipeline now has the production hardening baseline:
- ✅ Capacity guardrails preventing runaway auto-scaling
- ✅ Automated secrets rotation on 90-day cycle
- ✅ Proactive capacity forecasting with daily predictions
- ✅ SLO-based observability with alarm-driven alerting
- ✅ Data lineage tracking for audit and debugging
- ✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios
- ✅ Property-based testing catching real bugs
Ownership boundary
| Layer | Primary Owner | Examples |
|---|---|---|
| Shared event platform | Platform / storage team | FPolicy server, SQS queue, EventBridge bus, Persistent Store |
| ONTAP operations | Storage team | SVM, volume, FPolicy policy, Persistent Store capacity |
| Security operations | Security / platform team | Secrets rotation, BREAK_GLASS approval, IAM policies |
| Workload UC | Application / data team | Step Functions, UC routing rules, output destinations |
| Observability | Platform + workload teams | SLO dashboard, UC-specific alarms, runbooks |
Production Readiness Matrix
| Capability | Phase 12 Status | Remaining Work |
|---|---|---|
| Capacity Guardrails | Verified (DRY_RUN/ENFORCE/BREAK_GLASS) | Approval workflow optional |
| Secrets Rotation | 4-step rotation verified | Ensure all clients read from Secrets Manager |
| SLO Dashboard | Deployed, 4 alarms active | Runbooks and alarm response automation in Phase 13 |
| Persistent Store Replay | 5-event + 20-event scenarios verified | 1000+ replay storm testing |
| S3AP Monitoring | ONTAP health path verified | Split S3AP health check (VPC-external) |
| Protobuf Framing | Property/integration tested | Live ONTAP protobuf wire validation |
| Multi-account OAM | Stack deployed conditionally | Second-account validation |
| Production UC E2E | Pipeline verified to SQS delivery | Full TriggerMode=EVENT_DRIVEN UC flow |
| Cost Dashboard | Not yet deployed | Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation |
Phase 13 candidates
Operational readiness:
- Canary S3AP check separation: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)
- SLO violation runbooks: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)
- Replay storm testing: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior
Enterprise deployment:
- Multi-account OAM validation: Deploy workload-account-oam-link.yaml in a second AWS account
- Shared platform vs workload boundary: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)
- Production UC end-to-end: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify the complete flow from NFS file creation through Step Functions execution to output generation
Protocol and cost:
- Protobuf live wire validation: Confirm protobuf TCP framing with NetApp support and validate AUTO_DETECT mode against real ONTAP protobuf traffic
- Cost optimization dashboard: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics
Decision trees and operational guides:
- Decision trees: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)
- NetApp Partner Delivery Checklist: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover
Cost model awareness
While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:
| Category | Cost Type | Driver |
|---|---|---|
| FPolicy server (Fargate/EC2) | Fixed baseline | Always-on listener |
| NAT Gateway | Fixed + per-GB | Required if VPC Lambda needs Internet-origin S3AP access |
| CloudWatch Synthetics | Per-canary-run | 5-minute interval = 8,640 runs/month |
| CloudWatch custom metrics + Logs | Per-metric + per-GB ingested | SLO metrics, FPolicy server logs |
| DynamoDB (lineage + guardrails) | Per-request (PAY_PER_REQUEST) | Event volume dependent |
| SQS / EventBridge | Per-message / per-event | Event volume dependent |
| Persistent Store volume | Per-GB provisioned | Sized for max queued events during downtime |
Design decision for new deployments: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).
NetworkOrigin decision table
Based on AWS documentation, the following decision criteria apply:
Choose VPC-origin when:
- All consumers are Lambda/ECS/EC2 inside the same VPC
- Private connectivity is mandatory (no internet-routed path allowed)
- VPC endpoint policy is part of the security boundary
- Network restriction is built-in (cannot be accidentally misconfigured)
Choose Internet-origin when:
- External accounts or on-premises clients need access
- Consumers are outside the bound VPC
- Internet-routed access with IAM controls is acceptable
- Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC
| Factor | VPC-origin | Internet-origin |
|---|---|---|
| Network enforcement | Built-in explicit Deny for non-VPC traffic | Policy-based only |
| VPC endpoint required | Yes (Gateway or Interface in bound VPC) | Only if using aws:SourceVpc conditions |
| Multi-VPC access | Via Interface endpoint + peering/TGW to bound VPC | Via policy conditions |
| Change access scope | Must recreate access point | Update policy |
| On-premises access | Via Interface endpoint in bound VPC | Direct with IAM credentials |
| Cost implication | VPC endpoint (Gateway=free, Interface=hourly) | NAT Gateway if VPC Lambda needs access |
Critical: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.
Phase 12 readiness by workload type
| Workload | Phase 12 Ready? | Notes |
|---|---|---|
| Controlled PoC / single-account | ✅ Ready | All core components verified |
| Low/moderate event volume (< 100 events/day) | ✅ Ready | 20-event burst validated |
| DRY_RUN guardrail validation | ✅ Ready | Safe to deploy immediately |
| Secrets rotation validation | ✅ Ready | 4-step rotation verified |
| High-volume replay storm (1000+ events) | ⏳ Phase 13 | Throughput curve and store capacity not yet measured |
| Multi-account production | ⏳ Phase 13 | OAM link deployed but second-account validation pending |
| Strict SLO operations requiring runbooks | ⏳ Phase 13 | Dashboard deployed, runbooks not yet written |
| Live protobuf production mode | ⏳ Phase 13 | Wire validation with NetApp support pending |
| Full EVENT_DRIVEN UC end-to-end | ⏳ Phase 13 | Pipeline verified to SQS, Step Functions flow pending |
Phase 13 runbook scope: first-response diagnostic bundle
For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:
# FPolicy status
fpolicy show -vserver <SVM> -fields policy-name,status
fpolicy policy external-engine show -vserver <SVM>
fpolicy persistent-store show -vserver <SVM>
# Connection and event state
fpolicy show-engine -vserver <SVM>
fpolicy show-passthrough-read-connection -vserver <SVM>
# EMS logs for FPolicy events
event log show -messagename *fpolicy*
Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.
Deployed Infrastructure
7 CloudFormation stacks deployed and verified:
| Stack | Status | Purpose |
|---|---|---|
| `fsxn-phase12-guardrails-table` | CREATE_COMPLETE | DynamoDB tracking table |
| `fsxn-phase12-lineage-table` | CREATE_COMPLETE | Data lineage DynamoDB + GSI |
| `fsxn-phase12-slo-dashboard` | CREATE_COMPLETE | CloudWatch dashboard + 4 alarms |
| `fsxn-phase12-oam-link` | CREATE_COMPLETE | Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13) |
| `fsxn-phase12-capacity-forecast` | CREATE_COMPLETE | Lambda + EventBridge schedule |
| `fsxn-phase12-secrets-rotation` | CREATE_COMPLETE | VPC Lambda + rotation config |
| `fsxn-phase12-synthetic-monitoring` | CREATE_COMPLETE | Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13 |
Test Results Summary
| Category | Count | Type | Result |
|---|---|---|---|
| Unit Tests | 116 | Local (CI-reproducible) | ✅ All pass |
| Property Tests (Hypothesis) | 53 | Local (CI-reproducible) | ✅ All pass |
| CloudFormation Deployments | 7 stacks | AWS integration | ✅ All CREATE_COMPLETE |
| Lambda Invocations | 2 (forecast + rotation) | AWS integration | ✅ Successful |
| FPolicy E2E | 1 pipeline test | AWS manual verification | ✅ Event delivered |
| Replay E2E | 5 events | AWS manual verification | ✅ Zero loss |
| 20-file burst | 20 events | AWS manual verification | ✅ Zero loss |
| Bugs found (property testing) | 3 | Local (CI-reproducible) | ✅ All fixed |
NetApp-Specific Takeaways
For NetApp users and partners evaluating this pattern:
- FPolicy Persistent Store works as the durability layer for asynchronous non-mandatory FPolicy policies (NetApp docs), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).
- S3 Access Points for FSx for ONTAP are not standard S3 buckets: they support selected S3 API operations including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).
- NetworkOrigin is a design-time decision. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.
- ONTAP-common vs AWS-specific: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.
- Operational readiness requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.
The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.
Conclusion
Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.
The property-based testing investment paid immediate dividends: 3 real bugs discovered in 53 tests that example-based testing missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.
With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.
Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10 · Phase 11