DEV Community


Operational Hardening — Guardrails, Secrets Rotation & SLO — FSx ONTAP S3AP Phase 12

TL;DR

Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.

Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.

This is Phase 12 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10 and Phase 11, Phase 12 delivers:

  • Capacity Guardrails: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics
  • Secrets Rotation: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on 90-day interval
  • Synthetic Monitoring: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)
  • Capacity Forecasting: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule
  • Data Lineage Tracking: DynamoDB table with GSI for processing history and opt-in integration
  • Protobuf TCP Framing: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader
  • SLO Definition: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection
  • FPolicy Pipeline E2E: NFS file creation → FPolicy → SQS delivery confirmed
  • Persistent Store Replay: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios
  • Property-Based Testing: 16 Hypothesis properties, 53 tests, 3 bugs discovered
  • S3 Access Point Deep Dive: Multi-layer authorization, IAM ARN format, VPC network constraints

Key metrics: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS

The problem

FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.

The solution

A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:

graph LR
    A[Auto-Expand Request] --> B{GuardrailMode?}
    B -->|DRY_RUN| C[Log + Allow<br/>fail-open on DDB error]
    B -->|ENFORCE| D[Check + Block<br/>fail-closed on DDB error]
    B -->|BREAK_GLASS| E[Bypass All Checks<br/>SNS Alert + Audit Log]
    C --> F[DynamoDB Tracking]
    D --> F
    E --> F
    F --> G[CloudWatch EMF Metrics]
| Mode | Behavior on Check Failure | Behavior on DynamoDB Error |
| --- | --- | --- |
| DRY_RUN | Log warning, allow action | Fail-open (allow) |
| ENFORCE | Block action, emit metric | Fail-closed (deny) |
| BREAK_GLASS | Skip all checks | SNS alert + audit log |

Core implementation

from shared.guardrails import CapacityGuardrail, GuardrailMode

guardrail = CapacityGuardrail()  # Mode from GUARDRAIL_MODE env var

result = guardrail.check_and_execute(
    action_type="volume_grow",
    requested_gb=50.0,
    execute_fn=my_grow_function,
    volume_id="vol-abc123",
)

if result.allowed:
    print(f"Action executed: {result.action_id}")
else:
    print(f"Action denied: {result.reason}")
    # Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active

Three safety checks (ENFORCE mode)

  1. Rate limit: Max 10 actions per day per action type
  2. Daily cap: Max 500 GB cumulative expansion per day
  3. Cooldown: 300-second minimum interval between actions

All thresholds are configurable via environment variables (GUARDRAIL_RATE_LIMIT, GUARDRAIL_DAILY_CAP_GB, GUARDRAIL_COOLDOWN_SECONDS).

DynamoDB tracking schema

| Attribute | Type | Description |
| --- | --- | --- |
| pk | String | Action type (e.g., volume_grow) |
| sk | String | Date (YYYY-MM-DD) |
| daily_total_gb | Number | Cumulative GB expanded today |
| action_count | Number | Number of actions today |
| last_action_ts | String | ISO timestamp of last action |
| actions | List | Audit trail of all actions |
| ttl | Number | 30-day auto-expiry |

DynamoDB Guardrails Table

BREAK_GLASS production considerations

In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.


2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation

The problem

ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.

The solution

A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:

sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM->>Lambda: Step 1: createSecret
    Lambda->>SM: Generate new password, store as AWSPENDING

    SM->>Lambda: Step 2: setSecret
    Lambda->>ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP-->>Lambda: 200 OK

    SM->>Lambda: Step 3: testSecret
    Lambda->>ONTAP: GET /api/cluster (using new password)
    ONTAP-->>Lambda: 200 OK (cluster UUID returned)

    SM->>Lambda: Step 4: finishSecret
    Lambda->>SM: Promote AWSPENDING → AWSCURRENT

Key design decisions

  • VPC deployment: Lambda must be in the same VPC as the ONTAP management LIF
  • 90-day interval: Configurable via CloudFormation parameter
  • Validation: Step 3 (testSecret) verifies the new password works by calling the ONTAP cluster API
  • Rollback safety: If testSecret fails, the old password remains as AWSCURRENT

Bugs discovered during live testing

Three bugs were found and fixed during the actual rotation execution:

  1. AWSPENDING empty check: createSecret must handle the case where get_secret_value(VersionStage='AWSPENDING') raises ResourceNotFoundException
  2. management_ip fallback: The Lambda must support both management_ip (new) and ontap_mgmt_ip (legacy) keys in the secret JSON
  3. Cluster UUID validation: testSecret now validates the response contains a valid uuid field, not just HTTP 200

Verification result

Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT

Operational note

Rotating fsxadmin affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default — ensure the rotation Lambda's urllib3 or requests configuration handles certificate verification appropriately (see shared/ontap_client.py for the pattern used in this project).

For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing fsxadmin across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.


3. Synthetic Monitoring — CloudWatch Synthetics Canary

The problem

The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.

The solution

A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:

  1. ONTAP Health Check: REST API call to the management endpoint (VPC-internal)
  2. S3 Access Point Check: ListObjectsV2 against the S3AP alias

Critical finding: network-origin and endpoint configuration matter

During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.

This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS documents support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.

In this Phase 12 environment (Internet-origin S3 AP), the operational fix was to split monitoring into two paths:

| Check | Observed requirement in this environment | Result |
| --- | --- | --- |
| ONTAP REST API | VPC-internal access to management LIF | ✅ Works |
| S3AP health check | Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy | ⚠️ Timed out from the initial VPC Canary configuration |

Solution: Split into two monitoring paths:

  • ONTAP health: VPC-internal Canary (confirmed working, 88ms response)
  • S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)

This is documented as a critical constraint in docs/guides/s3ap-fsxn-specification.md.

Canary runtime version lesson

The template initially specified syn-python-selenium-3.0, which was deprecated on 2026-02-03. Updated to syn-python-selenium-11.0. CloudWatch Synthetics runtimes are deprecated frequently — parameterize the version or keep defaults current.

AWS builder lesson: VPC placement is a design choice

A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is connected to a VPC, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.

Synthetics Canary


4. Capacity Forecasting — Linear Regression with stdlib Only

The problem

Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.

The solution

A Lambda function running on a daily EventBridge schedule:

  1. Fetches 30 days of FSx StorageUsed metrics from CloudWatch
  2. Performs linear regression using only Python's math module (zero external dependencies)
  3. Publishes DaysUntilFull as a CloudWatch custom metric
  4. Sends SNS alert when forecast drops below threshold (default: 30 days)

Linear regression implementation (stdlib only)

def linear_regression(data_points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares linear regression using only math module."""
    n = len(data_points)
    if n < 2:
        raise ValueError("Need at least 2 data points for regression")

    sum_x = sum_y = sum_xy = sum_x2 = 0.0
    for x, y in data_points:
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x

    denominator = n * sum_x2 - sum_x * sum_x
    if abs(denominator) < 1e-10:
        return (0.0, sum_y / n)

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return (slope, intercept)
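With the fitted slope (GB/day) and intercept, the DaysUntilFull projection reduces to a few lines. A hypothetical helper mirroring the edge cases handled in this section (function name and signature are illustrative, not the actual Lambda code):

```python
def days_until_full(slope: float, intercept: float,
                    capacity_gb: float, today_x: float) -> int:
    """Project days until StorageUsed reaches capacity.

    slope/intercept come from the regression above; today_x is today's
    position on the regression's x axis (days since the first datapoint).
    """
    current = slope * today_x + intercept
    if current >= capacity_gb:
        return 0    # already at/over capacity: immediate alert
    if slope <= 0:
        return -1   # flat or shrinking usage: never fills up
    return int((capacity_gb - current) / slope)
```

The "< 2 data points" case is handled upstream, since linear_regression raises before a forecast is attempted.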

Edge cases handled

| Scenario | DaysUntilFull | Behavior |
| --- | --- | --- |
| < 2 data points | -1 | Insufficient data, no prediction |
| slope ≤ 0 (shrinking/flat) | -1 | Never fills up |
| Already over capacity | 0 | Immediate alert |
| Very low usage (0.03%) | 169,374 | Normal — far future prediction |

Live verification

{
  "days_until_full": 169374,
  "current_usage_pct": 0.03,
  "total_capacity_gb": 1024.0,
  "growth_rate_gb_per_day": 0.006,
  "forecast_date": "2490-02-06T06:26:42Z"
}

The test environment has 0.03% usage — the prediction of 169,374 days is correct behavior. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.

This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat DaysUntilFull as an early-warning signal, not an exact prediction.

Capacity Forecast Lambda


5. Data Lineage Tracking — DynamoDB with GSI

The problem

When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.

The solution

A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:

graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key<br/>SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index<br/>PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] -->|PK lookup| PK
    Q2[Query by UC + time range] -->|GSI query| GSI
    Q3[Query by execution ARN] -->|Scan + filter| PK

For high-volume environments, consider adding a dedicated GSI on step_functions_execution_arn. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.

Integration helper (opt-in)

from shared.lineage import LineageTracker, LineageRecord

tracker = LineageTracker()
record = LineageRecord(
    source_file_key="/vol1/legal/contracts/deal-001.pdf",
    processing_timestamp="2026-05-16T14:30:45.123Z",
    step_functions_execution_arn="arn:aws:states:...:execution:...",
    uc_id="legal-compliance",
    output_keys=["s3://output-bucket/legal/reports/deal-001-analysis.json"],
    status="success",
    duration_ms=4523,
)
lineage_id = tracker.record(record)

Design principles

  • Non-blocking: Write failures emit a warning log but never interrupt the main processing pipeline
  • TTL: 365-day auto-expiry via DynamoDB TTL (configurable via LINEAGE_TTL_DAYS environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)
  • Opt-in: UCs integrate by importing the helper — no mandatory coupling
  • PAY_PER_REQUEST: No capacity planning needed for variable workloads
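As an illustration of the "query by UC + time range" pattern, a hypothetical request builder for the GSI. The index and attribute names follow the diagram above; the resulting dict would be passed to a boto3 `Table.query(**params)` call, which is omitted here to keep the sketch self-contained:

```python
def build_uc_query(uc_id: str, start_ts: str, end_ts: str) -> dict:
    """Build query parameters for the uc_id-timestamp-index GSI.

    Uses the string form of KeyConditionExpression so the dict can be
    unpacked directly into boto3's Table.query(**params).
    """
    return {
        "IndexName": "uc_id-timestamp-index",
        "KeyConditionExpression": (
            "uc_id = :uc AND processing_timestamp BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":uc": uc_id,
            ":start": start_ts,
            ":end": end_ts,
        },
        "ScanIndexForward": False,  # newest records first
    }
```

Because processing_timestamp is the GSI sort key, the time-range filter is a key condition rather than a post-query filter — no scan required.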

Future: compliance-grade lineage (v2)

For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future LineageRecord v2:

| Field | Purpose |
| --- | --- |
| input_checksum | SHA-256 of source file for integrity verification |
| output_checksum | SHA-256 of generated output |
| fpolicy_sequence_number | ONTAP-assigned sequence for ordering |
| policy_version | FPolicy policy configuration version |
| uc_template_version | UC CloudFormation template version |
| guardrail_mode | Active guardrail mode at processing time |
| retention_profile | Retention class for compliance tiering |

For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.


6. Protobuf TCP Framing — Adaptive Reader

The problem

Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing read_fpolicy_message() assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.

The solution

An adaptive ProtobufFrameReader that supports three framing modes:

graph TD
    A[Incoming TCP Stream] --> B{FramingMode}
    B -->|AUTO_DETECT| C[Probe first 4 bytes]
    C -->|Valid uint32 length| D[LENGTH_PREFIXED]
    C -->|Otherwise| E[FRAMELESS]
    B -->|LENGTH_PREFIXED| D
    B -->|FRAMELESS| E
    D --> F[4-byte big-endian header → payload]
    E --> G[varint-delimited → payload]
    F --> H[Decoded Message]
    G --> H

Three modes

| Mode | Wire Format | Use Case |
| --- | --- | --- |
| LENGTH_PREFIXED | 4-byte big-endian length + payload | XML mode (legacy) |
| FRAMELESS | varint-delimited protobuf | Protobuf mode (ONTAP 9.15.1+) |
| AUTO_DETECT | Probe first bytes, then lock mode | Unknown/mixed environments |

Auto-detection heuristic

async def _auto_detect_and_read(self) -> bytes | None:
    """Probe first 4 bytes to determine framing mode."""
    peek = await self._reader.readexactly(4)
    candidate_length = struct.unpack("!I", peek)[0]

    if 0 < candidate_length <= self._max_message_size:
        # Valid length header → LENGTH_PREFIXED
        self._detected_mode = FramingMode.LENGTH_PREFIXED
        payload = await self._reader.readexactly(candidate_length)
        return payload
    else:
        # Not a valid length → FRAMELESS (varint-delimited)
        self._detected_mode = FramingMode.FRAMELESS
        self._buffer = peek
        return await self._read_varint_delimited()
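The FRAMELESS branch delegates to varint-delimited reading. For reference, a synchronous sketch of that framing — each payload is prefixed by its length encoded as a standard protobuf base-128 varint (helper names here are illustrative, not the ProtobufFrameReader internals):

```python
def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a protobuf base-128 varint; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift  # low 7 bits carry data
        pos += 1
        if not (b & 0x80):             # high bit clear = last byte
            return result, pos
        shift += 7

def split_varint_delimited(stream: bytes) -> list[bytes]:
    """Split a varint-delimited byte stream into its payloads."""
    out, pos = [], 0
    while pos < len(stream):
        length, pos = read_varint(stream, pos)
        out.append(stream[pos:pos + length])
        pos += length
    return out
```

A 300-byte message, for example, is prefixed with the two-byte varint `AC 02` — which is exactly why a fixed 4-byte big-endian read misparses protobuf-mode traffic.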

Safety features

  • Max message size enforcement (default 1 MB): Prevents DoS via malformed length headers
  • FramingError exception: Structured error with offset and raw data for debugging
  • Graceful EOF handling: Returns None on connection close without raising

Integration with existing FPolicy server

from shared.integrations.protobuf_integration import create_fpolicy_reader, read_fpolicy_message_v2

# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
reader = create_fpolicy_reader(stream)
message = await read_fpolicy_message_v2(reader or stream)

Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.

Phase 13 protobuf validation scope

The following questions will be confirmed with NetApp support during live wire validation:

  • Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)
  • Message boundary behavior under high throughput
  • Keep-alive behavior in protobuf mode vs XML mode
  • Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?
  • Mixed-mode migration path (XML → protobuf transition without event loss)
  • Maximum message size guidance from ONTAP side

7. SLO Definition — 4 Targets with CloudWatch Dashboard

The problem

Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.

The solution

Four SLO targets covering the critical path of the event-driven pipeline:

| SLO | Metric | Target | SLO met when |
| --- | --- | --- | --- |
| Event Ingestion Latency | EventIngestionLatency_ms | P99 < 5,000 ms | LessThanThreshold |
| Processing Success Rate | ProcessingSuccessRate_pct | > 99.5% | GreaterThanThreshold |
| Reconnect Time | FPolicyReconnectTime_sec | < 30 sec | LessThanThreshold |
| Replay Completion Time | ReplayCompletionTime_sec | < 300 sec (5 min) | LessThanThreshold |

For success rate, the CloudWatch Alarm fires when the metric drops below 99.5% (ComparisonOperator: LessThanThreshold), even though the SLO target is expressed as "> 99.5%".
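This direction flip can be captured explicitly. A small sketch assuming a higher_is_better flag per SLO (illustrative names, not the actual shared.slo API):

```python
def slo_met(value: float, threshold: float, higher_is_better: bool) -> bool:
    """SLO check: success rate wants value ABOVE threshold, latency BELOW."""
    return value > threshold if higher_is_better else value < threshold

def alarm_comparison_operator(higher_is_better: bool) -> str:
    """The alarm fires on VIOLATION, i.e. the opposite direction of the SLO."""
    return "LessThanThreshold" if higher_is_better else "GreaterThanThreshold"
```

Keeping the direction in one place avoids the classic mistake of copy-pasting LessThanThreshold into every alarm definition.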

CloudWatch Dashboard

The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):

from shared.slo import SLO_TARGETS, evaluate_slos, generate_dashboard_widgets

# Evaluate all SLOs programmatically
results = evaluate_slos(cloudwatch_client)
for r in results:
    status = "MET" if r.met else "VIOLATED"
    print(f"{r.slo_name}: {status} (value={r.value}, threshold={r.threshold})")

# Generate dashboard widget JSON for CloudFormation
widgets = generate_dashboard_widgets(region="ap-northeast-1")

Alarm-based violation detection

Each SLO has a corresponding CloudWatch Alarm:

| Alarm Name | State | Evaluation |
| --- | --- | --- |
| fsxn-s3ap-slo-ingestion-latency | OK | 3 consecutive periods |
| fsxn-s3ap-slo-success-rate | OK | 3 consecutive periods |
| fsxn-s3ap-slo-reconnect-time | OK | 3 consecutive periods |
| fsxn-s3ap-slo-replay-completion | OK | 3 consecutive periods |

All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.

SLO Dashboard


8. FPolicy Pipeline E2E Verification

The problem

Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.

The verification

sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS->>ONTAP: echo "test" > /mnt/fpolicy_vol/test.txt
    ONTAP->>FP: NOTI_REQ (FILE_CREATE event)
    FP->>FP: Parse event, extract metadata
    FP->>SQS: SendMessage (JSON payload)
    SQS-->>SQS: Message available for consumers

Timeline (actual observed)

| Time | Event | Detail |
| --- | --- | --- |
| T+0s | TCP connection test | ONTAP → Fargate IP (10.0.128.98:9898) |
| T+10s | Session established | NEGO_REQ → NEGO_RESP handshake |
| T+12s | KEEP_ALIVE starts | 2-minute interval |
| T+30s | NFS file created | echo "test" > /mnt/fpolicy_vol/test_fpolicy_event.txt |
| T+31s | NOTI_REQ received | FPolicy server receives file creation event |
| T+32s | SQS delivery | Event sent to SQS queue (FPolicy_Q) |

SQS message format

{
  "event_type": "FILE_CREATE",
  "svm_name": "FSxN_OnPre",
  "volume_name": "vol1",
  "file_path": "/vol1/test_fpolicy_event.txt",
  "client_ip": "10.0.128.98",
  "timestamp": "2026-05-16T08:45:32Z",
  "session_id": 1,
  "sequence_number": 1
}

IAM issue discovered and fixed

The ECS task role's SQS policy used a Resource ARN pattern arn:aws:sqs:...:fsxn-fpolicy-* that didn't match the actual queue name FPolicy_Q. Fix: use explicit ARN or * wildcard in the template.

Lesson: an IAM Resource pattern that doesn't match the actual queue name isn't caught at deploy time — the mismatch only surfaces when SendMessage is denied at runtime. Either parameterize the queue ARN or use a broader resource pattern.

Event contract assumptions

The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:

  • Duplicate events can occur (especially during Persistent Store replay)
  • Delivery order is not guaranteed (confirmed in Section 9)
  • Consumers must be idempotent
  • file_path + timestamp + sequence_number serves as an idempotency key candidate
  • Replay events may arrive after newer events
  • Schema versioning should be introduced before multi-UC production rollout
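A minimal consumer-side sketch of the idempotency-key approach (in-memory only; a production consumer would back the seen set with DynamoDB conditional writes so deduplication survives restarts):

```python
def idempotency_key(event: dict) -> str:
    """Build the dedup key from the candidate fields named above."""
    return f"{event['file_path']}|{event['timestamp']}|{event['sequence_number']}"

def process_once(event: dict, seen: set[str]) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    key = idempotency_key(event)
    if key in seen:
        return False   # duplicate (e.g., Persistent Store replay): skip
    seen.add(key)
    # ... hand off to the actual UC processing here ...
    return True
```

With this in place, replayed or duplicated deliveries become no-ops rather than double-processing.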

9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios

The problem

Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.

Important prerequisite: FPolicy Persistent Store is available for asynchronous non-mandatory policies only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have only one Persistent Store, and the same store can be used by multiple policies within that SVM.

The test procedure

  1. Stop Fargate task (ECS stop-task)
  2. Create 5 files via NFS during downtime (replay-test-1.txt through replay-test-5.txt)
  3. Wait for ECS service auto-recovery (new task launch)
  4. Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)
  5. Verify all 5 events arrive in SQS

Results

| Metric | Value |
| --- | --- |
| Events generated during downtime | 5 |
| Events replayed to SQS | 5 |
| Lost events | 0 |
| Replay delivery order | 3, 1, 2, 5, 4 (non-sequential) |
| Replay completion time | ~30 seconds |

Key observation: Out-of-order replay

Persistent Store replays events in a non-sequential order — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:

  • Idempotency: Deduplicate by file path + timestamp
  • Timestamp-based ordering: Sort by event timestamp, not arrival order

20-file burst validation

Additionally, a 20-file burst test confirmed zero event loss under higher load:

| Test | Files Created | Events Delivered | Loss |
| --- | --- | --- | --- |
| Replay (5 files) | 5 | 5 | 0 |
| Burst (20 files) | 20 | 20 | 0 |

Phase 13 replay storm metrics

The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:

| Metric | Purpose |
| --- | --- |
| Persistent Store volume usage before/after replay | Capacity planning for the store volume |
| Events queued vs events replayed | Completeness verification |
| Replay throughput (events/sec) | Performance baseline |
| Replay duration | SLO calibration |
| Out-of-order distance | Downstream buffer sizing |
| Duplicate events | Idempotency requirement validation |
| ONTAP EMS logs around disconnect/reconnect | Root cause correlation |

Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.

Operational framing: event durability as RPO/RTO

Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while ReplayCompletionTime_sec provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.

Phase 12 validation scope

| Scope | Phase 12 Assumption | Production Consideration |
| --- | --- | --- |
| SVM | Single SVM validation | Multi-SVM needs per-SVM policy and Persistent Store planning |
| Volume | Test volume | Production volumes should be grouped by UC/event profile |
| Protocol | NFS-based E2E test | NFSv3/NFSv4.1/SMB replay validation remains Phase 13 |
| Event types | File create | Modify/delete/rename validation remains Phase 13 |
| FPolicy mode | Async non-mandatory | Required for Persistent Store (NetApp docs) |

10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests

The problem

Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need exhaustive input space exploration.

The approach

Using Python's Hypothesis library, we defined 16 properties across the Phase 12 modules:

| Property Group | Properties | Tests | Bugs Found |
| --- | --- | --- | --- |
| Protobuf Frame Reader | 5 (round-trip, max size, EOF, multi-message, auto-detect) | 18 | 1 |
| Capacity Guardrails | 4 (mode behavior, rate limit, daily cap, cooldown) | 14 | 1 |
| Data Lineage | 3 (record/query round-trip, GSI consistency, TTL) | 9 | 0 |
| SLO Evaluation | 2 (threshold comparison, no-data handling) | 6 | 1 |
| Capacity Forecast | 2 (regression accuracy, edge cases) | 6 | 0 |
| Total | 16 | 53 | 3 |

Bugs discovered

  1. Protobuf reader: AUTO_DETECT mode failed when the first 4 bytes happened to form a valid-looking length that exceeded max_message_size. Fix: treat oversized candidate lengths as FRAMELESS indicator.

  2. Guardrails: BREAK_GLASS mode didn't emit the GuardrailBypass metric when DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.

  3. SLO evaluation: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), max(datapoints, key=lambda dp: dp["Timestamp"]) was non-deterministic. Fix: add secondary sort by value.
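The bug-3 fix amounts to a deterministic tie-break when selecting the latest datapoint — a secondary key on the value, sketched here:

```python
def latest_datapoint(datapoints: list[dict]) -> dict:
    """Pick the latest datapoint; ties on Timestamp break by Value,
    so identical-timestamp aggregates no longer yield a random pick."""
    return max(datapoints, key=lambda dp: (dp["Timestamp"], dp["Value"]))
```

Hypothesis surfaced this because it freely generates duplicate timestamps, a case example-based tests never covered.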

Example property test

@given(messages=st.lists(
    st.binary(min_size=1, max_size=1000),
    min_size=1, max_size=10,
))
@settings(max_examples=200)
def test_length_prefixed_round_trip(self, messages: list[bytes]):
    """Property: LENGTH_PREFIXED encode → decode preserves all messages."""
    stream_data = _make_length_prefixed_stream(messages)
    reader = _make_stream_reader(stream_data)
    frame_reader = ProtobufFrameReader(
        reader=reader,
        mode=FramingMode.LENGTH_PREFIXED,
        max_message_size=max(len(m) for m in messages) + 1,
    )

    decoded = []
    for _ in range(len(messages)):
        msg = asyncio.run(frame_reader.read_message())
        assert msg is not None
        decoded.append(msg)

    assert decoded == messages  # Round-trip property

11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints

The critical finding

FSx for ONTAP S3 Access Points are not standard S3 endpoints. They use the FSx data plane, which has different network routing characteristics than standard S3.

In this pattern library, FSx for ONTAP S3 Access Points serve as an AWS service integration boundary: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.

Multi-layer authorization model

graph TD
    Client[S3 API Client] --> IAM{Layer 1: IAM Policy}
    IAM -->|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP -->|resource policy| FS{Layer 3: File System Identity}
    FS -->|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.->|❌ Denied| Block1[Access Denied]
    AP -.->|❌ Denied| Block2[Access Denied]
    FS -.->|❌ No permission| Block3[Access Denied]

AWS documents this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.

Correct IAM ARN format

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap/object/*"
    }
  ]
}

Common mistake: Using the S3AP alias (xxx-ext-s3alias) as a bucket ARN. The alias is only valid as the Bucket parameter in boto3 calls — IAM policies require the full access point ARN.
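A small sketch of the distinction: the alias (or the full access point ARN) goes in the Bucket parameter of S3 API calls, while IAM policies must use the ARN format above. The alias value here is a placeholder, and the client is passed in so the sketch stays self-contained:

```python
# Placeholder alias; the real value comes from the access point description.
S3AP_ALIAS = "fsxn-eda-s3ap-xxxxxxxx-ext-s3alias"

def list_prefix(s3_client, prefix: str):
    """List objects under a prefix via the S3AP alias.

    s3_client is a boto3 S3 client in real use; the alias is valid here
    as the Bucket parameter, but NOT as an IAM Resource ARN.
    """
    return s3_client.list_objects_v2(Bucket=S3AP_ALIAS, Prefix=prefix)
```

Mixing the two up produces confusing AccessDenied errors: the API call resolves the alias fine, but the IAM policy written against the alias never matches.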

VPC network constraint (environment-specific observation)

Access Pattern Observed Result Notes
VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint) ⚠️ Timeout in this config Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment
Internet → S3 AP (NetworkOrigin=Internet) Routes correctly with valid IAM credentials
VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC) Supported per AWS docs; not verified in Phase 12 Requires VPC-origin AP and matching endpoint policy
VPC Lambda → ONTAP REST API Direct management LIF access

Important: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS documents that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.

Architectural implication for this pattern: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:

  • Run outside VPC (with Internet access)
  • Use NAT Gateway for outbound routing
  • Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions

Write support and practical constraints

FSx ONTAP S3 Access Points support PutObject, DeleteObject, multipart uploads (CreateMultipartUpload, UploadPart, CompleteMultipartUpload), and other write operations — they are not read-only. The access point compatibility table documents the full list of supported S3 API operations.

However, S3 Access Points are not full S3 buckets. Key constraints include:

  • Maximum upload size: 5 GB
  • Only FSX_ONTAP storage class
  • Only SSE-FSX encryption
  • No ACLs (except bucket-owner-full-control), no Object Versioning, no Object Lock, no presigned URLs

All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.
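
As a minimal illustration of the write path, the following sketch picks an upload mode against the 5 GB ceiling and shows the boto3 call shape in comments. The ARN and key are placeholders, and the exact boundary (GB vs GiB) should be confirmed against the compatibility table:

```python
# Sketch: single PutObject vs multipart upload against the 5 GB ceiling
# noted above. Whether the limit is decimal GB or GiB should be confirmed
# in the AWS compatibility documentation; GiB is assumed here.
MAX_SINGLE_PUT = 5 * 1024**3  # assumed 5 GiB single-upload ceiling

def choose_upload_mode(size_bytes: int) -> str:
    """Multipart is required above the single-upload ceiling."""
    return "multipart" if size_bytes > MAX_SINGLE_PUT else "put_object"

# The access point ARN (or alias) is passed as Bucket:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(
#       Bucket="arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap",
#       Key="ingest/sample.bin",
#       Body=b"...")

print(choose_upload_mode(6 * 1024**3))
```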


12. Cross-Project Feedback — Template Hardening

During Phase 12, the companion project fsxn-observability-integrations reviewed our CloudFormation templates and provided actionable feedback. All items were applied:

Security Group: SourceSecurityGroupId over CIDR

Before (broad):

```yaml
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 9898
    ToPort: 9898
    CidrIp: "10.0.0.0/8"
```

After (precise):

```yaml
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: !Ref FPolicyPort
    ToPort: !Ref FPolicyPort
    SourceSecurityGroupId: !Ref FsxnSvmSecurityGroupId
    Description: FPolicy TCP from FSxN SVM Security Group
```

This limits inbound traffic to only the FSxN SVM's security group rather than the entire VPC CIDR — a significant security improvement for production deployments.

ONTAP CLI: Deprecated vserver prefix

ONTAP 9.11+ deprecates the vserver prefix on FPolicy commands. We updated all templates and documentation (8 languages) to use the recommended format:

```bash
# Deprecated (still works for backward compatibility)
vserver fpolicy policy external-engine create -vserver FSxN_OnPre ...

# Recommended (ONTAP 9.11+)
fpolicy policy external-engine create -vserver FSxN_OnPre ...
```

KMS Decrypt: When it's needed (and when it's not)

Added documentation clarifying SQS encryption behavior:

  • SqsManagedSseEnabled: true → kms:Decrypt is NOT needed (transparent)
  • KmsMasterKeyId: alias/aws/sqs → kms:Decrypt IS needed

Our templates use SqsManagedSseEnabled: true, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.
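
This rule can be captured in a small helper operating on the attribute names that `sqs.get_queue_attributes` actually returns. The decision logic is a sketch of the rule above, not code from the templates:

```python
# Sketch: does a queue consumer's IAM policy need kms:Decrypt?
# Input is the dict shape returned by sqs.get_queue_attributes
# (attribute values are strings, e.g. "true").
def consumer_needs_kms_decrypt(queue_attributes: dict) -> bool:
    # SSE-SQS (SqsManagedSseEnabled=true): decryption is transparent.
    if queue_attributes.get("SqsManagedSseEnabled") == "true":
        return False
    # SSE-KMS (KmsMasterKeyId set): the consumer must be able to kms:Decrypt.
    return "KmsMasterKeyId" in queue_attributes

print(consumer_needs_kms_decrypt({"SqsManagedSseEnabled": "true"}))   # False
print(consumer_needs_kms_decrypt({"KmsMasterKeyId": "alias/aws/sqs"}))  # True
```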

EC2 AMI: Removed redundant Docker install

ECS-optimized AMIs ({{resolve:ssm:/aws/service/ecs/optimized-ami/...}}) already include Docker. Removed the unnecessary yum install -y docker from UserData scripts.

Cpu/Memory: String type is intentional

Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with AllowedValues provides better validation than Number type for this constrained parameter space.
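
The constrained parameter space is easy to see in a sketch. Only a subset of the documented Fargate CPU/memory combinations is shown, and this is an illustration rather than code from the templates:

```python
# Sketch: valid Fargate CPU/memory pairs form a discrete set, not a
# numeric range — which is why String + AllowedValues validates better
# than Number. Subset of the documented combinations only.
FARGATE_COMBOS = {
    "256":  {"512", "1024", "2048"},
    "512":  {str(m) for m in range(1024, 4097, 1024)},
    "1024": {str(m) for m in range(2048, 8193, 1024)},
}

def is_valid_combo(cpu: str, memory: str) -> bool:
    return memory in FARGATE_COMBOS.get(cpu, set())

print(is_valid_combo("256", "1024"))  # True
print(is_valid_combo("256", "4096"))  # False — rejected by AllowedValues
```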


13. What's Next — Phase 13 Outlook

Phase 12 completes the operational hardening layer. The pipeline now has the production hardening baseline:

  • ✅ Capacity guardrails preventing runaway auto-scaling
  • ✅ Automated secrets rotation on 90-day cycle
  • ✅ Proactive capacity forecasting with daily predictions
  • ✅ SLO-based observability with alarm-driven alerting
  • ✅ Data lineage tracking for audit and debugging
  • ✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios
  • ✅ Property-based testing catching real bugs

Ownership boundary

| Layer | Primary Owner | Examples |
| --- | --- | --- |
| Shared event platform | Platform / storage team | FPolicy server, SQS queue, EventBridge bus, Persistent Store |
| ONTAP operations | Storage team | SVM, volume, FPolicy policy, Persistent Store capacity |
| Security operations | Security / platform team | Secrets rotation, BREAK_GLASS approval, IAM policies |
| Workload UC | Application / data team | Step Functions, UC routing rules, output destinations |
| Observability | Platform + workload teams | SLO dashboard, UC-specific alarms, runbooks |

Production Readiness Matrix

| Capability | Phase 12 Status | Remaining Work |
| --- | --- | --- |
| Capacity Guardrails | Verified (DRY_RUN/ENFORCE/BREAK_GLASS) | Approval workflow optional |
| Secrets Rotation | 4-step rotation verified | Ensure all clients read from Secrets Manager |
| SLO Dashboard | Deployed, 4 alarms active | Runbooks and alarm response automation in Phase 13 |
| Persistent Store Replay | 5-event + 20-event scenarios verified | 1000+ replay storm testing |
| S3AP Monitoring | ONTAP health path verified | Split S3AP health check (VPC-external) |
| Protobuf Framing | Property/integration tested | Live ONTAP protobuf wire validation |
| Multi-account OAM | Stack deployed conditionally | Second-account validation |
| Production UC E2E | Pipeline verified to SQS delivery | Full TriggerMode=EVENT_DRIVEN UC flow |
| Cost Dashboard | Not yet deployed | Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation |

Phase 13 candidates

Operational readiness:

  1. Canary S3AP check separation: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)
  2. SLO violation runbooks: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)
  3. Replay storm testing: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior

Enterprise deployment:

  1. Multi-account OAM validation: Deploy workload-account-oam-link.yaml in a second AWS account
  2. Shared platform vs workload boundary: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)
  3. Production UC end-to-end: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify the complete flow from NFS file creation through Step Functions execution to output generation

Protocol and cost:

  1. Protobuf live wire validation: Confirm protobuf TCP framing with NetApp support and validate AUTO_DETECT mode against real ONTAP protobuf traffic
  2. Cost optimization dashboard: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics

Decision trees and operational guides:

  1. Decision trees: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)
  2. NetApp Partner Delivery Checklist: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover

Cost model awareness

While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:

| Category | Cost Type | Driver |
| --- | --- | --- |
| FPolicy server (Fargate/EC2) | Fixed baseline | Always-on listener |
| NAT Gateway | Fixed + per-GB | Required if VPC Lambda needs Internet-origin S3AP access |
| CloudWatch Synthetics | Per-canary-run | 5-minute interval = 8,640 runs/month |
| CloudWatch custom metrics + Logs | Per-metric + per-GB ingested | SLO metrics, FPolicy server logs |
| DynamoDB (lineage + guardrails) | Per-request (PAY_PER_REQUEST) | Event volume dependent |
| SQS / EventBridge | Per-message / per-event | Event volume dependent |
| Persistent Store volume | Per-GB provisioned | Sized for max queued events during downtime |
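
The Synthetics figure in the table is simple arithmetic, sketched below. The per-run price is deliberately omitted — check current CloudWatch Synthetics pricing for your region:

```python
# Sketch: canary run count behind "5-minute interval = 8,640 runs/month"
# (assumes a 30-day month).
def canary_runs_per_month(interval_minutes: int, days: int = 30) -> int:
    return (24 * 60 // interval_minutes) * days

print(canary_runs_per_month(5))   # 8640
print(canary_runs_per_month(15))  # 2880 — a cheaper cadence to consider
# monthly cost per canary ≈ runs × per-run price for your region
```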

Design decision for new deployments: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).

NetworkOrigin decision table

Based on AWS documentation, the following decision criteria apply:

Choose VPC-origin when:

  • All consumers are Lambda/ECS/EC2 inside the same VPC
  • Private connectivity is mandatory (no internet-routed path allowed)
  • VPC endpoint policy is part of the security boundary
  • Network restriction is built-in (cannot be accidentally misconfigured)

Choose Internet-origin when:

  • External accounts or on-premises clients need access
  • Consumers are outside the bound VPC
  • Internet-routed access with IAM controls is acceptable
  • Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC

| Factor | VPC-origin | Internet-origin |
| --- | --- | --- |
| Network enforcement | Built-in explicit Deny for non-VPC traffic | Policy-based only |
| VPC endpoint required | Yes (Gateway or Interface in bound VPC) | Only if using aws:SourceVpc conditions |
| Multi-VPC access | Via Interface endpoint + peering/TGW to bound VPC | Via policy conditions |
| Change access scope | Must recreate access point | Update policy |
| On-premises access | Via Interface endpoint in bound VPC | Direct with IAM credentials |
| Cost implication | VPC endpoint (Gateway=free, Interface=hourly) | NAT Gateway if VPC Lambda needs access |

Critical: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.
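
The decision criteria above can be condensed into a small planning helper. This is a deliberate simplification of the table — real designs may, for example, still serve on-premises clients from a VPC-origin AP via an Interface endpoint:

```python
# Sketch: NetworkOrigin planning aid (simplified from the decision table;
# not an AWS API call). The choice is immutable after creation.
def choose_network_origin(consumers_all_in_bound_vpc: bool,
                          external_or_onprem_access: bool,
                          private_only_required: bool) -> str:
    if external_or_onprem_access and private_only_required:
        # Resolvable in practice (e.g. Interface endpoint), but flag it
        # for explicit network design rather than guessing here.
        raise ValueError("Conflicting requirements: revisit network design")
    if consumers_all_in_bound_vpc or private_only_required:
        return "VPC"
    return "Internet"

print(choose_network_origin(True, False, False))   # VPC
print(choose_network_origin(False, True, False))   # Internet
```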

Phase 12 readiness by workload type

| Workload | Phase 12 Ready? | Notes |
| --- | --- | --- |
| Controlled PoC / single-account | ✅ Ready | All core components verified |
| Low/moderate event volume (< 100 events/day) | ✅ Ready | 20-event burst validated |
| DRY_RUN guardrail validation | ✅ Ready | Safe to deploy immediately |
| Secrets rotation validation | ✅ Ready | 4-step rotation verified |
| High-volume replay storm (1000+ events) | ⏳ Phase 13 | Throughput curve and store capacity not yet measured |
| Multi-account production | ⏳ Phase 13 | OAM link deployed but second-account validation pending |
| Strict SLO operations requiring runbooks | ⏳ Phase 13 | Dashboard deployed, runbooks not yet written |
| Live protobuf production mode | ⏳ Phase 13 | Wire validation with NetApp support pending |
| Full EVENT_DRIVEN UC end-to-end | ⏳ Phase 13 | Pipeline verified to SQS, Step Functions flow pending |

Phase 13 runbook scope: first-response diagnostic bundle

For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:

```bash
# FPolicy status
fpolicy show -vserver <SVM> -fields policy-name,status
fpolicy policy external-engine show -vserver <SVM>
fpolicy persistent-store show -vserver <SVM>

# Connection and event state
fpolicy show-engine -vserver <SVM>
fpolicy show-passthrough-read-connection -vserver <SVM>

# EMS logs for FPolicy events
event log show -messagename *fpolicy*
```

Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.


Deployed Infrastructure

7 CloudFormation stacks deployed and verified:

| Stack | Status | Purpose |
| --- | --- | --- |
| fsxn-phase12-guardrails-table | CREATE_COMPLETE | DynamoDB tracking table |
| fsxn-phase12-lineage-table | CREATE_COMPLETE | Data lineage DynamoDB + GSI |
| fsxn-phase12-slo-dashboard | CREATE_COMPLETE | CloudWatch dashboard + 4 alarms |
| fsxn-phase12-oam-link | CREATE_COMPLETE | Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13) |
| fsxn-phase12-capacity-forecast | CREATE_COMPLETE | Lambda + EventBridge schedule |
| fsxn-phase12-secrets-rotation | CREATE_COMPLETE | VPC Lambda + rotation config |
| fsxn-phase12-synthetic-monitoring | CREATE_COMPLETE | Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13 |

CloudFormation Stacks


Test Results Summary

| Category | Count | Type | Result |
| --- | --- | --- | --- |
| Unit Tests | 116 | Local (CI-reproducible) | ✅ All pass |
| Property Tests (Hypothesis) | 53 | Local (CI-reproducible) | ✅ All pass |
| CloudFormation Deployments | 7 stacks | AWS integration | ✅ All CREATE_COMPLETE |
| Lambda Invocations | 2 (forecast + rotation) | AWS integration | ✅ Successful |
| FPolicy E2E | 1 pipeline test | AWS manual verification | ✅ Event delivered |
| Replay E2E | 5 events | AWS manual verification | ✅ Zero loss |
| 20-file burst | 20 events | AWS manual verification | ✅ Zero loss |
| Bugs found (property testing) | 3 | Local (CI-reproducible) | ✅ All fixed |

NetApp-Specific Takeaways

For NetApp users and partners evaluating this pattern:

  • FPolicy Persistent Store works as the durability layer for asynchronous non-mandatory FPolicy policies (NetApp docs), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).
  • S3 Access Points for FSx for ONTAP are not standard S3 buckets: they support selected S3 API operations including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).
  • NetworkOrigin is a design-time decision. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.
  • ONTAP-common vs AWS-specific: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.
  • Operational readiness requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.

The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.


Conclusion

Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.

The property-based testing investment paid immediate dividends: 3 real bugs discovered in 53 tests that example-based testing missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.

With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.


Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10 · Phase 11
