DEV Community


Operational Hardening — Guardrails, Secrets Rotation & SLO — FSx ONTAP S3AP Phase 12

TL;DR

Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.

Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.

This is Phase 12 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10 and Phase 11, Phase 12 delivers:

  • Capacity Guardrails: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics
  • Secrets Rotation: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on 90-day interval
  • Synthetic Monitoring: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)
  • Capacity Forecasting: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule
  • Data Lineage Tracking: DynamoDB table with GSI for processing history and opt-in integration
  • Protobuf TCP Framing: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader
  • SLO Definition: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection
  • FPolicy Pipeline E2E: NFS file creation → FPolicy → SQS delivery confirmed
  • Persistent Store Replay: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios
  • Property-Based Testing: 16 Hypothesis properties, 53 tests, 3 bugs discovered
  • S3 Access Point Deep Dive: Multi-layer authorization, IAM ARN format, VPC network constraints

Key metrics: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS

The problem

FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.

The solution

A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:

graph LR
    A[Auto-Expand Request] --> B{GuardrailMode?}
    B -->|DRY_RUN| C[Log + Allow<br/>fail-open on DDB error]
    B -->|ENFORCE| D[Check + Block<br/>fail-closed on DDB error]
    B -->|BREAK_GLASS| E[Bypass All Checks<br/>SNS Alert + Audit Log]
    C --> F[DynamoDB Tracking]
    D --> F
    E --> F
    F --> G[CloudWatch EMF Metrics]
| Mode | Behavior on Check Failure | Behavior on DynamoDB Error |
| --- | --- | --- |
| DRY_RUN | Log warning, allow action | Fail-open (allow) |
| ENFORCE | Block action, emit metric | Fail-closed (deny) |
| BREAK_GLASS | Skip all checks | SNS alert + audit log |

Core implementation

from shared.guardrails import CapacityGuardrail, GuardrailMode

guardrail = CapacityGuardrail()  # Mode from GUARDRAIL_MODE env var

result = guardrail.check_and_execute(
    action_type="volume_grow",
    requested_gb=50.0,
    execute_fn=my_grow_function,
    volume_id="vol-abc123",
)

if result.allowed:
    print(f"Action executed: {result.action_id}")
else:
    print(f"Action denied: {result.reason}")
    # Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active

Three safety checks (ENFORCE mode)

  1. Rate limit: Max 10 actions per day per action type
  2. Daily cap: Max 500 GB cumulative expansion per day
  3. Cooldown: 300-second minimum interval between actions

All thresholds are configurable via environment variables (GUARDRAIL_RATE_LIMIT, GUARDRAIL_DAILY_CAP_GB, GUARDRAIL_COOLDOWN_SECONDS).

DynamoDB tracking schema

| Attribute | Type | Description |
| --- | --- | --- |
| pk | String | Action type (e.g., volume_grow) |
| sk | String | Date (YYYY-MM-DD) |
| daily_total_gb | Number | Cumulative GB expanded today |
| action_count | Number | Number of actions today |
| last_action_ts | String | ISO timestamp of last action |
| actions | List | Audit trail of all actions |
| ttl | Number | 30-day auto-expiry |

DynamoDB Guardrails Table

BREAK_GLASS production considerations

In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.


2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation

The problem

ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.

The solution

A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:

sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM->>Lambda: Step 1: createSecret
    Lambda->>SM: Generate new password, store as AWSPENDING

    SM->>Lambda: Step 2: setSecret
    Lambda->>ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP-->>Lambda: 200 OK

    SM->>Lambda: Step 3: testSecret
    Lambda->>ONTAP: GET /api/cluster (using new password)
    ONTAP-->>Lambda: 200 OK (cluster UUID returned)

    SM->>Lambda: Step 4: finishSecret
    Lambda->>SM: Promote AWSPENDING → AWSCURRENT

Key design decisions

  • VPC deployment: Lambda must be in the same VPC as the ONTAP management LIF
  • 90-day interval: Configurable via CloudFormation parameter
  • Validation: Step 3 (testSecret) verifies the new password works by calling the ONTAP cluster API
  • Rollback safety: If testSecret fails, the old password remains as AWSCURRENT

Bugs discovered during live testing

Three bugs were found and fixed during the actual rotation execution:

  1. AWSPENDING empty check: createSecret must handle the case where get_secret_value(VersionStage='AWSPENDING') raises ResourceNotFoundException
  2. management_ip fallback: The Lambda must support both management_ip (new) and ontap_mgmt_ip (legacy) keys in the secret JSON
  3. Cluster UUID validation: testSecret now validates the response contains a valid uuid field, not just HTTP 200

Verification result

Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT

Operational note

Rotating fsxadmin affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default — ensure the rotation Lambda's urllib3 or requests configuration handles certificate verification appropriately (see shared/ontap_client.py for the pattern used in this project).

For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing fsxadmin across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.


3. Synthetic Monitoring — CloudWatch Synthetics Canary

The problem

The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.

The solution

A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:

  1. ONTAP Health Check: REST API call to the management endpoint (VPC-internal)
  2. S3 Access Point Check: ListObjectsV2 against the S3AP alias

Critical finding: network-origin and endpoint configuration matter

During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.

This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS documents support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.

In this Phase 12 environment (Internet-origin S3 AP), the operational fix was to split monitoring into two paths:

| Check | Observed requirement in this environment | Result |
| --- | --- | --- |
| ONTAP REST API | VPC-internal access to management LIF | ✅ Works |
| S3AP health check | Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy | ⚠️ Timed out from the initial VPC Canary configuration |

Solution: Split into two monitoring paths:

  • ONTAP health: VPC-internal Canary (confirmed working, 88ms response)
  • S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)

This is documented as a critical constraint in docs/guides/s3ap-fsxn-specification.md.

Canary runtime version lesson

The template initially specified syn-python-selenium-3.0, which was deprecated on 2026-02-03. Updated to syn-python-selenium-11.0. CloudWatch Synthetics runtimes are deprecated frequently — parameterize the version or keep defaults current.

AWS builder lesson: VPC placement is a design choice

A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is connected to a VPC, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.

Synthetics Canary


4. Capacity Forecasting — Linear Regression with stdlib Only

The problem

Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.

The solution

A Lambda function running on a daily EventBridge schedule:

  1. Fetches 30 days of FSx StorageUsed metrics from CloudWatch
  2. Performs linear regression using only Python's math module (zero external dependencies)
  3. Publishes DaysUntilFull as a CloudWatch custom metric
  4. Sends SNS alert when forecast drops below threshold (default: 30 days)

Linear regression implementation (stdlib only)

def linear_regression(data_points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares linear regression using only math module."""
    n = len(data_points)
    if n < 2:
        raise ValueError("Need at least 2 data points for regression")

    sum_x = sum_y = sum_xy = sum_x2 = 0.0
    for x, y in data_points:
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x

    denominator = n * sum_x2 - sum_x * sum_x
    if abs(denominator) < 1e-10:
        return (0.0, sum_y / n)

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return (slope, intercept)
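With the fitted slope (GB/day) and intercept, the DaysUntilFull projection reduces to a few lines. A hypothetical helper mirroring the edge cases handled in this section (function name and signature are illustrative, not the actual Lambda code):

```python
def days_until_full(slope: float, intercept: float,
                    capacity_gb: float, today_x: float) -> int:
    """Project days until StorageUsed reaches capacity.

    slope/intercept come from the regression above; today_x is today's
    position on the regression's x axis (days since the first datapoint).
    """
    current = slope * today_x + intercept
    if current >= capacity_gb:
        return 0    # already at/over capacity: immediate alert
    if slope <= 0:
        return -1   # flat or shrinking usage: never fills up
    return int((capacity_gb - current) / slope)
```

The "< 2 data points" case is handled upstream, since linear_regression raises before a forecast is attempted.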

Edge cases handled

| Scenario | DaysUntilFull | Behavior |
| --- | --- | --- |
| < 2 data points | -1 | Insufficient data, no prediction |
| slope ≤ 0 (shrinking/flat) | -1 | Never fills up |
| Already over capacity | 0 | Immediate alert |
| Very low usage (0.03%) | 169,374 | Normal — far future prediction |

Live verification

{
  "days_until_full": 169374,
  "current_usage_pct": 0.03,
  "total_capacity_gb": 1024.0,
  "growth_rate_gb_per_day": 0.006,
  "forecast_date": "2490-02-06T06:26:42Z"
}

The test environment has 0.03% usage — the prediction of 169,374 days is correct behavior. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.

This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat DaysUntilFull as an early-warning signal, not an exact prediction.

Capacity Forecast Lambda


5. Data Lineage Tracking — DynamoDB with GSI

The problem

When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.

The solution

A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:

graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key<br/>SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index<br/>PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] -->|PK lookup| PK
    Q2[Query by UC + time range] -->|GSI query| GSI
    Q3[Query by execution ARN] -->|Scan + filter| PK

For high-volume environments, consider adding a dedicated GSI on step_functions_execution_arn. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.

Integration helper (opt-in)

from shared.lineage import LineageTracker, LineageRecord

tracker = LineageTracker()
record = LineageRecord(
    source_file_key="/vol1/legal/contracts/deal-001.pdf",
    processing_timestamp="2026-05-16T14:30:45.123Z",
    step_functions_execution_arn="arn:aws:states:...:execution:...",
    uc_id="legal-compliance",
    output_keys=["s3://output-bucket/legal/reports/deal-001-analysis.json"],
    status="success",
    duration_ms=4523,
)
lineage_id = tracker.record(record)

Design principles

  • Non-blocking: Write failures emit a warning log but never interrupt the main processing pipeline
  • TTL: 365-day auto-expiry via DynamoDB TTL (configurable via LINEAGE_TTL_DAYS environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)
  • Opt-in: UCs integrate by importing the helper — no mandatory coupling
  • PAY_PER_REQUEST: No capacity planning needed for variable workloads
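As an illustration of the "query by UC + time range" pattern, a hypothetical request builder for the GSI. The index and attribute names follow the diagram above; the resulting dict would be passed to a boto3 `Table.query(**params)` call, which is omitted here to keep the sketch self-contained:

```python
def build_uc_query(uc_id: str, start_ts: str, end_ts: str) -> dict:
    """Build query parameters for the uc_id-timestamp-index GSI.

    Uses the string form of KeyConditionExpression so the dict can be
    unpacked directly into boto3's Table.query(**params).
    """
    return {
        "IndexName": "uc_id-timestamp-index",
        "KeyConditionExpression": (
            "uc_id = :uc AND processing_timestamp BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":uc": uc_id,
            ":start": start_ts,
            ":end": end_ts,
        },
        "ScanIndexForward": False,  # newest records first
    }
```

Because processing_timestamp is the GSI sort key, the time-range filter is a key condition rather than a post-query filter — no scan required.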

Future: compliance-grade lineage (v2)

For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future LineageRecord v2:

| Field | Purpose |
| --- | --- |
| input_checksum | SHA-256 of source file for integrity verification |
| output_checksum | SHA-256 of generated output |
| fpolicy_sequence_number | ONTAP-assigned sequence for ordering |
| policy_version | FPolicy policy configuration version |
| uc_template_version | UC CloudFormation template version |
| guardrail_mode | Active guardrail mode at processing time |
| retention_profile | Retention class for compliance tiering |

For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.


6. Protobuf TCP Framing — Adaptive Reader

The problem

Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing read_fpolicy_message() assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.

The solution

An adaptive ProtobufFrameReader that supports three framing modes:

graph TD
    A[Incoming TCP Stream] --> B{FramingMode}
    B -->|AUTO_DETECT| C[Probe first 4 bytes]
    C -->|Valid uint32 length| D[LENGTH_PREFIXED]
    C -->|Otherwise| E[FRAMELESS]
    B -->|LENGTH_PREFIXED| D
    B -->|FRAMELESS| E
    D --> F[4-byte big-endian header → payload]
    E --> G[varint-delimited → payload]
    F --> H[Decoded Message]
    G --> H

Three modes

| Mode | Wire Format | Use Case |
| --- | --- | --- |
| LENGTH_PREFIXED | 4-byte big-endian length + payload | XML mode (legacy) |
| FRAMELESS | varint-delimited protobuf | Protobuf mode (ONTAP 9.15.1+) |
| AUTO_DETECT | Probe first bytes, then lock mode | Unknown/mixed environments |

Auto-detection heuristic

async def _auto_detect_and_read(self) -> bytes | None:
    """Probe first 4 bytes to determine framing mode."""
    peek = await self._reader.readexactly(4)
    candidate_length = struct.unpack("!I", peek)[0]

    if 0 < candidate_length <= self._max_message_size:
        # Valid length header → LENGTH_PREFIXED
        self._detected_mode = FramingMode.LENGTH_PREFIXED
        payload = await self._reader.readexactly(candidate_length)
        return payload
    else:
        # Not a valid length → FRAMELESS (varint-delimited)
        self._detected_mode = FramingMode.FRAMELESS
        self._buffer = peek
        return await self._read_varint_delimited()
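The FRAMELESS branch delegates to varint-delimited reading. For reference, a synchronous sketch of that framing — each payload is prefixed by its length encoded as a standard protobuf base-128 varint (helper names here are illustrative, not the ProtobufFrameReader internals):

```python
def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a protobuf base-128 varint; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift  # low 7 bits carry data
        pos += 1
        if not (b & 0x80):             # high bit clear = last byte
            return result, pos
        shift += 7

def split_varint_delimited(stream: bytes) -> list[bytes]:
    """Split a varint-delimited byte stream into its payloads."""
    out, pos = [], 0
    while pos < len(stream):
        length, pos = read_varint(stream, pos)
        out.append(stream[pos:pos + length])
        pos += length
    return out
```

A 300-byte message, for example, is prefixed with the two-byte varint `AC 02` — which is exactly why a fixed 4-byte big-endian read misparses protobuf-mode traffic.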

Safety features

  • Max message size enforcement (default 1 MB): Prevents DoS via malformed length headers
  • FramingError exception: Structured error with offset and raw data for debugging
  • Graceful EOF handling: Returns None on connection close without raising

Integration with existing FPolicy server

from shared.integrations.protobuf_integration import create_fpolicy_reader, read_fpolicy_message_v2

# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
reader = create_fpolicy_reader(stream)
message = await read_fpolicy_message_v2(reader or stream)

Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.

Phase 13 protobuf validation scope

The following questions will be confirmed with NetApp support during live wire validation:

  • Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)
  • Message boundary behavior under high throughput
  • Keep-alive behavior in protobuf mode vs XML mode
  • Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?
  • Mixed-mode migration path (XML → protobuf transition without event loss)
  • Maximum message size guidance from ONTAP side

7. SLO Definition — 4 Targets with CloudWatch Dashboard

The problem

Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.

The solution

Four SLO targets covering the critical path of the event-driven pipeline:

| SLO | Metric | Target | SLO met when |
| --- | --- | --- | --- |
| Event Ingestion Latency | EventIngestionLatency_ms | P99 < 5,000 ms | LessThanThreshold |
| Processing Success Rate | ProcessingSuccessRate_pct | > 99.5% | GreaterThanThreshold |
| Reconnect Time | FPolicyReconnectTime_sec | < 30 sec | LessThanThreshold |
| Replay Completion Time | ReplayCompletionTime_sec | < 300 sec (5 min) | LessThanThreshold |

For success rate, the CloudWatch Alarm fires when the metric drops below 99.5% (ComparisonOperator: LessThanThreshold), even though the SLO target is expressed as "> 99.5%".
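This direction flip can be captured explicitly. A small sketch assuming a higher_is_better flag per SLO (illustrative names, not the actual shared.slo API):

```python
def slo_met(value: float, threshold: float, higher_is_better: bool) -> bool:
    """SLO check: success rate wants value ABOVE threshold, latency BELOW."""
    return value > threshold if higher_is_better else value < threshold

def alarm_comparison_operator(higher_is_better: bool) -> str:
    """The alarm fires on VIOLATION, i.e. the opposite direction of the SLO."""
    return "LessThanThreshold" if higher_is_better else "GreaterThanThreshold"
```

Keeping the direction in one place avoids the classic mistake of copy-pasting LessThanThreshold into every alarm definition.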

CloudWatch Dashboard

The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):

from shared.slo import SLO_TARGETS, evaluate_slos, generate_dashboard_widgets

# Evaluate all SLOs programmatically
results = evaluate_slos(cloudwatch_client)
for r in results:
    status = "MET" if r.met else "VIOLATED"
    print(f"{r.slo_name}: {status} (value={r.value}, threshold={r.threshold})")

# Generate dashboard widget JSON for CloudFormation
widgets = generate_dashboard_widgets(region="ap-northeast-1")

Alarm-based violation detection

Each SLO has a corresponding CloudWatch Alarm:

| Alarm Name | State | Evaluation |
| --- | --- | --- |
| fsxn-s3ap-slo-ingestion-latency | OK | 3 consecutive periods |
| fsxn-s3ap-slo-success-rate | OK | 3 consecutive periods |
| fsxn-s3ap-slo-reconnect-time | OK | 3 consecutive periods |
| fsxn-s3ap-slo-replay-completion | OK | 3 consecutive periods |

All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.

SLO Dashboard


8. FPolicy Pipeline E2E Verification

The problem

Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.

The verification

sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS->>ONTAP: echo "test" > /mnt/fpolicy_vol/test.txt
    ONTAP->>FP: NOTI_REQ (FILE_CREATE event)
    FP->>FP: Parse event, extract metadata
    FP->>SQS: SendMessage (JSON payload)
    SQS-->>SQS: Message available for consumers

Timeline (actual observed)

| Time | Event | Detail |
| --- | --- | --- |
| T+0s | TCP connection test | ONTAP → Fargate IP (10.0.128.98:9898) |
| T+10s | Session established | NEGO_REQ → NEGO_RESP handshake |
| T+12s | KEEP_ALIVE starts | 2-minute interval |
| T+30s | NFS file created | echo "test" > /mnt/fpolicy_vol/test_fpolicy_event.txt |
| T+31s | NOTI_REQ received | FPolicy server receives file creation event |
| T+32s | SQS delivery | Event sent to SQS queue (FPolicy_Q) |

SQS message format

{
  "event_type": "FILE_CREATE",
  "svm_name": "FSxN_OnPre",
  "volume_name": "vol1",
  "file_path": "/vol1/test_fpolicy_event.txt",
  "client_ip": "10.0.128.98",
  "timestamp": "2026-05-16T08:45:32Z",
  "session_id": 1,
  "sequence_number": 1
}

IAM issue discovered and fixed

The ECS task role's SQS policy used a Resource ARN pattern arn:aws:sqs:...:fsxn-fpolicy-* that didn't match the actual queue name FPolicy_Q. Fix: use explicit ARN or * wildcard in the template.

Lesson: an IAM Resource pattern that doesn't match the actual queue name isn't caught at deploy time — the mismatch only surfaces when SendMessage is denied at runtime. Either parameterize the queue ARN or use a broader resource pattern.

Event contract assumptions

The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:

  • Duplicate events can occur (especially during Persistent Store replay)
  • Delivery order is not guaranteed (confirmed in Section 9)
  • Consumers must be idempotent
  • file_path + timestamp + sequence_number serves as an idempotency key candidate
  • Replay events may arrive after newer events
  • Schema versioning should be introduced before multi-UC production rollout
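A minimal consumer-side sketch of the idempotency-key approach (in-memory only; a production consumer would back the seen set with DynamoDB conditional writes so deduplication survives restarts):

```python
def idempotency_key(event: dict) -> str:
    """Build the dedup key from the candidate fields named above."""
    return f"{event['file_path']}|{event['timestamp']}|{event['sequence_number']}"

def process_once(event: dict, seen: set[str]) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    key = idempotency_key(event)
    if key in seen:
        return False   # duplicate (e.g., Persistent Store replay): skip
    seen.add(key)
    # ... hand off to the actual UC processing here ...
    return True
```

With this in place, replayed or duplicated deliveries become no-ops rather than double-processing.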

9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios

The problem

Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.

Important prerequisite: FPolicy Persistent Store is available for asynchronous non-mandatory policies only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have only one Persistent Store, and the same store can be used by multiple policies within that SVM.

The test procedure

  1. Stop Fargate task (ECS stop-task)
  2. Create 5 files via NFS during downtime (replay-test-1.txt through replay-test-5.txt)
  3. Wait for ECS service auto-recovery (new task launch)
  4. Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)
  5. Verify all 5 events arrive in SQS

Results

| Metric | Value |
| --- | --- |
| Events generated during downtime | 5 |
| Events replayed to SQS | 5 |
| Lost events | 0 |
| Replay delivery order | 3, 1, 2, 5, 4 (non-sequential) |
| Replay completion time | ~30 seconds |

Key observation: Out-of-order replay

Persistent Store replays events in a non-sequential order — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:

  • Idempotency: Deduplicate by file path + timestamp
  • Timestamp-based ordering: Sort by event timestamp, not arrival order

20-file burst validation

Additionally, a 20-file burst test confirmed zero event loss under higher load:

| Test | Files Created | Events Delivered | Loss |
| --- | --- | --- | --- |
| Replay (5 files) | 5 | 5 | 0 |
| Burst (20 files) | 20 | 20 | 0 |

Phase 13 replay storm metrics

The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:

| Metric | Purpose |
| --- | --- |
| Persistent Store volume usage before/after replay | Capacity planning for the store volume |
| Events queued vs events replayed | Completeness verification |
| Replay throughput (events/sec) | Performance baseline |
| Replay duration | SLO calibration |
| Out-of-order distance | Downstream buffer sizing |
| Duplicate events | Idempotency requirement validation |
| ONTAP EMS logs around disconnect/reconnect | Root cause correlation |

Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.

Operational framing: event durability as RPO/RTO

Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while ReplayCompletionTime_sec provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.

Phase 12 validation scope

| Scope | Phase 12 Assumption | Production Consideration |
| --- | --- | --- |
| SVM | Single SVM validation | Multi-SVM needs per-SVM policy and Persistent Store planning |
| Volume | Test volume | Production volumes should be grouped by UC/event profile |
| Protocol | NFS-based E2E test | NFSv3/NFSv4.1/SMB replay validation remains Phase 13 |
| Event types | File create | Modify/delete/rename validation remains Phase 13 |
| FPolicy mode | Async non-mandatory | Required for Persistent Store (NetApp docs) |

10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests

The problem

Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need exhaustive input space exploration.

The approach

Using Python's Hypothesis library, we defined 16 properties across the Phase 12 modules:

| Property Group | Properties | Tests | Bugs Found |
| --- | --- | --- | --- |
| Protobuf Frame Reader | 5 (round-trip, max size, EOF, multi-message, auto-detect) | 18 | 1 |
| Capacity Guardrails | 4 (mode behavior, rate limit, daily cap, cooldown) | 14 | 1 |
| Data Lineage | 3 (record/query round-trip, GSI consistency, TTL) | 9 | 0 |
| SLO Evaluation | 2 (threshold comparison, no-data handling) | 6 | 1 |
| Capacity Forecast | 2 (regression accuracy, edge cases) | 6 | 0 |
| Total | 16 | 53 | 3 |

Bugs discovered

  1. Protobuf reader: AUTO_DETECT mode failed when the first 4 bytes happened to form a valid-looking length that exceeded max_message_size. Fix: treat oversized candidate lengths as FRAMELESS indicator.

  2. Guardrails: BREAK_GLASS mode didn't emit the GuardrailBypass metric when DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.

  3. SLO evaluation: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), max(datapoints, key=lambda dp: dp["Timestamp"]) was non-deterministic. Fix: add secondary sort by value.
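The bug-3 fix amounts to a deterministic tie-break when selecting the latest datapoint — a secondary key on the value, sketched here:

```python
def latest_datapoint(datapoints: list[dict]) -> dict:
    """Pick the latest datapoint; ties on Timestamp break by Value,
    so identical-timestamp aggregates no longer yield a random pick."""
    return max(datapoints, key=lambda dp: (dp["Timestamp"], dp["Value"]))
```

Hypothesis surfaced this because it freely generates duplicate timestamps, a case example-based tests never covered.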

Example property test

@given(messages=st.lists(
    st.binary(min_size=1, max_size=1000),
    min_size=1, max_size=10,
))
@settings(max_examples=200)
def test_length_prefixed_round_trip(self, messages: list[bytes]):
    """Property: LENGTH_PREFIXED encode → decode preserves all messages."""
    stream_data = _make_length_prefixed_stream(messages)
    reader = _make_stream_reader(stream_data)
    frame_reader = ProtobufFrameReader(
        reader=reader,
        mode=FramingMode.LENGTH_PREFIXED,
        max_message_size=max(len(m) for m in messages) + 1,
    )

    decoded = []
    for _ in range(len(messages)):
        msg = asyncio.run(frame_reader.read_message())
        assert msg is not None
        decoded.append(msg)

    assert decoded == messages  # Round-trip property

11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints

The critical finding

FSx for ONTAP S3 Access Points are not standard S3 endpoints. They use the FSx data plane, which has different network routing characteristics than standard S3.

In this pattern library, FSx for ONTAP S3 Access Points serve as an AWS service integration boundary: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.

Multi-layer authorization model

graph TD
    Client[S3 API Client] --> IAM{Layer 1: IAM Policy}
    IAM -->|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP -->|resource policy| FS{Layer 3: File System Identity}
    FS -->|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.->|❌ Denied| Block1[Access Denied]
    AP -.->|❌ Denied| Block2[Access Denied]
    FS -.->|❌ No permission| Block3[Access Denied]

AWS documents this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.

Correct IAM ARN format

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap/object/*"
    }
  ]
}

Common mistake: Using the S3AP alias (xxx-ext-s3alias) as a bucket ARN. The alias is only valid as the Bucket parameter in boto3 calls — IAM policies require the full access point ARN.
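A small sketch of the distinction: the alias (or the full access point ARN) goes in the Bucket parameter of S3 API calls, while IAM policies must use the ARN format above. The alias value here is a placeholder, and the client is passed in so the sketch stays self-contained:

```python
# Placeholder alias; the real value comes from the access point description.
S3AP_ALIAS = "fsxn-eda-s3ap-xxxxxxxx-ext-s3alias"

def list_prefix(s3_client, prefix: str):
    """List objects under a prefix via the S3AP alias.

    s3_client is a boto3 S3 client in real use; the alias is valid here
    as the Bucket parameter, but NOT as an IAM Resource ARN.
    """
    return s3_client.list_objects_v2(Bucket=S3AP_ALIAS, Prefix=prefix)
```

Mixing the two up produces confusing AccessDenied errors: the API call resolves the alias fine, but the IAM policy written against the alias never matches.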

VPC network constraint (environment-specific observation)

Access Pattern Observed Result Notes
VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint) ⚠️ Timeout in this config Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment
Internet → S3 AP (NetworkOrigin=Internet) Routes correctly with valid IAM credentials
VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC) Supported per AWS docs; not verified in Phase 12 Requires VPC-origin AP and matching endpoint policy
VPC Lambda → ONTAP REST API Direct management LIF access

Important: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS documents that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.

Architectural implication for this pattern: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:

  • Run outside VPC (with Internet access)
  • Use NAT Gateway for outbound routing
  • Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions

Write support and practical constraints

FSx ONTAP S3 Access Points support PutObject, DeleteObject, multipart uploads (CreateMultipartUpload, UploadPart, CompleteMultipartUpload), and other write operations — they are not read-only. The access point compatibility table documents the full list of supported S3 API operations.

However, S3 Access Points are not full S3 buckets. Key constraints include:

  • Maximum upload size: 5 GB
  • Only FSX_ONTAP storage class
  • Only SSE-FSX encryption
  • No ACLs (except bucket-owner-full-control), no Object Versioning, no Object Lock, no presigned URLs

All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.
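
As a minimal illustration of the write path, the following sketch picks an upload mode against the 5 GB ceiling and shows the boto3 call shape in comments. The ARN and key are placeholders, and the exact boundary (GB vs GiB) should be confirmed against the compatibility table:

```python
# Sketch: single PutObject vs multipart upload against the 5 GB ceiling
# noted above. Whether the limit is decimal GB or GiB should be confirmed
# in the AWS compatibility documentation; GiB is assumed here.
MAX_SINGLE_PUT = 5 * 1024**3  # assumed 5 GiB single-upload ceiling

def choose_upload_mode(size_bytes: int) -> str:
    """Multipart is required above the single-upload ceiling."""
    return "multipart" if size_bytes > MAX_SINGLE_PUT else "put_object"

# The access point ARN (or alias) is passed as Bucket:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(
#       Bucket="arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap",
#       Key="ingest/sample.bin",
#       Body=b"...")

print(choose_upload_mode(6 * 1024**3))
```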


12. Cross-Project Feedback — Template Hardening

During Phase 12, the companion project fsxn-observability-integrations reviewed our CloudFormation templates and provided actionable feedback. All items were applied:

Security Group: SourceSecurityGroupId over CIDR

Before (broad):

```yaml
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 9898
    ToPort: 9898
    CidrIp: "10.0.0.0/8"
```

After (precise):

```yaml
SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: !Ref FPolicyPort
    ToPort: !Ref FPolicyPort
    SourceSecurityGroupId: !Ref FsxnSvmSecurityGroupId
    Description: FPolicy TCP from FSxN SVM Security Group
```

This limits inbound traffic to only the FSxN SVM's security group rather than the entire VPC CIDR — a significant security improvement for production deployments.

ONTAP CLI: Deprecated vserver prefix

ONTAP 9.11+ deprecates the vserver prefix on FPolicy commands. We updated all templates and documentation (8 languages) to use the recommended format:

```bash
# Deprecated (still works for backward compatibility)
vserver fpolicy policy external-engine create -vserver FSxN_OnPre ...

# Recommended (ONTAP 9.11+)
fpolicy policy external-engine create -vserver FSxN_OnPre ...
```

KMS Decrypt: When it's needed (and when it's not)

Added documentation clarifying SQS encryption behavior:

  • SqsManagedSseEnabled: true → kms:Decrypt is NOT needed (transparent)
  • KmsMasterKeyId: alias/aws/sqs → kms:Decrypt IS needed

Our templates use SqsManagedSseEnabled: true, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.
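
This rule can be captured in a small helper operating on the attribute names that `sqs.get_queue_attributes` actually returns. The decision logic is a sketch of the rule above, not code from the templates:

```python
# Sketch: does a queue consumer's IAM policy need kms:Decrypt?
# Input is the dict shape returned by sqs.get_queue_attributes
# (attribute values are strings, e.g. "true").
def consumer_needs_kms_decrypt(queue_attributes: dict) -> bool:
    # SSE-SQS (SqsManagedSseEnabled=true): decryption is transparent.
    if queue_attributes.get("SqsManagedSseEnabled") == "true":
        return False
    # SSE-KMS (KmsMasterKeyId set): the consumer must be able to kms:Decrypt.
    return "KmsMasterKeyId" in queue_attributes

print(consumer_needs_kms_decrypt({"SqsManagedSseEnabled": "true"}))   # False
print(consumer_needs_kms_decrypt({"KmsMasterKeyId": "alias/aws/sqs"}))  # True
```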

EC2 AMI: Removed redundant Docker install

ECS-optimized AMIs ({{resolve:ssm:/aws/service/ecs/optimized-ami/...}}) already include Docker. Removed the unnecessary yum install -y docker from UserData scripts.

Cpu/Memory: String type is intentional

Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with AllowedValues provides better validation than Number type for this constrained parameter space.
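
The constrained parameter space is easy to see in a sketch. Only a subset of the documented Fargate CPU/memory combinations is shown, and this is an illustration rather than code from the templates:

```python
# Sketch: valid Fargate CPU/memory pairs form a discrete set, not a
# numeric range — which is why String + AllowedValues validates better
# than Number. Subset of the documented combinations only.
FARGATE_COMBOS = {
    "256":  {"512", "1024", "2048"},
    "512":  {str(m) for m in range(1024, 4097, 1024)},
    "1024": {str(m) for m in range(2048, 8193, 1024)},
}

def is_valid_combo(cpu: str, memory: str) -> bool:
    return memory in FARGATE_COMBOS.get(cpu, set())

print(is_valid_combo("256", "1024"))  # True
print(is_valid_combo("256", "4096"))  # False — rejected by AllowedValues
```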


13. What's Next — Phase 13 Outlook

Phase 12 completes the operational hardening layer. The pipeline now has the production hardening baseline:

  • ✅ Capacity guardrails preventing runaway auto-scaling
  • ✅ Automated secrets rotation on 90-day cycle
  • ✅ Proactive capacity forecasting with daily predictions
  • ✅ SLO-based observability with alarm-driven alerting
  • ✅ Data lineage tracking for audit and debugging
  • ✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios
  • ✅ Property-based testing catching real bugs

Ownership boundary

| Layer | Primary Owner | Examples |
| --- | --- | --- |
| Shared event platform | Platform / storage team | FPolicy server, SQS queue, EventBridge bus, Persistent Store |
| ONTAP operations | Storage team | SVM, volume, FPolicy policy, Persistent Store capacity |
| Security operations | Security / platform team | Secrets rotation, BREAK_GLASS approval, IAM policies |
| Workload UC | Application / data team | Step Functions, UC routing rules, output destinations |
| Observability | Platform + workload teams | SLO dashboard, UC-specific alarms, runbooks |

Production Readiness Matrix

| Capability | Phase 12 Status | Remaining Work |
| --- | --- | --- |
| Capacity Guardrails | Verified (DRY_RUN/ENFORCE/BREAK_GLASS) | Approval workflow optional |
| Secrets Rotation | 4-step rotation verified | Ensure all clients read from Secrets Manager |
| SLO Dashboard | Deployed, 4 alarms active | Runbooks and alarm response automation in Phase 13 |
| Persistent Store Replay | 5-event + 20-event scenarios verified | 1000+ replay storm testing |
| S3AP Monitoring | ONTAP health path verified | Split S3AP health check (VPC-external) |
| Protobuf Framing | Property/integration tested | Live ONTAP protobuf wire validation |
| Multi-account OAM | Stack deployed conditionally | Second-account validation |
| Production UC E2E | Pipeline verified to SQS delivery | Full TriggerMode=EVENT_DRIVEN UC flow |
| Cost Dashboard | Not yet deployed | Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation |

Phase 13 candidates

Operational readiness:

  1. Canary S3AP check separation: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)
  2. SLO violation runbooks: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)
  3. Replay storm testing: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior

Enterprise deployment:

  1. Multi-account OAM validation: Deploy workload-account-oam-link.yaml in a second AWS account
  2. Shared platform vs workload boundary: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)
  3. Production UC end-to-end: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify the complete flow from NFS file creation through Step Functions execution to output generation

Protocol and cost:

  1. Protobuf live wire validation: Confirm protobuf TCP framing with NetApp support and validate AUTO_DETECT mode against real ONTAP protobuf traffic
  2. Cost optimization dashboard: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics

Decision trees and operational guides:

  1. Decision trees: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)
  2. NetApp Partner Delivery Checklist: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover

Cost model awareness

While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:

| Category | Cost Type | Driver |
| --- | --- | --- |
| FPolicy server (Fargate/EC2) | Fixed baseline | Always-on listener |
| NAT Gateway | Fixed + per-GB | Required if VPC Lambda needs Internet-origin S3AP access |
| CloudWatch Synthetics | Per-canary-run | 5-minute interval = 8,640 runs/month |
| CloudWatch custom metrics + Logs | Per-metric + per-GB ingested | SLO metrics, FPolicy server logs |
| DynamoDB (lineage + guardrails) | Per-request (PAY_PER_REQUEST) | Event volume dependent |
| SQS / EventBridge | Per-message / per-event | Event volume dependent |
| Persistent Store volume | Per-GB provisioned | Sized for max queued events during downtime |
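
The Synthetics figure in the table is simple arithmetic, sketched below. The per-run price is deliberately omitted — check current CloudWatch Synthetics pricing for your region:

```python
# Sketch: canary run count behind "5-minute interval = 8,640 runs/month"
# (assumes a 30-day month).
def canary_runs_per_month(interval_minutes: int, days: int = 30) -> int:
    return (24 * 60 // interval_minutes) * days

print(canary_runs_per_month(5))   # 8640
print(canary_runs_per_month(15))  # 2880 — a cheaper cadence to consider
# monthly cost per canary ≈ runs × per-run price for your region
```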

Design decision for new deployments: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).

NetworkOrigin decision table

Based on AWS documentation, the following decision criteria apply:

Choose VPC-origin when:

  • All consumers are Lambda/ECS/EC2 inside the same VPC
  • Private connectivity is mandatory (no internet-routed path allowed)
  • VPC endpoint policy is part of the security boundary
  • Network restriction is built-in (cannot be accidentally misconfigured)

Choose Internet-origin when:

  • External accounts or on-premises clients need access
  • Consumers are outside the bound VPC
  • Internet-routed access with IAM controls is acceptable
  • Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC

| Factor | VPC-origin | Internet-origin |
| --- | --- | --- |
| Network enforcement | Built-in explicit Deny for non-VPC traffic | Policy-based only |
| VPC endpoint required | Yes (Gateway or Interface in bound VPC) | Only if using aws:SourceVpc conditions |
| Multi-VPC access | Via Interface endpoint + peering/TGW to bound VPC | Via policy conditions |
| Change access scope | Must recreate access point | Update policy |
| On-premises access | Via Interface endpoint in bound VPC | Direct with IAM credentials |
| Cost implication | VPC endpoint (Gateway=free, Interface=hourly) | NAT Gateway if VPC Lambda needs access |

Critical: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.
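
The decision criteria above can be condensed into a small planning helper. This is a deliberate simplification of the table — real designs may, for example, still serve on-premises clients from a VPC-origin AP via an Interface endpoint:

```python
# Sketch: NetworkOrigin planning aid (simplified from the decision table;
# not an AWS API call). The choice is immutable after creation.
def choose_network_origin(consumers_all_in_bound_vpc: bool,
                          external_or_onprem_access: bool,
                          private_only_required: bool) -> str:
    if external_or_onprem_access and private_only_required:
        # Resolvable in practice (e.g. Interface endpoint), but flag it
        # for explicit network design rather than guessing here.
        raise ValueError("Conflicting requirements: revisit network design")
    if consumers_all_in_bound_vpc or private_only_required:
        return "VPC"
    return "Internet"

print(choose_network_origin(True, False, False))   # VPC
print(choose_network_origin(False, True, False))   # Internet
```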

Phase 12 readiness by workload type

| Workload | Phase 12 Ready? | Notes |
| --- | --- | --- |
| Controlled PoC / single-account | ✅ Ready | All core components verified |
| Low/moderate event volume (< 100 events/day) | ✅ Ready | 20-event burst validated |
| DRY_RUN guardrail validation | ✅ Ready | Safe to deploy immediately |
| Secrets rotation validation | ✅ Ready | 4-step rotation verified |
| High-volume replay storm (1000+ events) | ⏳ Phase 13 | Throughput curve and store capacity not yet measured |
| Multi-account production | ⏳ Phase 13 | OAM link deployed but second-account validation pending |
| Strict SLO operations requiring runbooks | ⏳ Phase 13 | Dashboard deployed, runbooks not yet written |
| Live protobuf production mode | ⏳ Phase 13 | Wire validation with NetApp support pending |
| Full EVENT_DRIVEN UC end-to-end | ⏳ Phase 13 | Pipeline verified to SQS, Step Functions flow pending |

Phase 13 runbook scope: first-response diagnostic bundle

For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:

```bash
# FPolicy status
fpolicy show -vserver <SVM> -fields policy-name,status
fpolicy policy external-engine show -vserver <SVM>
fpolicy persistent-store show -vserver <SVM>

# Connection and event state
fpolicy show-engine -vserver <SVM>
fpolicy show-passthrough-read-connection -vserver <SVM>

# EMS logs for FPolicy events
event log show -messagename *fpolicy*
```

Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.


Deployed Infrastructure

7 CloudFormation stacks deployed and verified:

| Stack | Status | Purpose |
| --- | --- | --- |
| fsxn-phase12-guardrails-table | CREATE_COMPLETE | DynamoDB tracking table |
| fsxn-phase12-lineage-table | CREATE_COMPLETE | Data lineage DynamoDB + GSI |
| fsxn-phase12-slo-dashboard | CREATE_COMPLETE | CloudWatch dashboard + 4 alarms |
| fsxn-phase12-oam-link | CREATE_COMPLETE | Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13) |
| fsxn-phase12-capacity-forecast | CREATE_COMPLETE | Lambda + EventBridge schedule |
| fsxn-phase12-secrets-rotation | CREATE_COMPLETE | VPC Lambda + rotation config |
| fsxn-phase12-synthetic-monitoring | CREATE_COMPLETE | Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13 |

CloudFormation Stacks


Test Results Summary

| Category | Count | Type | Result |
| --- | --- | --- | --- |
| Unit Tests | 116 | Local (CI-reproducible) | ✅ All pass |
| Property Tests (Hypothesis) | 53 | Local (CI-reproducible) | ✅ All pass |
| CloudFormation Deployments | 7 stacks | AWS integration | ✅ All CREATE_COMPLETE |
| Lambda Invocations | 2 (forecast + rotation) | AWS integration | ✅ Successful |
| FPolicy E2E | 1 pipeline test | AWS manual verification | ✅ Event delivered |
| Replay E2E | 5 events | AWS manual verification | ✅ Zero loss |
| 20-file burst | 20 events | AWS manual verification | ✅ Zero loss |
| Bugs found (property testing) | 3 | Local (CI-reproducible) | ✅ All fixed |

NetApp-Specific Takeaways

For NetApp users and partners evaluating this pattern:

  • FPolicy Persistent Store works as the durability layer for asynchronous non-mandatory FPolicy policies (NetApp docs), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).
  • S3 Access Points for FSx for ONTAP are not standard S3 buckets: they support selected S3 API operations including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).
  • NetworkOrigin is a design-time decision. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.
  • ONTAP-common vs AWS-specific: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.
  • Operational readiness requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.

The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.


Conclusion

Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.

The property-based testing investment paid immediate dividends: 3 real bugs discovered in 53 tests that example-based testing missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.

With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.


Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10 · Phase 11
