DEV Community: Yoshiki Fujiwara(藤原善基)@AWS Community Builder

Query NAS Data In Place with Athena and FSx for ONTAP S3 Access Points

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Fri, 22 May 2026 08:28:29 +0000

TL;DR

You can query files stored on Amazon FSx for NetApp ONTAP directly from Amazon Athena through an FSx-attached S3 Access Point — without copying the source data to an S3 bucket. The source files remain on the FSx for ONTAP volume and are accessed through S3 object APIs.

I verified this end-to-end: Parquet files written via NFS are immediately queryable from Athena using the official AWS tutorial pattern.

This is Part 1 of a series exploring how FSx for ONTAP S3 Access Points integrate with various Lakehouse platforms. Part 2 covers Databricks — where platform security boundaries make things significantly more complex.

GitHub Repository: fsxn-lakehouse-integrations

If you want to reproduce this validation, start from the repository's integrations/athena/ directory, which contains CloudFormation templates, sample data generators, and query scripts.

What Is Verified in This Article

Verified:

NFS-written Parquet file is visible via FSx S3 AP (ListObjectsV2, StorageClass: FSX_ONTAP)
Athena can query the file through Glue Data Catalog
Standard S3 bucket result location works as the documented pattern
Experimental FSx S3 AP result output worked in my environment

Not verified:

Delta / Hudi / Iceberg writes
CTAS to FSx S3 AP as a production pattern (the one CTAS attempt in this validation failed; see Compatibility Matrix)
S3 bucket event notification semantics
Large-scale performance limits
CloudTrail data event coverage (audit evidence approach should be validated per environment)

Why This Matters

Enterprise file servers hold massive amounts of data — design files, inspection images, research documents, log archives. Traditionally, to analyze this data with cloud-native tools like Athena, you had to:

Copy data from NFS/SMB to S3 (DataSync, scripts, etc.)
Maintain sync pipelines
Pay for duplicate storage
Deal with stale data

FSx for ONTAP S3 Access Points (launched December 2025) change this. The same volume that serves NFS/SMB clients now exposes an S3-compatible API. Athena queries hit the same bytes that your NFS clients read — no copy required for the source dataset.

Users (NFS/SMB)                    Athena (S3 API)
      │                                  │
      ▼                                  ▼
┌─────────────────────────────────────────────┐
│         FSx for ONTAP Volume                │
│         /analytics/sensor_data.parquet      │
│         /analytics/logs/*.json              │
└─────────────────────────────────────────────┘

Use Cases This Unlocks

This pattern is useful when enterprise data already lives on NFS/SMB file shares and analytics teams want to query it without building a copy pipeline to S3.

Examples:

Manufacturing: Sensor logs, inspection results, quality reports produced by factory systems
SAP / ERP: Batch export files, operational reports, reconciliation extracts, and analytics copies — not direct replacement for application-native persistence or HA design
Financial services: Reconciliation files, transaction logs, regulatory extracts
Healthcare research: De-identified datasets, imaging metadata, study outputs
EDA / Semiconductor: Design artifacts, simulation outputs, verification logs
Enterprise file services: Archives for compliance analysis, audit evidence

Mission-critical workload note
This pattern provides an analytics read-access layer for existing file data. It does not replace workload-specific HA, backup, Snapshot, SnapMirror, or DR designs. For SAP, databases, VDI, and enterprise file services, treat Athena-on-FSx as an analytics and evidence layer, not as the primary resilience architecture.

Workload Isolation Guidance

For mission-critical workloads, do not point exploratory analytics directly at the same directory used by latency-sensitive application writes unless the operational impact has been tested.

Recommended pattern:

Application-owned path: /prod/app-output/
Analytics landing path: /analytics/curated/
Athena query result path: Standard S3 bucket (conservative), or a separately validated output path
Snapshot / backup policy: Owned by the workload team
Glue/Athena access: Owned by the analytics platform team

For SAP, database exports, or ERP file drops, treat this pattern as a read-access analytics layer. Do not change application HA, backup, restore, or DR design just because the files are queryable through S3 APIs.

In this context, an analytics copy means an application-produced or batch-exported file that is safe for downstream analytics, not the primary application persistence path.

Operational Impact Validation

Before production use, validate operational impact:

Baseline NFS/SMB workload latency and throughput before enabling analytics queries
Athena query behavior during normal application write activity
FSx provisioned throughput utilization during scans (analytics and application workloads share the same backend throughput)
Query concurrency limits for the analytics team
Rollback plan if analytics workload affects application workload

Recommended metrics include FSx throughput utilization, client-side NFS/SMB latency, Athena query runtime, bytes scanned, and application-side error or timeout rates during query execution.

Rollback plan examples include disabling the Athena workgroup, revoking the S3 Access Point policy for analytics roles, reducing analytics query concurrency, or moving analytics to an isolated curated path.

What This Means for Production

For production, treat this as a shared-storage analytics access pattern. The value is eliminating source data copy; the responsibility is validating workload isolation, throughput impact, governance, and rollback.

This article is not a production certification. It is intended to start a production readiness discussion around workload isolation, governance, and rollback.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  AWS Account                                                    │
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌────────────────┐   │
│  │ FSx for ONTAP│     │ S3 Access    │     │ Athena         │   │
│  │ Volume       │◄────│ Point        │◄────│ (Serverless)   │   │
│  │              │     │ (Internet    │     │                │   │
│  │ /analytics/  │     │  origin)     │     │ SELECT ...     │   │
│  └──────────────┘     └──────────────┘     │ FROM table     │   │
│        ▲                     ▲             └────────────────┘   │
│        │                     │                      │           │
│   NFS/SMB clients      Glue Crawler          Query results      │
│   (write data)         (schema discovery)    (→ S3 bucket)      │
└─────────────────────────────────────────────────────────────────┘

Key points:

The access point must use Internet network origin. Athena accesses S3 from AWS-managed infrastructure outside your VPC, so the AWS tutorial requires Internet network origin for this path; VPC-origin access points deny requests from Athena.
Glue Data Catalog provides the schema layer between Athena and the S3 AP.
Query results are written to an S3 bucket (the standard Athena pattern), not back to the FSx volume. See Observed Behavior for an experimental alternative.

Prerequisites

FSx for ONTAP file system (ONTAP 9.17.1+)
A volume with data (Parquet, CSV, JSON, etc.)
S3 Access Point created with Internet network origin
An Athena workgroup with a query results location (standard S3 bucket)
IAM permissions for Athena, Glue, and S3 AP access

Step 1: Create the S3 Access Point

aws fsx create-and-attach-s3-access-point \
  --name my-analytics-ap \
  --type ONTAP \
  --ontap-configuration '{
    "VolumeId": "<YOUR_VOLUME_ID>",
    "FileSystemIdentity": {
      "Type": "UNIX",
      "UnixUser": {"Name": "fsxn_athena_reader"}
    }
  }' \
  --region <YOUR_REGION>

Wait for the lifecycle to become AVAILABLE:

aws fsx describe-s3-access-point-attachments \
  --filters Name=volume-id,Values=<YOUR_VOLUME_ID> \
  --region <YOUR_REGION> \
  --query 'S3AccessPointAttachments[].{Name:Name,Lifecycle:Lifecycle,Alias:S3AccessPoint.Alias}'

Output:

[{
  "Name": "my-analytics-ap",
  "Lifecycle": "AVAILABLE",
  "Alias": "my-analytics-ap-xxxxxxxxxxxxxxxxxxxxxxxxxxxx-ext-s3alias"
}]

Note: The alias ending in -ext-s3alias identifies this as an FSx for ONTAP S3 Access Point (as opposed to regular S3 Access Points which end in -s3alias).
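If you script the wait, the check on the CLI output can be sketched as below. This is a minimal sketch, assuming the JSON shape produced by the --query expression above; available_fsx_alias is a hypothetical helper name, not part of the AWS CLI or the repository.

```python
import json
from typing import Optional

FSX_ALIAS_SUFFIX = "-ext-s3alias"  # suffix noted above for FSx for ONTAP access points


def available_fsx_alias(cli_output: str, name: str) -> Optional[str]:
    """Return the alias of the named attachment once it is AVAILABLE, else None."""
    for attachment in json.loads(cli_output):
        if attachment.get("Name") != name:
            continue
        alias = attachment.get("Alias") or ""
        if attachment.get("Lifecycle") == "AVAILABLE" and alias.endswith(FSX_ALIAS_SUFFIX):
            return alias
    return None


# Sample taken from the output shown above
sample = """[{
  "Name": "my-analytics-ap",
  "Lifecycle": "AVAILABLE",
  "Alias": "my-analytics-ap-xxxxxxxxxxxxxxxxxxxxxxxxxxxx-ext-s3alias"
}]"""
print(available_fsx_alias(sample, "my-analytics-ap"))
```

Returning None for any lifecycle other than AVAILABLE lets a caller poll in a loop without special-casing intermediate states.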

Security note for file-system identity
This walkthrough uses a dedicated read-only identity (fsxn_athena_reader). Make sure the corresponding UNIX/Windows permissions allow read access to the analytics path. Avoid using root in production — scope the identity to the minimum permissions required.

Step 2: Set the Access Point Policy

This walkthrough uses role-based principals for Athena and Glue. Replace the placeholder role ARNs with the IAM roles used by your Athena workgroup and Glue crawler. Avoid account-wide principals in production.

aws s3control put-access-point-policy \
  --account-id <YOUR_ACCOUNT_ID> \
  --name my-analytics-ap \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowAnalyticsRead",
      "Effect": "Allow",
      "Principal": {"AWS": [
        "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<ATHENA_QUERY_ROLE>",
        "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<GLUE_CRAWLER_ROLE>"
      ]},
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap",
        "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap/object/*"
      ]
    }]
  }' \
  --region <YOUR_REGION>

The policy above is the conservative read-only analytics policy. If you intentionally test query result output to the FSx S3 Access Point (see Observed Behavior), add s3:PutObject scoped to the experimental output prefix only:

{
  "Sid": "AllowExperimentalResultWrite",
  "Effect": "Allow",
  "Principal": {"AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<ATHENA_QUERY_ROLE>"},
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap/object/athena-results/*"
}

Security note: FSx for ONTAP S3 Access Points enforce S3 Block Public Access by default — this cannot be disabled. All requests require valid IAM credentials. Additionally, the file system user associated with the access point must have read permission on the files being queried.

Policy note: The policy above is the minimum that worked in my validation. If your Glue crawler or Athena workgroup reports location-related access errors, compare the policy with the official tutorial and CloudTrail events, and add only the required actions.
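For teams that template the policy rather than hand-editing JSON, the read-only statement above can be generated from parameters. The following is a sketch only: build_read_policy is a hypothetical helper, the account ID and role name are placeholders, and the guard against account-wide principals is my assumption about how to encode the advice given at the start of this step.

```python
import json


def build_read_policy(region, account_id, ap_name, role_arns):
    """Build the read-only access point policy for the given IAM role ARNs."""
    for arn in role_arns:
        # Reject wildcard and account-root principals; only named roles are allowed.
        if arn == "*" or arn.endswith(":root") or ":role/" not in arn:
            raise ValueError(f"expected an IAM role ARN, got: {arn}")
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowAnalyticsRead",
            "Effect": "Allow",
            "Principal": {"AWS": list(role_arns)},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            # The access point itself (for listing) and the objects under it (for reads)
            "Resource": [ap_arn, f"{ap_arn}/object/*"],
        }],
    }


policy = build_read_policy(
    "ap-northeast-1", "111122223333", "my-analytics-ap",
    ["arn:aws:iam::111122223333:role/athena-query-role"],  # placeholder role
)
print(json.dumps(policy, indent=2))  # JSON string for --policy
```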

Step 3: Upload Test Data via NFS

On a machine with NFS access to the FSx volume:

import pandas as pd
import numpy as np

# Generate 10,000 rows of sensor data
np.random.seed(42)
n_rows = 10000
df = pd.DataFrame({
    'timestamp': pd.date_range('2026-01-01', periods=n_rows, freq='1min'),
    'sensor_id': np.random.choice(['sensor_A', 'sensor_B', 'sensor_C',
                                    'sensor_D', 'sensor_E'], n_rows),
    'temperature': np.round(np.random.normal(25, 5, n_rows), 2),
    'humidity': np.round(np.random.uniform(30, 90, n_rows), 2),
    'pressure': np.round(np.random.normal(1013, 10, n_rows), 2),
    'status': np.random.choice(['normal', 'warning', 'critical'], n_rows,
                                p=[0.85, 0.12, 0.03])
})

# Write as Parquet to the NFS-mounted volume
df.to_parquet('/mnt/fsxn/analytics/sensor-data/sensor_data.parquet', index=False)
# Report the in-memory DataFrame size; the Parquet file on disk is smaller
print(f"Wrote {len(df)} rows ({df.memory_usage(deep=True).sum() / 1024:.0f} KB in memory)")

The same file is now accessible via both NFS (/mnt/fsxn/analytics/sensor-data/sensor_data.parquet) and S3 API (s3://<AP_ALIAS>/sensor-data/sensor_data.parquet).
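The mapping between the two paths can be written down explicitly. This sketch assumes the volume root is mounted at /mnt/fsxn/analytics, which is what the two paths above imply; nfs_path_to_s3_uri is a hypothetical helper and the alias in the example is a placeholder.

```python
from pathlib import PurePosixPath


def nfs_path_to_s3_uri(nfs_path, mount_point, ap_alias):
    """Translate a path under the NFS mount point into the matching S3 URI."""
    # relative_to raises ValueError if nfs_path is not under mount_point
    relative = PurePosixPath(nfs_path).relative_to(PurePosixPath(mount_point))
    return f"s3://{ap_alias}/{relative.as_posix()}"


print(nfs_path_to_s3_uri(
    "/mnt/fsxn/analytics/sensor-data/sensor_data.parquet",
    "/mnt/fsxn/analytics",   # assumed mount point of the volume root
    "my-ap-alias",           # placeholder for the real access point alias
))
```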

Step 4: Verify S3 AP Access

# AP_ALIAS holds the access point alias returned in Step 1 (ending in -ext-s3alias)
aws s3api list-objects-v2 \
  --bucket "$AP_ALIAS" \
  --prefix "sensor-data/" \
  --region <YOUR_REGION>

Output:

{
  "Contents": [{
    "Key": "sensor-data/sensor_data.parquet",
    "Size": 252858,
    "StorageClass": "FSX_ONTAP"
  }]
}

Note the StorageClass: FSX_ONTAP — this confirms the data lives on FSx, not S3.
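In an automated check, the same confirmation can be made against the parsed list-objects-v2 response. A minimal sketch, assuming the response has already been loaded into a dictionary; the function name is hypothetical.

```python
def keys_with_unexpected_storage_class(response, expected="FSX_ONTAP"):
    """Return keys whose StorageClass differs from the expected value."""
    return [
        obj["Key"]
        for obj in response.get("Contents", [])
        if obj.get("StorageClass") != expected
    ]


# Response shape taken from the output shown above
response = {
    "Contents": [{
        "Key": "sensor-data/sensor_data.parquet",
        "Size": 252858,
        "StorageClass": "FSX_ONTAP",
    }]
}
print(keys_with_unexpected_storage_class(response))
```

An empty list means every listed object is served from the FSx volume.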

Step 5: Create Glue Database and Table

aws glue create-database \
  --database-input '{"Name": "fsxn_analytics"}' \
  --region <YOUR_REGION>

You can either run a Glue Crawler for automatic schema discovery (recommended by the AWS tutorial), or create the table manually via Athena:

CREATE EXTERNAL TABLE fsxn_analytics.sensor_data (
  `timestamp` TIMESTAMP,
  sensor_id STRING,
  temperature DOUBLE,
  humidity DOUBLE,
  pressure DOUBLE,
  status STRING
)
STORED AS PARQUET
LOCATION 's3://<AP_ALIAS>/sensor-data/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
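If you generate the manual DDL for several datasets, a small builder keeps the statements consistent. This is a sketch under my own assumptions: create_table_ddl is a hypothetical helper, and quoting every column name with backticks is a defensive choice because some names, such as timestamp, appear in Athena's list of reserved DDL keywords.

```python
def create_table_ddl(database, table, columns, location):
    """Render a CREATE EXTERNAL TABLE statement for a Parquet dataset."""
    if not location.endswith("/"):
        location += "/"  # LOCATION should point at a prefix, not a single object
    column_lines = ",\n".join(
        f"  `{name}` {athena_type}" for name, athena_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl(
    "fsxn_analytics",
    "sensor_data",
    [("timestamp", "TIMESTAMP"), ("sensor_id", "STRING"), ("temperature", "DOUBLE")],
    "s3://<AP_ALIAS>/sensor-data",
)
print(ddl)
```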

Step 6: Query with Athena

Basic aggregation

SELECT
  sensor_id,
  COUNT(*) AS readings,
  ROUND(AVG(temperature), 2) AS avg_temp,
  ROUND(AVG(humidity), 2) AS avg_humidity,
  SUM(CASE WHEN status = 'critical' THEN 1 ELSE 0 END) AS critical_count
FROM fsxn_analytics.sensor_data
GROUP BY sensor_id
ORDER BY critical_count DESC;

Verified result

sensor_id | readings | avg_temp | avg_humidity | critical_count
----------|----------|----------|--------------|---------------
sensor_A  |    2027  |   24.89  |    59.84     |      68
sensor_B  |    1986  |   25.11  |    60.23     |      62
sensor_C  |    2013  |   24.95  |    59.91     |      59
sensor_E  |    2000  |   24.98  |    60.02     |      56
sensor_D  |    1974  |   25.03  |    60.15     |      55

Query time: 1.46 seconds | Data scanned: 67 KB | Engine: Athena v3

Observed Behavior: Query Results Written to the FSx S3 Access Point

The AWS tutorial states:

"Athena reads data from your FSx for ONTAP volume through the access point. Athena query results are written to the Amazon S3 results bucket, not back to the FSx for ONTAP volume."

In my validation, however, setting OutputLocation to the FSx for ONTAP S3 Access Point alias succeeded and wrote the .csv and .metadata files back to the FSx volume:

aws athena start-query-execution \
  --query-string "SELECT 1 AS test" \
  --result-configuration \
    "OutputLocation=s3://<AP_ALIAS>/athena-results/" \
  --work-group primary \
  --region <YOUR_REGION>

Result: SUCCEEDED in 584ms

The result files appeared on the FSx volume and were immediately accessible via NFS.

Treat this as observed behavior from my environment, not a general production recommendation. The conservative production pattern is:

Source data: FSx for ONTAP S3 Access Point
Athena query results: Standard S3 bucket (as documented)

The experimental pattern validated in this post:

Source data: FSx for ONTAP S3 Access Point
Athena query results: FSx for ONTAP S3 Access Point (observed to work, not documented)

Validate this in your own environment before relying on it.

Governance warning: Do not enable experimental query result output to FSx S3 AP for sensitive datasets unless query result retention, encryption, audit evidence, and file-system permissions are reviewed. Query results may contain derived sensitive information. For sensitive datasets, experimental result output should require approval from the data owner, security owner, and workload owner.

Performance Characteristics

Metric	Observed	Notes
Simple SELECT query	584 ms	Includes result write
Aggregation (10K rows, 67 KB scanned)	1.46 s	GROUP BY over 5 sensor IDs with 4 aggregate expressions
Data scan cost	Standard Athena pricing	$5 per TB scanned
Storage class	FSX_ONTAP	Confirmed in ListObjects

Performance note
These numbers validate functional compatibility, not performance limits. The dataset is intentionally small: 10K rows in a single Parquet file of about 250 KB, of which Athena scanned 67 KB. For real analytics workloads, test with realistic file sizes, object counts, partition layouts, concurrent queries, and FSx provisioned throughput. The throughput available through the S3 API depends on the FSx file system's provisioned throughput capacity (AWS documentation).
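The scan cost in the table follows from the bytes scanned. Below is a rough estimator, assuming the $5 per TB rate quoted above, a 10 MB per-query minimum, rounding up to whole megabytes, and binary units. Those billing details are my reading of Athena's public pricing and should be checked for your Region; estimate_scan_cost is a hypothetical helper.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def estimate_scan_cost(bytes_scanned, usd_per_tb=5.0, min_billed_mb=10):
    """Estimate the Athena scan charge for one query, in USD."""
    # Round up to whole megabytes, then apply the assumed per-query minimum
    billed_mb = max(math.ceil(bytes_scanned / MB), min_billed_mb)
    return billed_mb * MB / TB * usd_per_tb


# 67 KB scanned falls under the assumed 10 MB minimum
print(f"{estimate_scan_cost(67 * 1024):.8f} USD")
```

At this dataset size the minimum dominates, so the figure says nothing about cost at scale; rerun the estimate with the scanned bytes from realistic queries.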

S3 API Compatibility Boundary

FSx for ONTAP S3 Access Points expose file data through S3 object APIs, but they should not be treated as standard S3 buckets.

The safe mental model is:

Use S3 APIs for object read/write access to files on FSx
Use Glue and Athena for read-oriented analytics
Do not assume S3 bucket-level features exist (event notifications, versioning, lifecycle policies)
Do not assume lakehouse commit semantics (rename, conditional writes)
Validate every platform integration separately

In this article, the verified pattern is read-oriented analytics over Parquet/CSV/JSON files. Transactional table formats and commit protocols are outside the safe default boundary.

Compatibility Matrix

Legend for the "Validated by" column:

This validation: Actually executed commands or queries in this environment and confirmed the result
Supported operations review: Confirmed based on the supported operations documentation or official tutorial
Supported operations review required: Not yet confirmed; additional validation needed before use

Capability	Status	Validated by	Notes
ListObjectsV2	✅ Verified	This validation	S3 AP alias worked
GetObject (Parquet scan)	✅ Verified	This validation	Athena v3
PutObject (small result file)	⚠️ Observed	This validation	Not documented as Athena result pattern
Glue table over S3 AP	✅ Verified	This validation	Manual DDL and Crawler
CTAS to S3 AP	❌ Failed in validation	This validation	Not part of the documented tutorial pattern; use standard S3 output
Delta Lake writes	❌ Not recommended	Supported operations review	Commit protocol depends on rename/atomic semantics not available
Hudi/Iceberg writes	❌ Not recommended	Supported operations review	Requires commit semantics beyond simple object read
S3 bucket event notifications	❌ Not part of verified pattern	Supported operations review required	Do not assume bucket-level eventing; validate against supported operations

CTAS is a write-path pattern, not just a read query. Treat CTAS separately from read-oriented SELECT validation because it writes new table data to a target S3 location and may leave partial/orphaned files on failure. CTAS should not be included in the initial read-oriented validation scope.

Transactional lakehouse formats may require semantics beyond simple object read/write, such as:

Atomic commit behavior
Rename or move-like commit operations
Conditional writes (If-None-Match)
Manifest consistency
Concurrent writer coordination
Cleanup of partial/orphaned files

This article does not validate those semantics. It validates read-oriented analytics over existing files.

Governance and Compliance Considerations

This pattern keeps the source files on FSx for ONTAP, but it does not remove the need for data governance.

Before using this pattern with regulated or sensitive datasets, review:

Data classification of source files
IAM and S3 Access Point policy scope (least privilege)
File system identity mapped to the access point (UNIX/Windows user permissions apply)
Glue Data Catalog permissions (who can see the table metadata)
Athena workgroup controls (query limits, result encryption)
Query result location and retention (results may contain derived sensitive data)
CloudTrail / audit evidence requirements
Snapshot, backup, retention, and deletion policy

Query results can be more sensitive than the original dataset because they may aggregate, filter, or derive new information. Apply encryption, retention, and access controls to the Athena result location as carefully as the source dataset.

This article is a technical validation, not a compliance attestation.

Production Controls Checklist

For regulated or sensitive datasets, define the following before production use:

[ ] Athena workgroup result location (standard S3 bucket)
[ ] Whether workgroup settings override client-side result settings
[ ] Query result encryption mode and KMS key ownership
[ ] Query result retention and deletion policy
[ ] IAM principals allowed to query the Glue table
[ ] File-system identity mapped to the S3 Access Point (dedicated, not root)
[ ] Audit evidence approach defined and validated (e.g., CloudTrail coverage for the S3 Access Point where applicable, with sample events captured as PoC evidence)
[ ] Approval process for enabling experimental result output to FSx S3 AP

For regulated workloads, consider enabling Athena workgroup override so that query result location and encryption cannot be changed by client-side settings. This prevents individual clients from changing where query results are written or how they are encrypted.
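The override discussed above corresponds to the workgroup's enforce setting. As a sketch, the dictionary below is shaped for the Configuration argument of the Athena create_work_group API as exposed by boto3. The function name, bucket, and KMS key ARN are placeholders, and no AWS call is made here.

```python
def enforced_workgroup_configuration(output_location, kms_key_arn, scan_limit_bytes=None):
    """Build a workgroup Configuration that clients cannot override."""
    config = {
        # Workgroup settings take precedence over client-side result settings
        "EnforceWorkGroupConfiguration": True,
        "ResultConfiguration": {
            "OutputLocation": output_location,
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": kms_key_arn,
            },
        },
    }
    if scan_limit_bytes is not None:
        # Optional per-query scan limit
        config["BytesScannedCutoffPerQuery"] = scan_limit_bytes
    return config


config = enforced_workgroup_configuration(
    "s3://example-athena-results/fsxn/",                      # placeholder bucket
    "arn:aws:kms:ap-northeast-1:111122223333:key/EXAMPLE",    # placeholder key ARN
    scan_limit_bytes=10 * 1024 ** 3,
)
print(config["EnforceWorkGroupConfiguration"])
```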

For regulated workloads, experimental writeback should be disabled by default and enabled only after explicit approval from the data owner, security owner, and workload owner.

Experimental writeback may be enabled only when:

Approval scope is documented
Output path is isolated from source data
Encryption and retention are defined for the output path
Cleanup and rollback procedures are documented
Review expiration date is set

Minimum audit evidence artifacts for PoC completion:

Scope statement: what the audit evidence demonstrates and what it does not (e.g., "validates access path and query result control for PoC scope; does not demonstrate full production compliance")
Access path description (IAM → AP policy → file-system identity)
Sample successful read event
Sample denied access event (if applicable)
Query result location configuration
Encryption configuration
Workgroup override setting (if used)
Reviewer sign-off (name, role, date, decision)

30-Minute Validation Flow

Create or verify the FSx S3 Access Point (AVAILABLE lifecycle)
Write one Parquet file through NFS to the analytics path
Confirm StorageClass: FSX_ONTAP with list-objects-v2
Create the Glue table (manual DDL or crawler)
Run one Athena query
Capture the validation artifacts (see below)
Decide Go / No-Go using the PoC Success Criteria

First Success Path

If you are validating this for the first time, keep the scope small.

Expected outcome:

One Parquet file written through NFS is visible through the S3 Access Point
Glue table creation or crawler schema discovery succeeds
Athena can query the file in place
Query result location behavior is validated and documented
NFS/SMB clients can still access the original file
IAM and file-system identity boundaries are understood

Do not start with Delta Lake, Hudi, Iceberg writes, large scans, or concurrent workloads. Prove the read path first.

PoC Success Criteria

Minimum success:

S3 Access Point attachment is AVAILABLE
ListObjectsV2 returns the expected test file
Glue table points to the S3 AP alias
Athena query succeeds and returns correct results
Results are reproducible from a clean workgroup/session

Operational success:

IAM role and S3 AP policy are scoped to the analytics roles
Athena workgroup controls are defined
Query result location and retention are documented
Dataset size and scan cost are measured
FSx throughput impact is measured during query
Existing NFS/SMB application workload impact is measured during Athena queries

Go / No-Go criteria:

Go: Read-only analytics on Parquet/CSV/JSON works with acceptable latency and cost
No-Go: Workload requires Delta/Hudi/Iceberg write commits through the S3 AP
No-Go: Platform governance requires Unity Catalog external locations and the platform cannot yet authorize the S3 AP (see Part 2)

Performance Test Plan

Note: This section defines the performance test plan and metrics to collect. It does not present benchmark results. Actual benchmark outputs will be added under verification-pack/ after validation runs are completed.

The next validation should include:

1 GB / 10 GB / 100 GB datasets
Many small files vs fewer large Parquet files
Partitioned layout (date=YYYY-MM-DD/sensor_id=...)
Concurrent Athena queries
Different FSx throughput capacity settings (128 / 256 / 512+ MBps)
NFS writer activity during Athena scans
Standard S3 result bucket vs observed FSx S3 AP result output

The goal is to separate Athena scan behavior, Glue metadata behavior, and FSx provisioned-throughput impact.
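For the partitioned layout in the list above, test files need Hive-style keys. A minimal sketch of how such a key could be built; partitioned_key and the file name are hypothetical, and the date=.../sensor_id=... scheme is the one named in the plan.

```python
from datetime import datetime


def partitioned_key(prefix, ts, sensor_id, filename):
    """Build a Hive-style key: <prefix>/date=YYYY-MM-DD/sensor_id=<id>/<file>."""
    return f"{prefix.strip('/')}/date={ts:%Y-%m-%d}/sensor_id={sensor_id}/{filename}"


print(partitioned_key("sensor-data", datetime(2026, 1, 1, 0, 5), "sensor_A", "part-0.parquet"))
```

The same relative path applies on the NFS side, so a writer can create the directories under the mount and the keys appear through the access point.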

Additional request pattern considerations:

Sequential vs parallel S3 API reads
Prefix layout impact on listing performance
Small object listing overhead
Repeated query behavior with warm Glue/Athena metadata

Metrics collection sources:

FSx metrics: CloudWatch (FSx namespace)
Athena query metrics: get-query-execution API (EngineExecutionTimeInMillis, DataScannedInBytes)
Client-side latency: CLI timing or SDK instrumentation
Error/timeout sources: Athena query execution status and failure reason, client-side logs, application-side timeout logs, CloudTrail events where applicable

Record results separately for cold runs (at least 1), warm-metadata runs (at least 1), and repeated runs (at least 3 executions). Report the average, minimum, maximum, and any notable outliers.
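The per-group summary can be computed directly from get-query-execution responses. The sketch below assumes each item is the QueryExecution object from that API, using the two Statistics fields named above; summarize_runs is a hypothetical helper and the sample values are made up for illustration.

```python
from statistics import mean


def summarize_runs(executions):
    """Summarize runtime and scanned bytes for one group of query executions."""
    times = [e["Statistics"]["EngineExecutionTimeInMillis"] for e in executions]
    scanned = [e["Statistics"]["DataScannedInBytes"] for e in executions]
    return {
        "runs": len(times),
        "avg_ms": mean(times),
        "min_ms": min(times),
        "max_ms": max(times),
        "avg_scanned_bytes": mean(scanned),
    }


# Illustrative values only, not measurements from this article
repeated_runs = [
    {"Statistics": {"EngineExecutionTimeInMillis": 1200, "DataScannedInBytes": 68608}},
    {"Statistics": {"EngineExecutionTimeInMillis": 900, "DataScannedInBytes": 68608}},
    {"Statistics": {"EngineExecutionTimeInMillis": 900, "DataScannedInBytes": 68608}},
]
print(summarize_runs(repeated_runs))
```

Keeping cold, warm-metadata, and repeated runs in separate lists and summarizing each one avoids averaging away the cold-start effect.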

Validation Artifacts

For reproducibility, capture the following artifacts in your PoC:

S3 Access Point attachment lifecycle output (describe-s3-access-point-attachments)
list-objects-v2 output showing StorageClass: FSX_ONTAP
Glue table DDL or crawler output
Athena query execution ID
Athena query runtime and scanned bytes
Query result location and file listing
NFS listing showing the original source file is unchanged
IAM policy and access point policy used for the test

What's Next

In Part 2, I'll cover what happens when you try to connect Databricks to FSx for ONTAP S3 Access Points — where Unity Catalog's session policy, seccomp filters, and platform security boundaries create a significantly more complex picture.

References

This article is part of the "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive" series. All tests were performed on a real AWS environment with FSx for ONTAP (ONTAP 9.17.1, ap-northeast-1) in May 2026.

Scope reminder: This article verifies a limited read-oriented scenario. It does not validate production readiness, write-path behavior, distributed executor-scale processing, or all third-party analytics engines.

Article update plan: v1.0 (current) — Scope, observed behavior, validation plan. Future updates: v1.1 — Benchmark results with realistic datasets. v1.2 — Security Verified candidate review. v1.3 — Production workload isolation test results.

Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Thu, 21 May 2026 09:12:09 +0000

TL;DR

We built a direct Lambda-to-Grafana Cloud pipeline that ships FSx for ONTAP audit logs to Loki without an intermediate OTel Collector. Three Lambda functions cover all event sources:

FSx for ONTAP audit logs → EventBridge Scheduler (every 5 min) → Lambda (polls & reads via S3 Access Point) → OTLP Gateway → Loki
EMS webhooks (ransomware alerts, quota warnings) → API Gateway → Lambda → OTLP Gateway → Loki
FPolicy file operations (real-time CIFS/SMB events) → ECS Fargate → SQS → Bridge Lambda → EventBridge → Lambda → OTLP Gateway → Loki

Everything is CloudFormation-templated, fully parameterized with no hardcoded values, and deployable with a single script. This is a Grafana-specific direct integration by design; use the Collector path from Part 5 when you need backend portability.

If you only want to validate the path quickly, jump to First Success Path and deploy the audit poller first.

This is the single-backend counterpart to Part 5: simpler when Grafana Cloud is the chosen destination, less flexible when backend portability, enrichment, redaction, or multi-backend routing is required.

Why Direct Send (Without OTel Collector)?

In Part 5, we showed how the OTel Collector decouples Lambda from backends. That's the right choice when you need multi-backend delivery or vendor migration flexibility.

But if Grafana Cloud is your single observability platform and your goal is a simple serverless path, direct OTLP can be a good starting point. For production pipelines that need richer buffering, metadata enrichment, redaction, or routing, Grafana recommends an Alloy / Collector-based architecture.

Approach	Components	Latency	Cost
OTel Collector	Lambda → Collector (ECS/EC2) → Grafana	+50-100ms	Collector compute
Direct send	Lambda → Grafana OTLP Gateway	Minimal	Lambda only

The direct path is simpler, cheaper, and has fewer failure points. You can always graduate to the Collector path later (Part 5 shows how). Direct send is a good fit when operational simplicity is more important than in-pipeline enrichment, redaction, buffering, and multi-backend routing. If those requirements become mandatory, move the same OTLP payload model behind Alloy or the OpenTelemetry Collector.

Direct send reduces moving parts, but it also removes the Collector / Alloy queueing layer. For production, decide whether Lambda retry and DLQ are sufficient, or whether you need SQS buffering, DLQ replay, or the Collector / Alloy path for stronger delivery guarantees during endpoint outages or throttling.

Delivery guarantee decision (see full pattern guide):

Quickstart (this template): Scheduler retry + Scheduler DLQ + Lambda reserved concurrency + checkpoint retry

Medium volume: add Lambda failure destination and operational replay procedures

Higher reliability: insert SQS before shipping, or place Alloy / OTel Collector behind Lambda for batching, retry with persistent queue, transform, redaction, and multi-backend routing

Multi-backend or redaction/routing: use Part 5 Collector path

Architecture

┌─────────────────────────────────────────────────────┐
│ Event Sources                                        │
├─────────────────────────────────────────────────────┤
│                                                      │
│  EventBridge Scheduler                               │
│  rate(5 minutes) ──→ Lambda                          │
│                       │ lists new files via           │
│                       │ S3 Access Point              │
│                       │ (checkpoint in SSM)          │
│                       ▼                              │
│                OTLP Gateway                          │
│                (Grafana Cloud)                        │
│                       │                              │
│  EMS Webhook          │                              │
│  ──→ API GW ──→ Lambda ────────────┤                │
│     (ems_handler)                   │                │
│                                     ▼                │
│  FPolicy                           Loki             │
│  ──→ ECS Fargate ──→ SQS          (Explore,        │
│  ──→ Bridge Lambda                  Dashboard)      │
│  ──→ EventBridge                                    │
│  ──→ Lambda (fpolicy_handler) ─────────────────────┤
└─────────────────────────────────────────────────────┘

The audit log path uses a polling pattern: EventBridge Scheduler invokes Lambda every 5 minutes. Lambda lists new objects via the S3 Access Point, reads and processes them, then updates an SSM Parameter Store checkpoint to track progress. This avoids reliance on S3 Event Notifications, which are not supported by FSx for ONTAP S3 Access Points.
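The core of the polling step is deciding which listed objects are new relative to the checkpoint. The sketch below shows one way to do that as a pure function. It is not the repository's implementation: the (LastModified, Key) checkpoint format, the function name, and the sample keys are my assumptions, and timestamps are assumed to be ISO 8601 UTC strings so that they sort correctly as text.

```python
def select_new_objects(objects, checkpoint=None):
    """Return objects newer than the checkpoint, plus the updated checkpoint.

    The checkpoint is a (LastModified, Key) tuple, so objects that share a
    timestamp are still ordered deterministically by key.
    """
    ordered = sorted(objects, key=lambda o: (o["LastModified"], o["Key"]))
    if checkpoint is not None:
        ordered = [o for o in ordered if (o["LastModified"], o["Key"]) > checkpoint]
    if not ordered:
        return [], checkpoint  # nothing new; keep the old checkpoint
    last = ordered[-1]
    return ordered, (last["LastModified"], last["Key"])


listed = [
    {"Key": "audit/b.log", "LastModified": "2026-05-21T09:05:00Z"},
    {"Key": "audit/a.log", "LastModified": "2026-05-21T09:00:00Z"},
    {"Key": "audit/c.log", "LastModified": "2026-05-21T09:05:00Z"},
]
new_objects, new_checkpoint = select_new_objects(
    listed, ("2026-05-21T09:00:00Z", "audit/a.log")
)
print([o["Key"] for o in new_objects], new_checkpoint)
```

In a Lambda, the checkpoint tuple would be serialized into the SSM parameter only after the selected objects have been shipped, so a failed invocation retries the same files. A file that is still being appended to after it was processed is not handled by this sketch.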

The same S3 Access Point boundary can be reused for other automation patterns (AI/ML, analytics, compliance archival) because the audit files remain on FSx for ONTAP while Lambda reads them through standard S3 object APIs — no data copy or NFS/SMB mount required.

This pattern does not replace ONTAP audit, EMS, or FPolicy configuration; it provides an AWS-native delivery and visualization layer for those ONTAP-native signals.

For business-critical workloads such as SAP, databases, VDI, or enterprise file services, treat this pipeline as an observability and evidence layer. It complements, but does not replace, workload-specific HA, backup, restore, and DR designs.

Use cases this unlocks:

Investigate file access activity for FSx for ONTAP-hosted enterprise file shares
Monitor available ONTAP EMS alerts, such as ransomware-related events, quota warnings, and storage/system events
Correlate audit logs, EMS, and FPolicy file operations in a single Grafana dashboard
Provide a lightweight observability path for SAP, database, VDI, and file service workloads using FSx for ONTAP
Start with direct OTLP delivery and graduate to Alloy / Collector when governance or multi-backend routing is required

The FPolicy path has two Lambda roles: a bridge Lambda that converts ECS/FPolicy server SQS output into EventBridge events, and fpolicy_handler.py, which ships those normalized EventBridge events to Grafana Cloud.

Key Discovery: OTLP Gateway, Not Loki Push API

During E2E verification, the Loki Push API returned HTTP 530 in my trial account. The OTLP Gateway worked reliably in this project and is the recommended Grafana Cloud OTLP ingestion path.

For logs, Grafana Cloud routes OTLP log data to Loki, where it becomes queryable with LogQL.

Our Lambda auto-detects the endpoint mode from the URL:

def _is_otlp_endpoint(endpoint: str) -> bool:
    """Detect Grafana OTLP Gateway or generic OTLP/HTTP logs endpoint."""
    endpoint = endpoint.rstrip("/")
    return (
        "otlp-gateway" in endpoint
        or endpoint.endswith("/otlp")
        or endpoint.endswith("/otlp/v1/logs")
        or endpoint.endswith("/v1/logs")
    )

USE_OTLP = _is_otlp_endpoint(LOKI_ENDPOINT)

When using the OTLP Gateway, configure LOKI_ENDPOINT as the base OTLP endpoint ending in /otlp. The Lambda appends /v1/logs when sending logs:

# Configure as base endpoint (Lambda appends /v1/logs)
LOKI_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
# Lambda POSTs to: https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp/v1/logs

The handler also accepts the full path (/otlp/v1/logs) without double-appending.
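The append-without-doubling behavior described above can be sketched in a few lines. This is a minimal illustration, not the repository's code; the function name `build_logs_url` is hypothetical.

```python
def build_logs_url(endpoint: str) -> str:
    """Return the OTLP/HTTP logs URL for a base or full endpoint."""
    base = endpoint.rstrip("/")
    if base.endswith("/v1/logs"):
        # Full logs path already configured; do not append it a second time
        return base
    return f"{base}/v1/logs"
```

With this shape, both the base endpoint ending in /otlp and the full /otlp/v1/logs path resolve to the same POST target.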

Endpoint	URL Pattern	Status
OTLP Gateway (preferred)	`https://otlp-gateway-prod-<region>.grafana.net/otlp`	✅ Recommended by Grafana Cloud docs; verified in this project
Loki Push API (fallback)	`https://logs-prod-<region>.grafana.net/loki/api/v1/push`	⚠️ May behave differently by account state; returned 530 in my trial validation
Self-hosted Loki OTLP	`https://<loki-host>/otlp`	Requires Loki OTLP ingestion support and structured metadata configuration; Loki 3.0+ enables structured metadata by default

Authentication: Basic Auth with base64 Encoding

Grafana Cloud uses Basic Auth for both endpoints. The critical detail: the value is base64(instanceId:apiToken), not plain text concatenation.

from base64 import b64encode

instance_id = "123456"  # From Grafana Cloud console
api_token = "glc_..."   # logs:write scope

credentials = f"{instance_id}:{api_token}"
auth_header = f"Basic {b64encode(credentials.encode()).decode()}"

Credentials are stored in AWS Secrets Manager as JSON:

{"instance_id": "<id>", "api_key": "<token>"}

The Lambda reads this at cold start and caches the auth header for subsequent invocations. For production, use the shared auth_cache.py module, which provides TTL-based caching with automatic reload on 401/403, so credential rotation does not require waiting for a new Lambda execution environment.
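To make the TTL and reload-on-401/403 idea concrete, here is a minimal sketch of such a cache. It is not the contents of auth_cache.py: the class name `AuthHeaderCache`, the injected `load_secret` callable (standing in for a Secrets Manager read), and the default TTL are all assumptions for illustration.

```python
import time
from base64 import b64encode
from typing import Callable, Optional


class AuthHeaderCache:
    """Cache a Basic Auth header with a TTL and a forced reload after auth errors."""

    def __init__(self, load_secret: Callable[[], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load_secret = load_secret  # Hypothetical: wraps the Secrets Manager read
        self._ttl = ttl_seconds
        self._clock = clock
        self._header: Optional[str] = None
        self._loaded_at = 0.0

    def get(self) -> str:
        """Return the cached header, reloading when missing or older than the TTL."""
        expired = self._clock() - self._loaded_at >= self._ttl
        if self._header is None or expired:
            secret = self._load_secret()
            raw = f"{secret['instance_id']}:{secret['api_key']}"
            self._header = "Basic " + b64encode(raw.encode()).decode()
            self._loaded_at = self._clock()
        return self._header

    def on_response(self, status_code: int) -> None:
        """Drop the cached header after 401/403 so the next get() reloads it."""
        if status_code in (401, 403):
            self._header = None
```

A sender would call `on_response` with each HTTP status and retry once with a fresh `get()` when the status was 401 or 403.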

Internally, normalized records are now converted directly to OTLP as the primary path. Loki Push formatting is kept only as a fallback mode. This aligns with Part 5's "OTLP as producer contract" principle. For the full OTLP resource/log-record/body mapping and fsxn.* attribute naming policy, see the Grafana Operations Guide.

The Three Lambda Handlers

1. FSx Audit Log Handler via S3 Access Point (`handler.py`)

Polls for new FSx ONTAP audit log files via S3 Access Point, parses JSON/EVTX, and ships to Grafana Cloud. Uses SSM Parameter Store to checkpoint progress between invocations.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager

    if event.get("source") == "scheduler":
        # Polling mode: list new files, process, update checkpoint
        prefix = event.get("prefix", "")  # Key prefix from the scheduler input
        last_key = get_checkpoint()  # SSM Parameter Store
        new_keys = list_new_keys(S3_ACCESS_POINT_ARN, prefix, last_key,
                                 limit=MAX_KEYS_PER_RUN)

        for key in new_keys:
            if context.get_remaining_time_in_millis() < SAFETY_THRESHOLD_MS:
                break  # Stop early, resume on next scheduled run
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=key)
            logs = parse_logs(raw["Body"].read(), key)
            ship_to_grafana(logs, key, auth_header)  # Raises on failure
            set_checkpoint(key)  # Only after confirmed delivery
    else:
        # Manual test mode using an S3-event-shaped payload
        for record in extract_s3_records(event):
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=record["key"])
            logs = parse_logs(raw["Body"].read(), record["key"])
            ship_to_grafana(logs, record["key"], auth_header)

Query in Grafana Explore:

{service_name="fsxn-audit"} | json | Operation="create"

2. EMS Webhook Handler (`ems_handler.py`)

Receives ONTAP EMS events via API Gateway, parses with the shared EMS parser layer, and forwards to Grafana.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager
    body = event.get("body", "")
    normalized = parse_ems_event(body)  # Shared Lambda Layer

    if USE_OTLP:
        payload = format_for_otlp(normalized)
    else:
        payload = format_for_loki(normalized)

    ship_to_grafana(payload, auth_header)

Labels: {service_name="fsxn-ems", source="ontap", severity="alert"}

Security note: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Use API Gateway authorization controls such as an API key, IAM authorization, Lambda authorizer, resource policy, WAF, or source IP restrictions based on your network design. The quickstart template uses AuthorizationType: NONE for simplicity — add appropriate controls before production use. See the webhook security guide for a full comparison of auth modes and a recommended shared-secret Lambda authorizer pattern.

3. FPolicy Handler (`fpolicy_handler.py`)

Subscribes to EventBridge events from the FPolicy ECS Fargate server and forwards file operation events.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager
    detail = event.get("detail")  # EventBridge event

    if USE_OTLP:
        payload = format_for_otlp(detail)
    else:
        payload = format_for_loki(detail)

    ship_to_grafana(payload, auth_header)

Labels: {service_name="fsxn-fpolicy", source="ontap", operation="create"}

CloudFormation: Three Templates, Zero Hardcoded Values

Each template is fully parameterized:

Template	Purpose	Key Parameters
`template.yaml`	FSx audit log poller Lambda	S3AccessPointArn, GrafanaCredentialsSecretArn, LokiEndpoint, ScheduleExpression
`template-ems.yaml`	EMS webhook Lambda	GrafanaCredentialsSecretArn, LokiEndpoint, EmsParserLayerArn
`template-fpolicy.yaml`	FPolicy EventBridge Lambda	GrafanaCredentialsSecretArn, LokiEndpoint, EventBusName

The LokiEndpoint parameter accepts both OTLP Gateway and Loki Push API URLs — the Lambda auto-detects the mode. The quickstart template also sets Lambda reserved concurrency to 1 and provisions a Scheduler DLQ with retry policy to avoid overlapping poller runs and preserve failed scheduled invocations. Processing bounds (MAX_KEYS_PER_RUN, SAFETY_THRESHOLD_MS) are configured via Lambda environment variables.

Trigger Model: EventBridge Scheduler Polling

FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge ObjectCreated events. Instead, this integration uses an EventBridge Scheduler polling pattern:

EventBridge Scheduler invokes the Lambda every 5 minutes (configurable via ScheduleExpression parameter)
Lambda lists new files via ListObjectsV2 on the S3 Access Point, using StartAfter to skip already-processed keys
Lambda reads and processes each new file, shipping logs to Grafana Cloud
Checkpoint (SSM Parameter Store) tracks the last successfully processed S3 key — on the next invocation, only newer files are processed

This pattern is simple, cost-effective, and works with AWS S3 API-compatible read paths such as FSx for ONTAP S3 Access Points. The trade-off is polling latency (up to 5 minutes by default) vs. the near-real-time delivery of event-driven triggers.
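The listing step can be sketched as follows. This is an illustration of the StartAfter pattern rather than the repository's `list_new_keys`; it assumes a boto3-style S3 client is passed in and that the access point returns keys in lexical order, as standard S3 listings do.

```python
from typing import Any, List


def list_new_keys(s3_client: Any, access_point_arn: str, prefix: str,
                  last_key: str, limit: int = 100) -> List[str]:
    """List up to `limit` keys that sort after `last_key`, in listing order."""
    keys: List[str] = []
    kwargs = {"Bucket": access_point_arn, "Prefix": prefix}
    if last_key:
        kwargs["StartAfter"] = last_key  # Skip keys at or before the checkpoint
    while len(keys) < limit:
        resp = s3_client.list_objects_v2(
            MaxKeys=min(1000, limit - len(keys)), **kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        # Continue from the token; S3 ignores StartAfter once a token is supplied
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]
    return keys[:limit]
```

Capping MaxKeys at the remaining budget keeps each run from listing far more keys than it is allowed to process.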

CloudTrail alternative: CloudTrail data events do work with FSx for ONTAP S3 Access Points (confirmed by NetApp Workload Factory's Journal table feature). However, they add delivery latency (5 to 15 minutes end to end in my validation) and cost $0.10 per 100,000 events, which makes the polling pattern the better default for this use case. See the CloudTrail trigger alternative for a full analysis and CloudFormation example.

# CloudFormation: EventBridge Scheduler with retry and DLQ
AuditLogSchedule:
  Type: AWS::Scheduler::Schedule
  Properties:
    ScheduleExpression: !Ref ScheduleExpression  # default: rate(5 minutes)
    FlexibleTimeWindow:
      Mode: 'OFF'
    Target:
      Arn: !GetAtt LogShipperFunction.Arn
      RoleArn: !GetAtt SchedulerRole.Arn
      Input: !Sub '{"source": "scheduler", "s3_access_point_arn": "${S3AccessPointArn}", "prefix": "${S3KeyPrefix}"}'
      RetryPolicy:
        MaximumRetryAttempts: 2
        MaximumEventAgeInSeconds: 3600
      DeadLetterConfig:
        Arn: !GetAtt SchedulerDLQ.Arn

The handler also accepts S3 event format for manual testing via aws lambda invoke, so you can still test individual files without waiting for the scheduler.

Checkpoint Semantics

The quickstart uses a simple high-watermark checkpoint: the last successfully processed object key is stored in SSM Parameter Store, and the next run lists keys after that value.

This works when audit log object keys are monotonically increasing and immutable. For production, validate your audit log naming and rotation behavior. If files can arrive late, be overwritten, or appear out of lexical order, use a stronger checkpoint model such as:

Keeping a short lookback window
Deduplicating by object key + ETag or LastModified
Storing per-object processing state in DynamoDB
Updating the checkpoint only after confirmed Grafana delivery

The checkpoint is advanced only after Grafana returns a successful response for that object. If delivery fails after retries, the Lambda raises an error and the next scheduled run will retry from the last checkpoint.

Failure-path tests verify this behavior: if OTLP delivery returns failure after retries, the Lambda raises and the checkpoint does not advance past the failed object.

Files that parse successfully but contain no shippable records are treated as successfully processed and checkpointed; only delivery failures or parse errors prevent checkpoint advancement.

For production, add a poison-pill policy for files that repeatedly fail parsing or delivery; otherwise one bad file can block later audit logs when using a high-watermark checkpoint. See the Grafana operations guide for poison-pill handling, pipeline health alarms, and custom metrics.
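One possible shape for a poison-pill policy is sketched below. It is not taken from the operations guide: the function name, the `failure_counts` mapping (which would need to live in durable storage such as DynamoDB or SSM to survive across invocations), and the `quarantine` callback are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def process_with_poison_pill(keys: Iterable[str],
                             process: Callable[[str], None],
                             set_checkpoint: Callable[[str], None],
                             failure_counts: Dict[str, int],
                             quarantine: Callable[[str], None],
                             max_attempts: int = 3) -> List[str]:
    """Process keys in order; quarantine a key once it has failed max_attempts times."""
    quarantined: List[str] = []
    for key in keys:
        try:
            process(key)  # Parse and deliver; raises on failure
        except Exception:
            failure_counts[key] = failure_counts.get(key, 0) + 1
            if failure_counts[key] < max_attempts:
                raise  # Checkpoint stays put; the next run retries this key
            quarantine(key)  # Record the key for operator follow-up
            quarantined.append(key)
        else:
            failure_counts.pop(key, None)
        set_checkpoint(key)  # Advance past delivered or quarantined keys
    return quarantined
```

The trade-off is explicit: a quarantined file is skipped so later audit logs keep flowing, which means the quarantine record has to be monitored and replayed by someone.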

Use SSM Parameter Store for the quickstart high-watermark checkpoint. Move to DynamoDB when you need per-object state, deduplication, replay tracking, or concurrent workers.

Delivery semantics: This pipeline provides at-least-once delivery, not exactly-once. If a Lambda invocation succeeds in sending logs to Grafana but fails before updating the checkpoint (e.g., timeout or transient SSM error), the next run will re-process and re-send those objects. For most observability use cases, duplicate log entries are acceptable. If deduplication is required, implement it explicitly using object key + ETag, event ID, or payload hash in DynamoDB. Do not rely on backend-side deduplication as the primary correctness mechanism.
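As a rough sketch of explicit deduplication by object key + ETag, the helpers below filter a listing against a ledger of delivered objects. An in-memory set stands in for the DynamoDB table here, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

ObjectId = Tuple[str, str]  # (object key, ETag)


def filter_unprocessed(objects: Iterable[Dict[str, str]],
                       processed: Set[ObjectId]) -> List[Dict[str, str]]:
    """Return objects whose (Key, ETag) pair has not been delivered yet."""
    return [o for o in objects if (o["Key"], o["ETag"]) not in processed]


def mark_processed(obj: Dict[str, str], processed: Set[ObjectId]) -> None:
    """Record a delivered object; call only after confirmed delivery."""
    processed.add((obj["Key"], obj["ETag"]))
```

Because the ETag is part of the identity, a file that is overwritten under the same key is treated as new, while a re-listed unchanged file is skipped.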

Avoid Overlapping Poller Runs

Because the audit-log poller is schedule-driven, overlapping Lambda executions can race on the same key range and checkpoint. The quickstart template sets ReservedConcurrentExecutions: 1 to prevent this.

For higher-volume production pipelines, use a distributed lock (e.g., DynamoDB conditional write) and per-object processing state instead of relying on single-concurrency.
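A DynamoDB conditional write can act as a simple lease-style lock. The sketch below assumes a boto3 DynamoDB client and a table keyed on `lock_id`; the table layout, attribute names, and function name are illustrative assumptions, not part of the quickstart templates.

```python
import time
from typing import Any, Optional


def try_acquire_lock(dynamodb: Any, table: str, lock_id: str, holder: str,
                     ttl_seconds: int = 300, now: Optional[float] = None) -> bool:
    """Try to take a lease; succeeds if no lease exists or the old one expired."""
    now_s = int(time.time() if now is None else now)
    try:
        dynamodb.put_item(
            TableName=table,
            Item={
                "lock_id": {"S": lock_id},
                "holder": {"S": holder},
                "expires_at": {"N": str(now_s + ttl_seconds)},
            },
            # Write only when the lock is absent or its lease has lapsed
            ConditionExpression="attribute_not_exists(lock_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(now_s)}},
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False  # Another worker holds an unexpired lease
```

The lease TTL should exceed the Lambda timeout so a lease cannot lapse while its holder is still running.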

The quickstart also configures EventBridge Scheduler with a retry policy (2 retries, 1-hour event age) and a dedicated DLQ. If a scheduled invocation is throttled or fails, the event is preserved in the Scheduler DLQ for visibility and replay.

These limits surface persistent failures quickly while avoiding unbounded retry storms. Increase them only if your Grafana endpoint outage tolerance and duplicate-handling strategy are defined.

Processing Bounds

The poller bounds work per invocation to avoid timeout-related checkpoint corruption:

Max keys per run (MAX_KEYS_PER_RUN, default: 100): caps the number of files processed in a single invocation
Safety threshold (SAFETY_THRESHOLD_MS, default: 30000): stops processing when remaining Lambda time falls below 30 seconds

Variable	Default	Purpose
`MAX_KEYS_PER_RUN`	`100`	Maximum audit log files processed per invocation
`SAFETY_THRESHOLD_MS`	`30000`	Stop processing before Lambda timeout

Tune these values after observing Lambda duration, checkpoint age, Scheduler DLQ depth, FSx S3 Access Point read throughput, and Grafana send latency.

Because the checkpoint advances after each successfully delivered object, the next scheduled run resumes safely from where the previous run stopped.

S3 API Compatibility Boundary

FSx for ONTAP S3 Access Points provide S3 object API access (GetObject, ListObjectsV2, etc.) to file data that remains on the FSx for ONTAP file system. They should not be assumed to have the same bucket-level features or eventing behavior as standard S3 buckets. In this integration, the important difference is eventing: the audit log path uses Scheduler polling instead of S3 Event Notifications.

Minimum Read-Path Permissions

For the audit-log Lambda, verify:

s3:ListBucket on the S3 Access Point ARN
s3:GetObject on the S3 Access Point object ARN ({arn}/object/*)
S3 Access Point policy allows the Lambda execution role
The file-system user associated with the access point has read permission on the audit log path
If the access point is VPC-restricted, the Lambda network path can reach the S3 endpoint

IAM resource ARN examples:

# List access (s3:ListBucket)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>

# Object read (s3:GetObject)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>/object/*
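The two resource ARNs above combine into an identity policy along these lines. The helper below is a sketch that builds the policy document as a Python dict; the function name and Sid values are illustrative, and the access point policy and file-system permissions still need to be checked separately.

```python
import json


def read_path_policy(access_point_arn: str) -> dict:
    """Build a least-privilege read policy for one S3 Access Point."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListViaAccessPoint",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": access_point_arn,
            },
            {
                "Sid": "ReadViaAccessPoint",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{access_point_arn}/object/*",
            },
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/fsxn-audit-ap"  # Example ARN
    print(json.dumps(read_path_policy(arn), indent=2))
```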

First Success Path

If this is your first deployment, start small:

# Deploy only the audit log poller
export MAX_KEYS_PER_RUN=1
export SAFETY_THRESHOLD_MS=30000
bash integrations/grafana/scripts/deploy.sh --audit-only

Then validate:

Confirm {service_name="fsxn-audit"} in Grafana Explore
Check the Scheduler DLQ is empty
Verify the SSM checkpoint advanced
Create the dashboard
Add EMS and FPolicy only after the audit path works (deploy.sh --all)

deploy.sh passes MAX_KEYS_PER_RUN and SAFETY_THRESHOLD_MS as Lambda environment variables. If unset, the template defaults (100 / 30000) are used.

The first validation should prove three things:

One audit file is visible in Grafana ({service_name="fsxn-audit"})
The SSM checkpoint advanced to the processed key
The Scheduler DLQ remains empty

One-Command Deploy and Cleanup

# Deploy all 3 stacks + update Lambda code (default is --all)
export GRAFANA_SECRET_ARN="arn:aws:secretsmanager:ap-northeast-1:<account>:secret:grafana/fsxn-loki-credentials-XXXXXX"
export S3_ACCESS_POINT_ARN="arn:aws:s3:ap-northeast-1:<account>:accesspoint/fsxn-audit-ap"
export LOKI_ENDPOINT="https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"

bash integrations/grafana/scripts/deploy.sh --all

The cleanup script removes CloudFormation stacks and optionally deletes synthetic test objects. It does not delete production FSx audit files through the FSx-attached S3 Access Point — those remain on the FSx file system. Pass --s3-bucket and --s3-prefix only if you uploaded test data to a regular S3 bucket during validation.

# Tear down everything (dependency-safe order)
bash integrations/grafana/scripts/cleanup.sh --all \
  --s3-bucket your-bucket --s3-prefix audit/svm-prod-01/

The cleanup script deletes stacks in dependency-safe order (API Gateway before Lambda) and handles DELETE_FAILED states gracefully.

LogQL Query Examples

High-cardinality fields such as UserName and ObjectName remain in the log body and are extracted at query time with | json; they are intentionally not promoted to Loki labels to avoid index bloat and cost.

Once logs arrive, Grafana Explore becomes your investigation tool:

# All audit logs
{service_name="fsxn-audit"}

# Filter by operation
{service_name="fsxn-audit"} | json | Operation="delete"

# Failed access attempts (security investigation)
{service_name="fsxn-audit"} | json | Result="Failure"

# EMS ransomware alerts
{service_name="fsxn-ems"} | json | event_name="arw.volume.state"

# FPolicy file operations
{service_name="fsxn-fpolicy"} | json | operation="create"

# Human-readable format
{service_name="fsxn-audit"} | json | line_format "{{.UserName}} {{.Operation}} {{.ObjectName}}"

# Log volume over time (for dashboards)
count_over_time({service_name="fsxn-audit"}[5m])

Dashboard: 4 Panels for Storage Observability

The repository includes a dashboard creation script, scripts/create-dashboard.sh, that provisions a Grafana dashboard via the Grafana HTTP API with four panels. The panel queries below are the exact queries the script generates, verified against this project's OTLP-ingested log shape:

Log Volume (Time series): count_over_time({service_name="fsxn-audit"}[5m])
Operations Breakdown (Pie chart): sum by (Operation) (count_over_time({service_name="fsxn-audit"} | json [1h]))
User Activity Top 10 (Bar gauge): topk(10, sum by (UserName) (count_over_time({service_name="fsxn-audit"} | json [1h])))
Failed Events (Time series): count_over_time({service_name="fsxn-audit"} | json | Result="Failure" [5m])

Alerting: Ransomware Detection and Security Monitoring

Beyond dashboards, the integration includes three Grafana alerting rules provisioned via scripts/create-alerts.sh.

The table below shows the alert conditions. The provisioning script wraps these into Grafana alert expressions using count/reduce/threshold steps.

Alert	Detection Query (alert condition)	Severity
Ransomware Detection (ARP)	`count_over_time({service_name="fsxn-ems"} | json |
Quota Soft Limit Exceeded	`count_over_time({service_name="fsxn-ems"} | json |
Failed Access Spike	`count_over_time({service_name="fsxn-audit"} | json |

The rules use Grafana's unified alerting format and are deployed to a "FSxN Alerts" folder. Configure contact points (Slack, PagerDuty, email) and notification policies in the Grafana UI to route alerts by severity or team label. The rule definitions are available as alerting/rules.yaml; see the alerting README for provisioning details, no-data behavior, contact point caveats, and threshold tuning guidance.

API compatibility: This script uses Grafana's Alerting Provisioning HTTP API (/api/v1/provisioning). Grafana 13+ introduces newer /apis routes while legacy /api routes remain available; check your Grafana Cloud version if provisioning fails. Provisioning alert rules does not automatically configure notification delivery — create or map contact points and notification policies before relying on these alerts for production response.

The sample rules treat "No data" as OK, because absence of matching ransomware, quota, or failed-access events is expected in normal operation. Query execution errors are routed as Error state for operator attention. These thresholds are starter defaults — tune them per SVM, workload, and normal user behavior before enabling production paging.

For production, monitor the pipeline itself: Scheduler DLQ depth, Lambda errors/throttles/duration, checkpoint age, and Grafana send failures.

Scheduler DLQ Replay

The Scheduler DLQ message is primarily an operational signal and replay payload. Because the poller uses a checkpoint, the next scheduled run may already retry the failed key range automatically.

When a scheduled invocation fails and lands in the Scheduler DLQ:

Inspect the DLQ message (contains the scheduler input payload)
Check the current checkpoint in SSM Parameter Store
Check whether a later scheduled run has already advanced the checkpoint and delivered the missed objects
If the checkpoint has advanced and Grafana shows the data, the failure was auto-recovered — delete the DLQ message
If the checkpoint has NOT advanced, the next scheduled run will retry automatically from the last checkpoint
For manual replay (if auto-retry is insufficient): invoke the Lambda directly with the scheduler payload, then delete the DLQ message

Before manually replaying a DLQ message, compare the DLQ payload with the current SSM checkpoint and Grafana ingestion state to avoid duplicate delivery.

For production, set a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 for the Scheduler DLQ.
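A sketch of that alarm, expressed as parameters for CloudWatch's PutMetricAlarm API, might look like the following. The alarm name, period, and SNS action are assumptions to adapt; only the metric name and the greater-than-zero condition come from the text above.

```python
def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm parameters that fire when the DLQ holds any message."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # Hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # An idle queue is not an alarm condition
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed, the alarm would be created with `boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params(queue_name, topic_arn))`.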

Lessons Learned

#	Lesson	Impact
1	Grafana Cloud OTLP endpoint is the recommended ingestion path; in my trial validation, OTLP Gateway succeeded while Loki Push API returned 530	Use OTLP Gateway as default
2	Basic Auth = `base64(instanceId:apiToken)`, not plain text	Auth failures if wrong encoding
3	Loki / Grafana Cloud can reject old timestamps depending on tenant limits; in my validation, logs older than 7 days were rejected	Use current timestamps in test data
4	Grafana HTTP API needs a Grafana Service Account token, not the Grafana Cloud ingestion token used for OTLP writes	Dashboard creation fails with wrong token
5	OTLP-ingested logs use `service_name` label, not `job`	Different query syntax than Loki Push API
6	CloudFormation stack deletion order matters (API GW before Lambda)	DELETE_FAILED if wrong order

Verified Query Matrix

In this Grafana Cloud environment, service.name was exposed as the service_name index label via Loki's default OTLP attribute-to-label mapping. This mapping is configurable per tenant, so validate labels in your own environment if queries return unexpected results.

All queries tested with OTLP-ingested fields in this project's Grafana Cloud instance:

Query	Expected	Verified
`{service_name="fsxn-audit"}`	Audit logs visible	✅
`{service_name="fsxn-audit"} | json | Operation="delete"`
`{service_name="fsxn-audit"} | json | Result="Failure"`
`{service_name="fsxn-ems"} | json | event_name="arw.volume.state"`
`{service_name="fsxn-fpolicy"} | json | operation="create"`
`count_over_time({service_name="fsxn-audit"}[5m])`	Time series data	✅

Production and PoC Resources

For deeper validation and production planning:

Delivery Guarantee Patterns — Quickstart → Medium → Higher reliability → Multi-backend
Webhook Security Guide — Auth modes, Lambda authorizer, production baseline
Grafana Operations Guide — Alarms, tuning, poison-pill, ownership, compliance
CloudTrail Trigger Alternative — Event-driven alternative analysis
PoC Checklist — Go/No-Go criteria for stakeholder sign-off
Cost Model — Direct send vs Collector vs Firehose cost comparison
Alerting README — Provisioning details, thresholds, contact point caveats
Graduating to Alloy — Move from direct Lambda OTLP send to an Alloy-backed telemetry pipeline
Partner Solution Brief — Target customers, PoC scope, deliverables, and responsibility boundaries

What's Next

Part 7: Splunk HEC — serverless log delivery with built-in Firehose support
Elastic integration: Bulk API with date-based indices
Cost model refinement: validate the Cost Model with measured volume tiers from real-world FSx for ONTAP workloads

Series Navigation

Part 1: Why Your FSx for ONTAP Logs Deserve Better
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in with OTel Collector
Part 6: Direct-to-Grafana: Shipping Logs via OTLP Gateway (this post)

Questions about the Grafana Cloud integration or OTLP Gateway? Drop a comment below.

Previous: Part 5 — Escape Vendor Lock-in with OTel Collector

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP.

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 19 May 2026 09:10:53 +0000

TL;DR

We shipped the same FSx for ONTAP audit logs to three backends simultaneously — Datadog, Grafana Cloud, and Honeycomb — without changing a single line of Lambda code. The OpenTelemetry Collector sits between our Lambda and the backends as a routing layer. Adding or removing a backend is a YAML config change, not a code deployment.

Same audit logs → 3 backends simultaneously
Zero Lambda code changes between backends (SHA-256 verified)
OTel Collector as the vendor-neutral routing layer
All 3 event sources work: FSx audit logs via S3 Access Point, EMS webhooks, FPolicy file operations

What We're Building

In Part 2, we built a Lambda that speaks Datadog's API directly. It works great — but what happens when your security team wants Splunk, your SRE team wants Grafana, and your platform team is evaluating Honeycomb?

You'd need three separate Lambdas, each with vendor-specific formatting, auth, and retry logic. That's vendor lock-in expressed as infrastructure.

The Problem: Vendor-Specific APIs = Lock-in

Every observability vendor has their own wire format:

Vendor	Auth Header	Payload Format	Endpoint Pattern
Datadog	`DD-API-KEY: <key>`	Custom JSON schema	`https://http-intake.logs.{site}/api/v2/logs`
Splunk	`Authorization: Splunk <token>`	HEC `event` wrapper	`https://<host>:8088/services/collector/event`
Grafana Cloud	`Authorization: Basic <b64>`	OTLP	`https://otlp-gateway-prod-<region>.grafana.net/otlp`
Honeycomb	`x-honeycomb-team: <key>`	OTLP	`https://api.honeycomb.io`

If your Lambda speaks Datadog's API, switching to Grafana Cloud means rewriting your Lambda. That's the lock-in.

The Solution: OTLP as the Producer-to-Collector Contract

OpenTelemetry Protocol (OTLP) is the vendor-neutral producer-to-Collector contract. Our Lambda speaks OTLP — period. The OTel Collector handles routing, processing, and backend-specific export.

┌─────────────────────────────────────────────────────────────────────┐
│ AWS Account                                                         │
│                                                                     │
│  ┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Audit Logs   │────▶│ Lambda           │     │ OTel Collector  │  │
│  │ (via S3 AP)  │────▶│ (OTLP Shipper)   │────▶│ (Docker/Fargate)│  │
│  │ EMS/FPolicy  │────▶│                  │     │                 │  │
│  └──────────────┘     └──────────────────┘     └─┬──────┬──────┬─┘  │
│                                                  │      │      │    │
└──────────────────────────────────────────────────┼──────┼──────┼────┘
                                                   │      │      │
                                                   ▼      ▼      ▼
                                              Datadog  Grafana Honeycomb
                                               (AP1)    Cloud

The Lambda sends OTLP/HTTP to the Collector. The Collector fans out to any combination of backends. Adding Honeycomb? Add 5 lines of YAML. Dropping Datadog? Remove 4 lines. No Lambda redeployment.

Prerequisites

Before starting, you need:

FSx for ONTAP with audit logging configured (see Part 2 for setup)
Docker installed locally (Colima works — see troubleshooting for compose compatibility)
At least one backend account:
- Datadog: API key + site (e.g., ap1.datadoghq.com)
- Grafana Cloud: Instance ID + API token (Cloud Portal → OTLP)
- Honeycomb: Ingest API key (starts with hcaik_)
AWS account with Lambda deployment capability
Parts 1–4 context (recommended but not required — this integration works standalone)

FSx for ONTAP S3 Access Point note: The Lambda reads audit logs through an S3 Access Point attached to the FSx for ONTAP volume. Data remains on the FSx file system — it is not copied to a separate S3 bucket. S3 API throughput via FSx depends on the file system's provisioned throughput capacity, not standard S3 scaling. Validate FSx read throughput separately from Collector and backend ingest throughput.

The OTel Collector Configuration

The Collector config is the heart of this pattern. Here's the full verified configuration for multi-backend delivery:

# otel-collector-config.yaml
# ✅ VERIFIED WORKING (2026-05-18)
# Image: otel/opentelemetry-collector-contrib:0.152.0
# Backends: Grafana Cloud (ap-northeast-0) + Honeycomb

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  # memory_limiter:        # Recommended for production
  #   check_interval: 1s
  #   limit_mib: 512
  #   spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  otlp_http/grafana:
    endpoint: ${env:GRAFANA_OTLP_ENDPOINT}
    headers:
      Authorization: "Basic ${env:GRAFANA_BASIC_AUTH}"

  otlp_http/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
      x-honeycomb-dataset: ${env:HONEYCOMB_DATASET}

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_http/grafana, otlp_http/honeycomb]

Depending on your Honeycomb environment and dataset model, x-honeycomb-dataset may be optional or handled differently. Refer to your Honeycomb OTLP setup page for the recommended configuration.

This article uses otlp_http (the forward-compatible component name). If your Collector version does not recognize it, use the older otlphttp alias or upgrade the Collector.

Section Breakdown

Section	Purpose	Key Settings
`receivers.otlp`	Accepts OTLP/HTTP from Lambda	Port 4318 (OTLP standard)
`processors.batch`	Buffers logs before export	5s timeout OR 1000 records (whichever first)
`exporters.otlp_http/*`	Sends to each backend	Per-backend auth headers
`extensions.health_check`	Liveness probe	Port 13133 for `curl -f` checks
`service.pipelines`	Wires components together	logs: receiver → processor → exporters

Production note: This configuration is suitable for development and validation. For production, add retry_on_failure and sending_queue settings to exporters, configure memory_limiter processor, and consider persistent storage extensions. Without persistent buffering, telemetry in the Collector's in-memory batch can be lost during Collector restarts.

Adding Datadog as a Third Backend

To send to all three simultaneously, add the Datadog exporter:

exporters:
  # ... existing grafana + honeycomb exporters ...

  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE}

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_http/grafana, otlp_http/honeycomb, datadog]

That's it. Restart the Collector. Same Lambda, same OTLP payload, now three destinations.

For Datadog, this example uses the Collector's dedicated datadog exporter rather than generic otlp_http, because it handles Datadog-specific intake behavior, metadata mapping, and host tagging.

The Lambda Handler (OTLP Shipper)

Key Design Decisions

Why OTLP? — It gives the Lambda a single producer-to-Collector contract. The Collector then handles each backend's supported exporter or intake path. One format to maintain, not three.
Why no vendor SDK? — SDKs add cold start latency, dependency management, and vendor coupling. Pure urllib3 + JSON keeps the Lambda lean.
Why AUTH_MODE? — Different Collectors may need different auth. The Lambda supports none, basic, and bearer modes without code changes.
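To illustrate the AUTH_MODE decision, here is a minimal sketch of header selection for the three modes. The function name and parameter names are hypothetical; the actual handler may read these values from different environment variables or secrets.

```python
from base64 import b64encode
from typing import Dict


def build_headers(auth_mode: str, username: str = "", password: str = "",
                  token: str = "") -> Dict[str, str]:
    """Return OTLP/HTTP JSON request headers for the configured auth mode."""
    headers = {"Content-Type": "application/json"}
    if auth_mode == "none":
        return headers  # e.g. a Collector reachable only inside the VPC
    if auth_mode == "basic":
        raw = f"{username}:{password}".encode()
        headers["Authorization"] = "Basic " + b64encode(raw).decode()
        return headers
    if auth_mode == "bearer":
        headers["Authorization"] = f"Bearer {token}"
        return headers
    raise ValueError(f"Unsupported AUTH_MODE: {auth_mode!r}")
```

Failing fast on an unknown mode avoids silently sending unauthenticated requests when the environment variable is mistyped.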

Field Mapping: FSx ONTAP → OTLP Attributes

The Lambda maps FSx ONTAP audit fields to semantic OTLP attribute keys:

FSx ONTAP Field	OTLP Attribute Key	Example Value
`EventID`	`event.type`	`4663`
`UserName`	`user.name`	`admin@corp.local`
`ClientIP`	`client.address`	`10.0.1.50`
`Operation`	`fsxn.operation`	`ReadData`
`ObjectName`	`fsxn.path`	`/vol/data/reports/q4.xlsx`
`Result`	`fsxn.result`	`Success`
`SVMName`	`fsxn.svm`	`svm-prod-01`
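
As a minimal sketch of how the table above could be applied in code, the snippet below converts one audit record into OTLP key/value attributes. The names `FIELD_MAP` and `map_attributes` are illustrative assumptions, not the repository's actual `map_log_record` implementation, and every value is emitted as a `stringValue` for simplicity.

```python
from typing import Any

# Source field -> OTLP attribute key, copied from the mapping table above.
FIELD_MAP = {
    "EventID": "event.type",
    "UserName": "user.name",
    "ClientIP": "client.address",
    "Operation": "fsxn.operation",
    "ObjectName": "fsxn.path",
    "Result": "fsxn.result",
    "SVMName": "fsxn.svm",
}


def map_attributes(log: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert known audit fields into OTLP key/value attributes (hypothetical helper)."""
    attributes = []
    for source_field, otlp_key in FIELD_MAP.items():
        value = log.get(source_field)
        if value is None:
            continue  # skip fields that are absent from this record
        attributes.append({"key": otlp_key, "value": {"stringValue": str(value)}})
    return attributes
```

Fields that are not in the table are ignored, so an unexpected audit field never produces an unplanned attribute key.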

The examples above focus on S3 audit logs because they are the highest-volume path. The same OTLP shipper pattern is reused for EMS webhook events and FPolicy file operations using source-specific field mappers (ems_handler.py, fpolicy_handler.py), while preserving the same Collector-facing OTLP contract. For EMS and FPolicy, source-specific service names are used (fsxn-ems, fsxn-fpolicy) to distinguish event sources in the backend.

Resource-level attributes (set once per payload, not per log record):

Attribute	Value	Purpose
`service.name`	`fsxn-audit`	Service identification
`cloud.provider`	`aws`	Cloud context
`cloud.platform`	`aws_fsx`	Platform context

cloud.platform=aws_fsx is a project-specific value used to identify FSx for ONTAP as the data source. It is not part of the OpenTelemetry semantic conventions standard cloud.platform values (which include aws_ec2, aws_ecs, aws_eks, aws_lambda, etc.).

Severity Determination Logic

The Lambda determines OTLP severity from the Result field:

from typing import Optional

WARN_KEYWORDS = ("fail", "denied", "error")

def determine_severity(result: Optional[str]) -> tuple[int, str]:
    """Determine OTLP severity from FSx ONTAP Result field."""
    if not result:
        return (9, "INFO")
    lower = result.lower()
    for keyword in WARN_KEYWORDS:
        if keyword in lower:
            return (13, "WARN")
    return (9, "INFO")

This means failed access attempts (Result: "Failure") automatically get severityNumber: 13 (WARN), making them easy to filter in any backend.

The Lambda sets both severityNumber and severityText according to the OpenTelemetry Logs Data Model severity level definitions.

OTLP Payload Construction

from typing import Any

def build_otlp_payload(
    logs: list[dict[str, Any]],
    service_name: str,
    source_key: str,
) -> dict[str, Any]:
    """Build OTLP Log Data Model payload."""
    log_records = [map_log_record(log) for log in logs]

    return {
        "resourceLogs": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                    {"key": "cloud.provider", "value": {"stringValue": "aws"}},
                    {"key": "cloud.platform", "value": {"stringValue": "aws_fsx"}},
                ]
            },
            "scopeLogs": [{
                "scope": {"name": "fsxn-otel-shipper", "version": "1.0.0"},
                "logRecords": log_records,
            }],
        }]
    }

No vendor SDK. No vendor-specific formatting. Just the OTLP Log Data Model.

Retry with Exponential Backoff

import json
import random
import time

import urllib3

MAX_RETRIES = 3
BASE_INTERVAL = 2  # seconds

# Module-level pool so warm invocations reuse connections.
# (Assumed definition: the excerpt uses `http` without showing where it is created.)
http = urllib3.PoolManager()

def _send_otlp_payload(payload, endpoint, auth_headers=None) -> bool:
    """Send OTLP payload via HTTP POST with retry logic.

    Retries on HTTP 429 and 5xx. Does not retry on 4xx (except 429).
    Exponential backoff between attempts: 2s, then 4s, each plus up to 1s of jitter.
    """
    url = f"{endpoint}/v1/logs"
    headers = {"Content-Type": "application/json"}
    if auth_headers:
        headers.update(auth_headers)

    json_body = json.dumps(payload).encode("utf-8")

    for attempt in range(MAX_RETRIES):
        response = http.request("POST", url, body=json_body, headers=headers, timeout=30.0)

        if response.status < 300:
            return True
        if response.status != 429 and response.status < 500:
            # Client error (4xx except 429): retrying will not help
            return False
        if attempt < MAX_RETRIES - 1:
            # Back off only when another attempt remains
            wait_time = BASE_INTERVAL * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)
    return False

AUTH_MODE Support

The Lambda supports three authentication modes via the AUTH_MODE environment variable:

AUTH_MODE	Behavior	Use Case
`none`	No auth headers sent	Local Collector (no auth needed)
`basic`	`Authorization: Basic <base64(token)>`	Grafana Cloud direct
`bearer`	`Authorization: Bearer <token>`	Generic OTLP endpoints

When using the Collector pattern, including the multi-backend setup, set AUTH_MODE=none on the Lambda; the Collector handles backend credentials via its own config.

The direct auth modes (basic, bearer) are useful for testing or for bypassing the Collector.
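
The three modes in the table can be sketched as a small header builder. This is an illustrative assumption about how such a switch might look, not the repository's handler code; the function name `build_auth_headers` and its `token` argument are hypothetical.

```python
import base64


def build_auth_headers(auth_mode: str, token: str = "") -> dict[str, str]:
    """Return HTTP auth headers for an AUTH_MODE value (hypothetical helper)."""
    mode = auth_mode.strip().lower()
    if mode not in ("none", "basic", "bearer"):
        raise ValueError(f"Unsupported AUTH_MODE: {auth_mode!r}")
    if mode == "none":
        return {}  # Collector pattern: the Collector holds backend credentials
    if not token:
        raise ValueError(f"AUTH_MODE={mode} requires a token")
    if mode == "basic":
        # Assumes the token is the raw "user:password" style string to encode
        encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {encoded}"}
    return {"Authorization": f"Bearer {token}"}
```

Rejecting unknown modes up front means a typo in the environment variable fails fast instead of silently sending unauthenticated requests.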

Deployment

Local Development: Docker Run

# 1. Configure credentials
cd integrations/otel-collector
cp .env.example .env
# Edit .env with your backend credentials:
#   GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
#   GRAFANA_BASIC_AUTH=<base64(instanceId:apiToken)>
#   HONEYCOMB_API_KEY=hcaik_<your-ingest-key>
#   HONEYCOMB_DATASET=fsxn-audit

# 2. Start OTel Collector
docker run -d --name otel-collector \
  -p 4318:4318 -p 13133:13133 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  --env-file .env \
  otel/opentelemetry-collector-contrib:0.152.0

# 3. Verify health
curl -f http://localhost:13133/
# Expected: HTTP 200 — {"status":"Server available", ...}

The health_check extension confirms the Collector process is available; it does not guarantee that each backend exporter is successfully delivering logs. Monitor exporter errors separately using the Collector's internal telemetry metrics if enabled and exposed.

# 4. Send a test payload
bash scripts/generate-otlp-payload.sh --output /tmp/payload.json
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d @/tmp/payload.json

Colima users: Colima does not install the docker compose v2 plugin by default. All scripts in this repo detect a missing plugin and fall back to docker run, so an error such as "docker compose: command not found" is expected behavior.

First Success Path

If you're trying this for the first time, start small:

Run the Collector locally with one backend.
Send one fresh OTLP payload.
Confirm the event appears in that backend.
Add the second exporter.
Only then move to multi-backend or AWS deployment.

This keeps the first validation focused on the producer-to-Collector contract before introducing backend parity and production networking.

AWS Deployment: CloudFormation

aws cloudformation deploy \
  --template-file integrations/otel-collector/template.yaml \
  --stack-name fsxn-otel-integration \
  --parameter-overrides \
    S3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    OtlpEndpoint=http://<your-collector-endpoint>:4318 \
    ApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-otel-key-XXXXXX \
    AuthMode=none \
  --capabilities CAPABILITY_IAM \
  --region ap-northeast-1

This template deploys the Lambda-side OTLP shipper. The Collector endpoint must already be reachable from the Lambda — for example, a local Collector for development, an EC2-hosted Collector, or an ECS/Fargate-based Collector in the same VPC. If the Lambda is in a VPC, ensure security groups allow outbound TCP 4318 to the Collector. See the repository's VPC Deployment Guide and Security Hardening Guide for production Collector deployment.

When the Collector handles auth, set AuthMode=none on the Lambda. The Collector config contains the per-backend credentials via environment variables (sourced from .env or Secrets Manager in production).

Environment Variables

Variable	Lambda	Collector	Description
`OTLP_ENDPOINT`	✅	—	Collector URL (e.g., `http://collector:4318`)
`AUTH_MODE`	✅	—	`none` / `basic` / `bearer`
`SERVICE_NAME`	✅	—	OTLP `service.name` attribute
`GRAFANA_OTLP_ENDPOINT`	—	✅	Grafana Cloud OTLP gateway URL
`GRAFANA_BASIC_AUTH`	—	✅	base64(instanceId:apiToken)
`HONEYCOMB_API_KEY`	—	✅	Ingest key (hcaik_...)
`HONEYCOMB_DATASET`	—	✅	Dataset name
`DD_API_KEY`	—	✅	Datadog API key
`DD_SITE`	—	✅	Datadog site (`datadoghq.com`, `datadoghq.eu`, `ap1.datadoghq.com`, etc.)

Verified Results

All backends were tested on 2026-05-18 using otel/opentelemetry-collector-contrib:0.152.0:

Backend	Region/Site	Status	Event Sources	Auth Method
Datadog	ap1.datadoghq.com	✅ Verified	S3 audit + EMS + FPolicy	Datadog exporter (`DD-API-KEY`)
Grafana Cloud	ap-northeast-0	✅ Verified	S3 audit + EMS + FPolicy	Basic Auth via `otlp_http`
Honeycomb	—	✅ Verified	S3 audit + EMS + FPolicy	`x-honeycomb-team` via `otlp_http`
Multi-Backend	Grafana + Honeycomb	✅ Verified	Simultaneous delivery	Both auth methods
Multi-Backend	Datadog + Grafana + Honeycomb	✅ Verified	Simultaneous 3-way delivery	All three exporters

All three backends received the same structured attributes:

event.type, user.name, client.address
fsxn.operation, fsxn.path, fsxn.result, fsxn.svm
cloud.provider=aws, cloud.platform=aws_fsx

OTLP standardizes the producer-to-Collector contract, but backend-specific indexing, query semantics, and retention behavior still need to be validated per destination. OpenTelemetry is not a backend — it defines APIs, protocols, and Collector components for telemetry generation, collection, processing, and export. Storage, visualization, and alerting are handled by the backends themselves. See the Backend Parity Matrix and PoC Checklist for backend-specific validation details.

The Proof: Zero Code Changes

Here's the key evidence. The Lambda handler's SHA-256 hash is identical regardless of which backend receives the logs:

$ shasum -a 256 integrations/otel-collector/lambda/handler.py
# Same hash whether targeting Datadog, Grafana Cloud, or Honeycomb
# The file never changes — only the Collector config does

What changes between backends? Only the OTel Collector config file.
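
If you prefer to script the comparison rather than eyeball `shasum` output, a minimal Python sketch is shown below. The helper names are hypothetical, and the recorded hash is assumed to come from your own earlier run.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def handler_unchanged(path: str, recorded_hash: str) -> bool:
    """Compare the current file hash with a previously recorded one."""
    return file_sha256(path) == recorded_hash.strip().lower()
```

Recording the hash before a backend change and calling `handler_unchanged` afterwards gives a repeatable check that only the Collector config was touched.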

Demonstration: Adding a Backend

Starting state: Grafana Cloud only.

# Before: single backend
service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana]

Adding Honeycomb:

# After: add 5 lines to exporters section + update pipeline
exporters:
  otlp_http/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
      x-honeycomb-dataset: ${env:HONEYCOMB_DATASET}

service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana, otlp_http/honeycomb]

Restart the Collector. Done. No Lambda redeployment, no code review, no CI/CD pipeline for the shipper.

Demonstration: Removing a Backend

Dropping Datadog during a migration to Grafana Cloud:

# Remove from exporters list — that's it
service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana]  # removed: datadog

Troubleshooting

Timestamp Rejection / Static Payload Gotcha

Datadog documents that logs older than 18 hours are dropped at intake (Datadog Logs API docs). Other backends may also reject or hide events with timestamps outside their accepted windows. In my testing, future timestamps also caused ingestion issues on some backends. When testing with static payloads, always generate fresh timestamps.

Fix: Use the payload generator to create fresh timestamps:

bash scripts/generate-otlp-payload.sh --output /tmp/fresh-payload.json
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d @/tmp/fresh-payload.json
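
If you already have a static payload file and want to reuse it, one option is to rewrite its timestamps before sending. The sketch below is an assumption about how that could be done in Python and is not the repository's `generate-otlp-payload.sh`; the function name `refresh_timestamps` is hypothetical. It relies on the OTLP/JSON convention that 64-bit integers such as `timeUnixNano` are encoded as strings.

```python
import copy
import time
from typing import Any, Optional


def refresh_timestamps(payload: dict[str, Any], now_ns: Optional[int] = None) -> dict[str, Any]:
    """Return a copy of an OTLP/JSON logs payload with every record stamped 'now'."""
    # OTLP/JSON encodes 64-bit integers as decimal strings
    stamp = str(time.time_ns() if now_ns is None else now_ns)
    fresh = copy.deepcopy(payload)  # leave the caller's payload untouched
    for resource_logs in fresh.get("resourceLogs", []):
        for scope_logs in resource_logs.get("scopeLogs", []):
            for record in scope_logs.get("logRecords", []):
                record["timeUnixNano"] = stamp
                record["observedTimeUnixNano"] = stamp
    return fresh
```

Passing an explicit `now_ns` keeps the function deterministic in tests; in normal use the default picks the current time.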

Grafana Cloud Auth Format

The loki exporter is NOT the correct approach for OTLP → Grafana Cloud.

❌ loki exporter with Loki push API
✅ otlp_http/grafana with OTLP gateway endpoint

The Basic Auth value must be base64(instanceId:apiToken):

# Generate the auth value
echo -n "<your-instance-id>:<your-grafana-cloud-api-token>" | base64

Where the instance ID is your numeric Grafana Cloud instance ID (found in Cloud Portal → OTLP configuration).

Honeycomb Key Types

Honeycomb has two key types. Only ingest keys work for data ingestion:

Key Prefix	Type	Works for OTLP?
`hcaik_`	Ingest API key	✅ Yes
`hcxik_`	Environment key	❌ No

If you see 401 Unauthorized from Honeycomb, check your key prefix.

Colima Docker Compose Compatibility

The docker compose v2 plugin is not installed by default in Colima environments. All scripts in this repository detect this automatically and fall back to docker run. This is expected, not an error.

If you need compose-like orchestration on Colima, use the explicit docker run commands shown in the Deployment section.

Common Mistake: loki Exporter vs otlp_http

A frequent misconfiguration when targeting Grafana Cloud:

# ❌ WRONG — loki exporter uses Loki-specific push API
exporters:
  loki:
    endpoint: https://logs-prod-<region>.grafana.net/loki/api/v1/push

# ✅ CORRECT — otlp_http uses the OTLP gateway
exporters:
  otlp_http/grafana:
    endpoint: https://otlp-gateway-prod-<region>.grafana.net/otlp

The OTLP gateway is Grafana Cloud's native OTLP ingestion endpoint. It handles logs, metrics, and traces through a single URL.

Cost Model: How to Think About It

Lambda Cost (OTLP Path vs Direct Send)

In my validation, the OTLP Lambda was simpler and shorter-lived than the vendor-specific direct-send path. Your duration will vary depending on batching, payload size, network path, and backend response time.

Component	Direct Send (Part 2)	OTLP + Collector
Lambda complexity	Vendor formatting + HTTP + retry	OTLP POST to nearby Collector
Lambda memory	256MB	256MB
Vendor SDK deps	Yes (adds cold start)	None
Retry complexity	Per-vendor	Delegated to Collector

OTel Collector Cost

The Collector introduces a fixed infrastructure cost that is independent of event volume:

Deployment	Best For
Docker on local machine	Development, testing
Docker on EC2 Spot (t3.small)	Low-volume production
ECS Fargate (0.5 vCPU, 1GB)	Production (no OS management)
ECS Fargate + NAT Gateway	VPC-internal production

When to Use Each Pattern

Scenario	Recommendation
Single vendor, low volume	Direct Send (Part 2 pattern) — no Collector overhead
Single vendor, high volume	Collector (buffering + backpressure benefits)
Multi-vendor evaluation	Collector (add/remove exporters freely)
Vendor migration in progress	Collector (parallel delivery during cutover)
Compliance: logs in multiple systems	Collector (fan-out is a config change)

Because the Collector's infrastructure cost is fixed, the Collector path becomes more cost-effective as volume increases or vendors multiply: it processes each payload once and fans out outside the Lambda. Direct-send can also fan out within one Lambda, but that pushes vendor-specific formatting, retry behavior, and failure isolation back into application code.

Important: Backend ingest/retention costs are not included in these AWS-side estimates. Datadog, Grafana Cloud, and Honeycomb each have their own pricing models that can become the dominant cost at scale.

When to Use This Pattern

Multi-Vendor Evaluation

Want to try Honeycomb for a month alongside your existing Datadog setup? Add one exporter to the Collector config. No Lambda redeployment. No risk to your existing pipeline.

Compliance: Logs in Multiple Systems

Some organizations require audit logs in multiple systems — security team uses Splunk, dev team uses Datadog, compliance team needs a cold archive. The Collector fans out to all simultaneously from a single OTLP stream.

Migration Between Vendors

Moving from Datadog to Grafana Cloud? Run both exporters in parallel during migration. Verify data parity in the new system. Remove the old exporter when satisfied. Zero-downtime vendor migration.

Cost Optimization: Route by Volume

Use the Collector's routing and filtering components (for example, the routing connector or the filter processor in the contrib distribution) to send high-volume noisy logs (read operations) to a cheaper backend while keeping security-critical events (deletes, permission changes) on a premium platform with alerting.

What's Next

For production hardening, the repository includes guides covering VPC deployment, health monitoring, persistent buffering, security hardening, and benchmarking. Auto-scaling and Multi-AZ deployment are natural next steps for production Collector operations.

For production and partner-led deployments, the repository includes:

Architecture Decision Record
VPC Deployment Guide — private networking, security groups, and Collector reachability from Lambda
Config Governance Guide
Security Hardening Guide
Operations Guide
Cost Model
PoC Checklist
Routing and Filtering Examples
Compliance Evidence Note
Migration Guide — zero-downtime migration from direct-send to the Collector path
OTel Semantic Mapping Guide — standard vs project-specific attributes, schema evolution, and what OTLP does not solve
Backend Parity Matrix — visibility and query behavior across Datadog, Grafana Cloud, and Honeycomb
Glossary / 用語集 — English/Japanese OTel terminology used in this project
Enterprise Workload Addendum — SAP, VMware, and mission-critical workload considerations
Storage Service Selection Note — when to use FSx for ONTAP, Amazon S3, Amazon EFS, and Amazon EBS

Key Takeaways

OTLP is the stable producer contract. Your Lambda speaks one protocol; the Collector handles backend-specific exporters.
OTel Collector is the routing and processing layer that decouples log producers from observability backends.
Zero Lambda code changes when switching or adding backends — verified with SHA-256 hash comparison.
Multi-backend delivery is a config change, not a code change. Add 5 lines of YAML, restart the Collector.
All three FSx ONTAP event sources work: FSx audit logs via S3 Access Point (Part 2), EMS webhooks (Part 3), and FPolicy file operations (Part 4).
Collector economics improve as volume increases or vendors multiply — fixed Collector cost is amortized across all destinations.
Start with direct send (Part 2) for simplicity. Graduate to the Collector when you need multi-backend, vendor migration, or volume-based routing.

Series Navigation

Part 1: Why Your FSx for ONTAP Logs Deserve Better
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in with OTel Collector (this post)

Questions about the OTel Collector pattern or multi-backend delivery? Drop a comment below.

Previous: Part 4 — FPolicy File Activity Pipeline

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Mon, 18 May 2026 02:31:34 +0000

TL;DR

ONTAP FPolicy pushes file operation notifications over a persistent TCP connection. We run a lightweight Python server on ECS Fargate that receives these events, normalizes them, and forwards them to SQS → Lambda → Datadog. In my validation environment, create events reached Datadog in about 6 seconds. Rename/delete behavior depends on FPolicy mode, protocol, and FSx for ONTAP behavior, so this post documents both the working path and the limitations observed.

Update — production hardening path
This article remains the Datadog-specific introduction to the FPolicy file activity pipeline. Since publishing it, the repository has been expanded with production-readiness guidance, governance and security review checklists, sample payloads, CI policy, cfn-guard rules, and shared Python helpers for observability and idempotent object processing.

For production planning, start from the repository README:

Choose Your Path
Recommended first 30 minutes
Production Readiness Levels
PoC Success Criteria
Security Review Checklist
Governance and Compliance Guide
CI Policy

The FPolicy pattern has also been expanded with Persistent Store guidance, idempotent object processing, EventBridge dispatch, and a hybrid polling/event-driven migration path. This Part 4 article focuses on the Datadog delivery path; the repository now documents the broader production baseline.

Why FPolicy Needs Fargate

In Part 3, we showed how EMS webhooks deliver ARP alerts via API Gateway → Lambda. That works because EMS uses standard HTTPS.

FPolicy is different. ONTAP's FPolicy subsystem uses a proprietary binary protocol over persistent TCP connections. ONTAP initiates the connection to the FPolicy server and maintains it with periodic KeepAlive messages. This means:

❌ Lambda — No persistent TCP connections, max 15-minute timeout
❌ API Gateway — HTTP/HTTPS only, no raw TCP
✅ ECS Fargate — Persistent TCP listener, private IP, auto-restart

Why I Did Not Use an NLB in This Validation

I tested an NLB-based approach, but it did not work reliably in my validation. The issue was not that NLB cannot forward binary TCP traffic; it can. The challenge was FPolicy's stateful session negotiation and ONTAP's expectation of configured FPolicy server IPs. Health checks and connection behavior introduced additional complexity. For this validation, the simplest reliable path was to let ONTAP connect directly to the Fargate task's private IP and automate external-engine IP updates on task restart.

The Fargate task runs a Python server that:

Listens on TCP:9898
Handles FPolicy protocol negotiation (version handshake)
Receives KeepAlive messages (connection health)
Parses file operation notifications
Forwards structured events to SQS

Architecture

SMB/NFS Client
    │ file create/write/rename/delete
    ▼
FSx for ONTAP (FPolicy enabled)
    │ proprietary TCP protocol
    ▼
ECS Fargate (TCP:9898)
    │ parse → normalize → forward
    ▼
SQS Queue
    │ event source mapping
    ▼
Lambda (fpolicy_handler)
    │ format → ship
    ▼
Datadog Logs API v2 (source:fsxn-fpolicy)

Key design decisions:

ONTAP connects TO Fargate — the Fargate task must be reachable on a private IP. Because that IP can change on task restart, the ONTAP external engine must be updated automatically or operationally.
SQS decouples the TCP server from the shipping logic — if Datadog is slow, events buffer in SQS
Lambda handles Datadog shipping — retry logic, batch formatting, API key management
No NLB — ONTAP connects directly to the Fargate task's private IP

Production Boundary: Why FPolicy Needs More Than Lambda

The audit-log and EMS paths are natural fits for Lambda:

Audit logs are file/object reads through the FSx for ONTAP S3 Access Point read path
EMS events are HTTPS webhook payloads

FPolicy is different. ONTAP FPolicy uses a persistent TCP connection to an external FPolicy server. That makes it a poor fit for API Gateway + Lambda as the first receiver.

This is why the production-oriented path is:

ONTAP FPolicy
  → ECS Fargate TCP listener
  → SQS
  → Lambda shipper
  → Datadog

## Deployment

### Prerequisites

- FSx for ONTAP file system with a CIFS-enabled SVM
- VPC with private subnets (same as FSx for ONTAP)
- ECR repository with the FPolicy server image
- Private subnet egress for Fargate: either a NAT Gateway or VPC endpoints for ECR image pull, CloudWatch Logs, and SQS access

### Step 1: Deploy the Fargate Stack

```bash
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId= \
    SubnetIds= \
    FsxnSvmSecurityGroupId= \
    ContainerImage=.dkr.ecr..amazonaws.com/fsxn-fpolicy-server:latest \
  --capabilities CAPABILITY_NAMED_IAM
```


This creates:
- ECS Cluster + Fargate Service (1 task)
- SQS Queue for FPolicy events
- Security Group (inbound TCP:9898 from FSx SG)
- CloudWatch Log Group

### Step 2: Deploy the Datadog Shipping Lambda

The template accepts the SQS queue ARN as a parameter and automatically creates the event source mapping:

```bash
# Get the SQS queue ARN from Step 1 outputs
SQS_ARN=$(aws cloudformation describe-stacks \
  --stack-name fsxn-fpolicy-server \
  --query "Stacks[0].Outputs[?OutputKey=='FPolicyQueueArn'].OutputValue" \
  --output text)

aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn= \
    DatadogSite=ap1.datadoghq.com \
    FPolicySqsQueueArn=${SQS_ARN} \
  --capabilities CAPABILITY_NAMED_IAM
```


This creates the Lambda function with an SQS event source mapping — no manual `create-event-source-mapping` needed.
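
On the Lambda side, each invocation receives a batch of SQS records whose `body` holds the JSON event. The following is a minimal sketch of decoding such a batch, not the repository's `fpolicy_handler`. The function names are hypothetical, and returning `batchItemFailures` only has an effect if the event source mapping enables partial batch responses (`ReportBatchItemFailures`), which is an assumption about the template.

```python
import json
from typing import Any


def parse_sqs_event(event: dict[str, Any]) -> tuple[list[dict[str, Any]], list[dict[str, str]]]:
    """Split an SQS-triggered Lambda event into decoded events and failed message IDs."""
    events: list[dict[str, Any]] = []
    failures: list[dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            events.append(json.loads(record["body"]))
        except (KeyError, TypeError, json.JSONDecodeError):
            # Report only this message as failed instead of failing the whole batch
            failures.append({"itemIdentifier": record.get("messageId", "")})
    return events, failures


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    events, failures = parse_sqs_event(event)
    # Shipping `events` to the backend would happen here (omitted in this sketch).
    return {"batchItemFailures": failures}
```

Messages that fail to decode are returned to the queue and, with a dead-letter queue configured, eventually land there for inspection.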

### Step 3: Get the Fargate Task IP

```bash
TASK_ARN=$(aws ecs list-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --service-name fsxn-fpolicy-server-service \
  --query "taskArns[0]" --output text)

aws ecs describe-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --tasks "$TASK_ARN" \
  --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" \
  --output text
```


## ONTAP FPolicy Configuration

> **CLI note**: Some ONTAP versions show these commands under `vserver fpolicy ...`, while newer CLI contexts may allow shortened forms. Use the command form supported by your ONTAP version. The examples below use the form validated in my environment (FSx for ONTAP 9.17.1). See [NetApp CLI reference](https://docs.netapp.com/us-en/ontap-cli-9151/vserver-fpolicy-policy-external-engine-create.html) for the full command syntax.

FPolicy requires three components: an External Engine (where to send events), an Event (what to monitor), and a Policy (linking them together).

### Create the External Engine

```shell
vserver fpolicy policy external-engine create -vserver \
  -engine-name fpolicy_aws_engine \
  -primary-servers \
  -port 9898 \
  -extern-engine-type asynchronous \
  -ssl-option no-auth
```


> **Production note**: For production deployments, evaluate `server-auth` or `mutual-auth` instead of `no-auth`, and validate certificate handling between ONTAP and the FPolicy server. See [NetApp FPolicy external engine documentation](https://docs.netapp.com/us-en/ontap/nas-audit/create-fpolicy-external-engine-task.html).

### Create the FPolicy Event

```shell
vserver fpolicy policy event create -vserver \
  -event-name cifs_file_events \
  -protocol cifs \
  -file-operations create,write,rename,delete
```


> **Tip**: For write-heavy workloads, review the protocol-specific FPolicy filters supported by your ONTAP version and protocol. Where supported, use close/modify-oriented filters to reduce duplicate or noisy write events.

### Create and Enable the Policy

```shell
vserver fpolicy policy create -vserver \
  -policy-name fpolicy_aws \
  -events cifs_file_events \
  -engine fpolicy_aws_engine \
  -is-mandatory false

vserver fpolicy enable -vserver \
  -policy-name fpolicy_aws \
  -sequence-number 1
```


This example uses an asynchronous, non-mandatory policy so client file operations are not blocked by FPolicy server processing or Datadog delivery. If the FPolicy server is unavailable, file operations continue unimpeded — but notifications may be buffered or lost depending on your ONTAP version and configuration.

### Verify Connection

```shell
vserver fpolicy show-engine -vserver -engine-name fpolicy_aws_engine
```


You should see `connected` status. In the ECS logs, KeepAlive messages confirm the connection:

```console
[INFO] fpolicy-server: [+] Connection from ('10.0.x.x', 44107)
[INFO] fpolicy-server: [Handshake] Policy=fpolicy_aws | Session=... | VsUUID=...
[INFO] fpolicy-server: [Send] NEGO_RESP | Version=1.2 | Policy=fpolicy_aws
[INFO] fpolicy-server: [KeepAlive] Received — connection healthy
```


## E2E Validation Results

File operations on the SMB share produce events that flow through the entire pipeline:

| Operation | ECS Log | SQS | Lambda | Datadog | Latency |
|-----------|---------|-----|--------|---------|---------|
| create `blog_demo_create.txt` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |
| create `blog_demo_write.txt` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |
| create `confidential_report_2026.xlsx` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |

### ECS Fargate Logs — Connection Lifecycle

The FPolicy server logs show the complete lifecycle: server start → ONTAP connection → protocol handshake → KeepAlive → file events → SQS delivery.

![ECS Fargate CloudWatch Logs](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/aws-ecs-fpolicy-logs.png)

### Lambda CloudWatch Logs — Event Processing

Each SQS message triggers a Lambda invocation. Processing time is typically 30-50ms per event.

![Lambda CloudWatch Logs](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/aws-lambda-fpolicy-logs.png)

### Datadog Log Explorer

Query: `source:fsxn-fpolicy`

Each event contains structured attributes:
- `operation_type`: The file operation (create, write, rename, delete)
- `file_path`: The file that was operated on
- `client_ip`: The client that performed the operation
- `volume_name`: The ONTAP volume
- `svm`: The ONTAP SVM name (may show "unknown" if not resolved from handshake context)
- `timestamp`: When the operation occurred

![FPolicy events in Datadog Log Explorer](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/datadog-fpolicy-full-path.png)

![FPolicy event detail — structured attributes visible in the side panel](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/datadog-fpolicy-detail.png)
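
To make the attribute list above concrete, here is a minimal sketch of building one normalized event as a JSON message body. It is not the repository's server code: the function name and the input keys (`operation`, `path`, `client_ip`, `volume`, `timestamp`) are hypothetical, and only the output attribute names come from this article.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional


def normalize_event(raw: dict[str, Any], svm: Optional[str] = None) -> str:
    """Build the JSON body for one normalized FPolicy event (illustrative only)."""
    event = {
        "operation_type": str(raw.get("operation", "unknown")).lower(),
        "file_path": raw.get("path", ""),
        "client_ip": raw.get("client_ip", ""),
        "volume_name": raw.get("volume", ""),
        # Mirrors the note above: fall back to "unknown" when the SVM is not resolved
        "svm": svm or "unknown",
        "timestamp": raw.get("timestamp") or datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True)
```

Keeping the output keys stable is what lets the Datadog queries in the next section reference `@attributes.operation_type` and `@attributes.client_ip` consistently.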

## Correlating FPolicy with ARP

The real power emerges when you combine FPolicy file activity with ARP ransomware detection from Part 3:

```plaintext
source:(fsxn-fpolicy OR fsxn-ems) @attributes.svm:svm-prod-01
```


This correlation query shows:
1. **ARP alert** (from EMS): "Ransomware detected on volume X"
2. **File operations** (from FPolicy): Which user, from which IP, created/renamed which files

Together they answer the critical incident response questions: *What happened, who did it, and from where?*

### Security Use Case: Detecting Suspicious File Creation Bursts

With FPolicy create events in Datadog, you can create a Monitor that fires when a single client creates more than 50 files in 5 minutes — a potential indicator of ransomware encryption or unauthorized bulk operations:

**Datadog Monitor query:**

```plaintext
logs("source:fsxn-fpolicy @attributes.operation_type:create").rollup("count").by("@attributes.client_ip").last("5m") > 50
```


**Alert message:**

```plaintext
🚨 Suspicious file creation burst detected on FSx for ONTAP

Client IP: {{@attributes.client_ip}}
Volume: {{@attributes.volume_name}}
Count: {{value}} file creations in 5 minutes

Investigate immediately: check if this is authorized batch processing or potential ransomware activity.
```


> **Note on delete monitoring**: If your FPolicy configuration and ONTAP version reliably deliver delete events (e.g., synchronous mode or a future ONTAP release), you can extend this pattern to bulk deletion detection. In my async-mode validation, delete notifications were not reliably delivered — I recommend using audit logs from [Part 2](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c) for delete-event completeness.

This is difficult to achieve with traditional audit log polling, which depends on rotation and scheduler intervals. FPolicy's event-driven delivery makes sub-minute detection possible for the operations it reliably captures.
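
The threshold logic behind the monitor (more than 50 creates per client in 5 minutes) can be illustrated offline with a small sliding-window counter. This is only a sketch for reasoning about the rule or replaying sample events; it does not reproduce how Datadog evaluates monitors, the class name is hypothetical, and it assumes timestamps arrive in non-decreasing order per client.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class CreateBurstDetector:
    """Count create events per client inside a sliding time window."""

    def __init__(self, threshold: int = 50, window_seconds: float = 300.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, client_ip: str, timestamp: float) -> bool:
        """Record one create event; return True when the client exceeds the threshold."""
        window = self._events[client_ip]
        window.append(timestamp)
        # Drop events that have fallen out of the sliding window
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.threshold
```

Replaying a captured event stream through `observe` is a cheap way to tune the threshold against known batch jobs before enabling the alert.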

## Operational Considerations

### Fargate Task IP Changes

When a Fargate task restarts (deployment, crash, scaling), it gets a new private IP. ONTAP's External Engine must be updated with the new IP. Options:

1. **Manual update**: `vserver fpolicy policy external-engine modify -primary-servers <new-ip>`
2. **Automated**: Lambda triggered by ECS task state change → ONTAP REST API update

The repository includes a helper script (`shared/scripts/fpolicy-update-engine-ip.sh --auto`) that detects the current task IP and updates the ONTAP engine. For full automation, wire an EventBridge rule on ECS task state changes to an update Lambda — this is not included in the base stack but is straightforward to add. Automated updates require network reachability to the ONTAP management endpoint and credentials (stored in Secrets Manager) with permission to modify the FPolicy external engine.
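
For the automated path, the first step an update Lambda needs is the new task IP. The sketch below extracts it from an ECS `describe-tasks` style response using the same field path as the CLI query in Step 3. The function names are hypothetical, and the call that actually updates the ONTAP external engine is deliberately left out because it depends on your ONTAP credentials and API access.

```python
from typing import Any, Optional


def extract_task_private_ip(describe_tasks_response: dict[str, Any]) -> Optional[str]:
    """Return the first private IPv4 address found on the first task, or None."""
    tasks = describe_tasks_response.get("tasks", [])
    if not tasks:
        return None
    for container in tasks[0].get("containers", []):
        for interface in container.get("networkInterfaces", []):
            address = interface.get("privateIpv4Address")
            if address:
                return address
    return None


def engine_needs_update(current_engine_ip: str, task_ip: Optional[str]) -> bool:
    """Only update the external engine when a task IP exists and differs."""
    return task_ip is not None and task_ip != current_engine_ip
```

Checking `engine_needs_update` first avoids unnecessary modify calls against ONTAP when a task state change does not alter the IP.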

### Restart Resilience — Validated

I tested the full restart cycle to confirm the pipeline recovers gracefully:

| Step | Result | Time |
|------|--------|------|
| Stop Fargate (scale to 0) | Task stopped | ~30s |
| Restart Fargate (scale to 1) | New task, new IP | ~45s |
| Update ONTAP Engine IP | Reconnection | ~20s |
| File operation after restart | Event delivered to Datadog | ~6s |
| **Total recovery time** | | **~2 minutes** |

The Lambda's retry logic also proved itself: on the first request after reconnection, a transient `RemoteDisconnected` error occurred. The exponential backoff retry succeeded on the second attempt — exactly the behavior we designed for.

```console
[WARNING] HTTP error shipping to Datadog (attempt 1/3): RemoteDisconnected
[INFO] Processing complete: {"statusCode": 200, "body": {"shipped": 1}}
```
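
The retry behavior in that log can be illustrated with a small helper. This is a sketch rather than the repository's implementation: the name `retry_with_backoff`, the default of 3 attempts, and the 0.5 s base delay are assumptions, and it assumes the shipping call raises a `ConnectionError` subclass (as `http.client.RemoteDisconnected` does).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_sec: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts is exhausted.

    The wait doubles after every failure: base, 2*base, 4*base, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            delay = base_delay_sec * 2 ** (attempt - 1)
            print(f"[WARNING] attempt {attempt}/{max_attempts} failed: {exc!r}")
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


# Simulated shipper: fails once with a transient error, then succeeds.
calls = {"n": 0}


def flaky_ship() -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("RemoteDisconnected")
    return {"shipped": 1}


delays: list = []
result = retry_with_backoff(flaky_ship, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the helper testable without real waiting.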


### Cost Profile

| Component | Monthly Cost (estimate) |
|-----------|------------------------|
| Fargate (0.25 vCPU, 0.5 GB) | ~$10 |
| SQS (low volume) | < $1 |
| Lambda (event-driven) | < $1 |
| CloudWatch Logs | ~$2 |
| **Total** | **~$14/month** |

Compare this to an always-on EC2-based collector, plus OS patching, agent management, and HA considerations. Exact EC2 costs vary by region and instance type.

> This is an AWS-side estimate and excludes Datadog ingest/retention costs, NAT Gateway or VPC endpoint charges, ECR storage, and high-volume CloudWatch Logs.

### Scaling

A single Fargate task is sufficient for the low-volume validation scenarios in this post. The architecture can scale by tuning Fargate CPU/memory, SQS buffering, and Lambda concurrency, but you should benchmark your own workload before assuming a specific events/sec capacity.

### Monitoring

Key CloudWatch metrics to watch:
- `CPUUtilization` (`AWS/ECS` namespace): Fargate task health
- `ApproximateNumberOfMessagesVisible` (`AWS/SQS` namespace): queue depth (should stay near 0)
- `Errors` (`AWS/Lambda` namespace): shipping failures
- `Duration` (`AWS/Lambda` namespace): processing time per batch
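
As an illustration of turning the queue-depth metric into an alarm, the sketch below only builds the keyword arguments for CloudWatch's `put_metric_alarm`; nothing is sent to AWS. The function name, the threshold of 100 messages, the alarm name, and the queue and topic names are assumptions for the example.

```python
def queue_depth_alarm_params(queue_name: str, topic_arn: str, threshold: int = 100) -> dict:
    """Build put_metric_alarm keyword arguments for an SQS backlog alarm."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # seconds per evaluation period
        "EvaluationPeriods": 3,   # backlog must persist before alarming
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


params = queue_depth_alarm_params(
    "fsxn-fpolicy-events",  # hypothetical queue name
    "arn:aws:sns:ap-northeast-1:123456789012:example-alerts",  # hypothetical topic
)
print(params["MetricName"], params["Threshold"])
```

With boto3 available, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.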

## The FPolicy Server

The FPolicy server (`shared/fpolicy-server/fpolicy_server.py`) implements:

- **Protocol negotiation**: Responds to ONTAP's version handshake
- **KeepAlive handling**: Acknowledges connection health checks
- **Event parsing**: Extracts file path, operation, user, client IP from binary frames
- **SQS forwarding**: Sends normalized JSON events to the queue
- **Write coalescing**: Configurable delay to batch rapid write events (default: 5 seconds)

The server runs in `realtime` mode — events are forwarded as they arrive, with optional write-complete delay to avoid duplicate notifications for multi-write operations.
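
The write-complete delay can be pictured as a small per-path buffer. The following is a sketch of the idea, not the code in `fpolicy_server.py`: the class name `WriteCoalescer`, the injected clock, and the event shape (a dict with a `file_path` key) are assumptions.

```python
import time
from typing import Callable


class WriteCoalescer:
    """Hold the latest event per file path and release it after a quiet period."""

    def __init__(self, delay_sec: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self._delay = delay_sec
        self._clock = clock
        self._pending: dict = {}  # file_path -> (last_seen_time, event)

    def add(self, event: dict) -> None:
        # A newer write to the same path replaces the older one and resets the timer.
        self._pending[event["file_path"]] = (self._clock(), event)

    def flush_ready(self) -> list:
        """Return (and forget) events whose path has been quiet for delay_sec."""
        now = self._clock()
        ready = [path for path, (ts, _) in self._pending.items() if now - ts >= self._delay]
        return [self._pending.pop(path)[1] for path in ready]


# Demonstration with a fake clock so no real waiting is needed.
fake_now = [0.0]
coalescer = WriteCoalescer(delay_sec=5.0, clock=lambda: fake_now[0])
coalescer.add({"file_path": "/vol1/a.txt", "seq": 1})
fake_now[0] = 2.0
coalescer.add({"file_path": "/vol1/a.txt", "seq": 2})  # second write, same file
fake_now[0] = 6.0
early = coalescer.flush_ready()   # only 4 s since the last write
fake_now[0] = 7.0
late = coalescer.flush_ready()    # 5 s of quiet have passed
print(early, late)
```

Two rapid writes to the same file produce a single forwarded event, at the cost of up to `delay_sec` of added latency.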

## Limitations and Future Work

### Rename/Delete Events Not Delivered in Async Mode

In my E2E testing, ONTAP did not deliver rename or delete notifications to the FPolicy server in asynchronous mode — even though these operations are configured in the FPolicy event definition. Only create events were reliably delivered. This appears to be a limitation of FSx for ONTAP's FPolicy implementation in async mode for certain operation types.

**Workaround options:**
- Use synchronous mode (adds latency to file operations — not recommended for production)
- Combine FPolicy (event-driven create) with audit log polling (catches rename/delete in EVTX)
- Accept create-only monitoring for event-driven alerting, use audit logs for forensic completeness

### NFS Protocol Support

| Protocol | FPolicy Support | Notes |
|----------|----------------|-------|
| SMB/CIFS | ✅ Verified | Primary validation protocol |
| NFSv3 | ✅ Supported | Requires explicit `vers=3` mount option |
| NFSv4.0 | ✅ Supported | Requires explicit `vers=4.0` |
| NFSv4.1 | ✅ Supported | Requires ONTAP 9.15.1+, explicit `vers=4.1` |
| NFSv4.2 | ❌ Not supported | ONTAP FPolicy does not monitor NFSv4.2 operations |

For protocol support details, verify your ONTAP version. NetApp [documents](https://kb.netapp.com/onprem/ontap/da/NAS/Does_ONTAP_support_FPolicy_for_NFS_4.2) that FPolicy does not currently support NFSv4.2; supported NFS protocols include NFSv3, NFSv4.0, and NFSv4.1 (ONTAP 9.15.1+).

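**Critical gotcha:** On modern Linux clients, `mount -o vers=4` negotiates the highest NFSv4 minor version that both client and server support, which is typically NFSv4.2, and ONTAP FPolicy does **not** monitor NFSv4.2. Always pin the version explicitly: `mount -o vers=4.1` or `vers=3`.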
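
One way to confirm what a client actually negotiated is to read the `vers=` option from `/proc/mounts`. The sketch below parses a single line in that format; the function name and the sample line (host name, mount point, options) are hypothetical.

```python
from typing import Optional

# NFS versions that FPolicy does not monitor, per the table above.
UNSUPPORTED_BY_FPOLICY = {"4.2"}


def nfs_version_from_mount_line(line: str) -> Optional[str]:
    """Extract the negotiated `vers=` value from one /proc/mounts line."""
    fields = line.split()
    # /proc/mounts columns: device, mount point, fs type, options, dump, pass
    if len(fields) < 4 or not fields[2].startswith("nfs"):
        return None
    for option in fields[3].split(","):
        if option.startswith("vers="):
            return option.split("=", 1)[1]
    return None


sample = "svm.example:/vol1 /mnt/fpolicy_vol nfs4 rw,relatime,vers=4.2,rsize=65536 0 0"
version = nfs_version_from_mount_line(sample)
if version in UNSUPPORTED_BY_FPOLICY:
    print(f"NFSv{version} mount: FPolicy will not see these operations")
```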

**NFS + FPolicy latency:** NFSv3 lacks close semantics, so the FPolicy server cannot know when a write is complete. The server uses a configurable `WRITE_COMPLETE_DELAY_SEC` (default: 5s) to wait before forwarding the event. This adds latency but prevents premature processing of incomplete files.

**NFS write hang (observed):** In some configurations, NFS write operations may hang when FPolicy is enabled — even with `is-mandatory=false`. This is a [known ONTAP behavior](https://kb.netapp.com/onprem/ontap/da/NAS/NFS_hung_slowness_issue_when_dealing_with_long_path_names_with_FPolicy_enabled) related to FPolicy notification processing. If you experience this, verify your ONTAP version and consider limiting FPolicy scope to specific volumes.

### User Identity

In the current implementation, the `user` field may be empty for some operations depending on ONTAP's FPolicy notification content. The FPolicy binary frame includes user identity in extended attributes that require additional parsing logic. Future versions will extract this from the NOTI_REQ body.

### Event Durability During Restarts

In my validation, events generated while the Fargate server was disconnected were not observed downstream in Datadog after reconnection. Treat FPolicy delivery during server outages as something you must validate in your own environment.

ONTAP [documentation](https://docs.netapp.com/us-en/ontap/nas-audit/synchronous-asynchronous-notifications-concept.html) describes buffering behavior for asynchronous notifications — notifications generated during a network outage are stored on the storage node and can be fetched when the server comes back online. Beginning with ONTAP 9.14.1, [FPolicy persistent store](https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html) support is available for asynchronous non-mandatory policies. If you cannot tolerate event loss during FPolicy server restarts, evaluate persistent store and validate the behavior on your FSx for ONTAP version.

## Try It Yourself

```bash
# Clone the repository
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
cd fsxn-observability-integrations

# Deploy prerequisites (if not already done)
# Supply your own value after each "=" in the parameter overrides
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId= \
    SubnetIds= \
    FsxnSvmSecurityGroupId= \
    ContainerImage= \
  --capabilities CAPABILITY_NAMED_IAM

# Configure ONTAP FPolicy (see ONTAP section above)

# Create a file on the SMB share

# Check Datadog: source:fsxn-fpolicy
```




## Where FPolicy Fits in ONTAP Telemetry

This series covers three ONTAP telemetry sources. Each serves a different purpose:

| Use Case | Best Source | Latency | Coverage |
|----------|-------------|---------|----------|
| Compliance audit trail | Audit logs ([Part 2](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c)) | Minutes (scheduler interval) | Complete historical record |
| Ransomware detection | ARP via EMS ([Part 3](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)) | ~30 seconds (webhook) | ML-based pattern detection |
| Event-driven file activity signal | FPolicy (this post) | ~6 seconds (TCP) | Create events validated; other operations depend on mode/version |
| Forensic investigation | Audit logs + FPolicy correlation | Combined | Timeline reconstruction |

**FPolicy is not a replacement for audit logs.** It provides an event-driven signal for detection and alerting. Audit logs provide the authoritative, complete historical record for compliance and forensics. Use them together.

## Key Takeaways

1. **Use Fargate for the FPolicy TCP listener**: ONTAP opens a long-lived TCP connection to the FPolicy server, and Lambda cannot host a persistent inbound listener. Fargate provides the long-running listener without OS management.
2. **Use SQS to decouple ingestion from shipping** — If Datadog is slow or Lambda is throttled, events buffer safely in SQS.
3. **Validate operation coverage in your environment** — Async mode reliably delivered create events in my testing. Rename/delete behavior varies by ONTAP version and mode.
4. **Use audit logs for forensic completeness** — FPolicy provides event-driven signal for detection; audit logs (Part 2) provide the complete historical record.
5. **Treat FPolicy as event-driven alerting, not full audit replacement** — The two are complementary, not interchangeable.

## Production Considerations Beyond This Validation

This post validates the end-to-end path. For production deployments, the following topics warrant additional design work:

| Topic | Key Questions |
|-------|--------------|
| **HA / Multi-AZ** | ONTAP external engine supports `primary-servers` and `secondary-servers`. How to run multiple Fargate tasks across AZs? |
| **Scope Design** | Which volumes, operations, and protocols to monitor? How to avoid noisy workloads? |
| **Security Hardening** | TLS/mTLS for FPolicy, ECR image scanning, VPC Flow Logs, task role least-privilege |
| **Cost Model** | FPolicy generates events per file operation — Datadog ingest can become the dominant cost at scale |
| **Operations Runbook** | Task restart, engine disconnected, SQS backlog, Datadog missing events, NFS hang |
| **Stable Endpoint** | Auto-update Lambda for engine IP, or primary/secondary server design for zero-downtime restarts |

These topics are documented in the repository:

- **[Production Architecture Patterns](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-production-architecture-patterns.md)** — Single task, primary/secondary, auto-update, multi-AZ patterns with failure mode matrix
- **[Operational Guide](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-operational-guide.md)** — 4-layer health model, runbooks, IP reconciliation, synthetic health check
- **[PoC Checklist](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-poc-checklist.md)** — Preconditions, scope, validation steps, success criteria, go/no-go

Contributions and questions are welcome.

## Series Navigation

- **Part 1**: [Why Your FSx for ONTAP Logs Deserve Better](https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod)
- **Part 2**: [Shipping FSx for ONTAP Logs to Datadog, The Serverless Way](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c)
- **Part 3**: [Event-Driven Ransomware Detection with ONTAP ARP + Datadog](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)
- **Part 4**: FPolicy File Activity Pipeline (this post)

Coming next:
- **Splunk**: Replacing EC2 + Universal Forwarder with Lambda + HEC
- **OpenTelemetry**: The vendor-neutral escape hatch

---

*Questions about FPolicy or the Fargate architecture? Drop a comment below.*

*Previous: [Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)*

**GitHub**: [github.com/Yoshiki0705/fsxn-observability-integrations](https://github.com/Yoshiki0705/fsxn-observability-integrations)

Operational Hardening — Guardrails, Secrets Rotation & SLO — FSx ONTAP S3AP Phase 12

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 18:21:39 +0000

TL;DR

Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.

Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.

This is Phase 12 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10 and Phase 11, Phase 12 delivers:

Capacity Guardrails: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics
Secrets Rotation: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on 90-day interval
Synthetic Monitoring: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)
Capacity Forecasting: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule
Data Lineage Tracking: DynamoDB table with GSI for processing history and opt-in integration
Protobuf TCP Framing: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader
SLO Definition: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection
FPolicy Pipeline E2E: NFS file creation → FPolicy → SQS delivery confirmed
Persistent Store Replay: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios
Property-Based Testing: 16 Hypothesis properties, 53 tests, 3 bugs discovered
S3 Access Point Deep Dive: Multi-layer authorization, IAM ARN format, VPC network constraints

Key metrics: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS

The problem

FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.

The solution

A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:

graph LR
    A[Auto-Expand Request] --> B{GuardrailMode?}
    B -->|DRY_RUN| C[Log + Allow<br/>fail-open on DDB error]
    B -->|ENFORCE| D[Check + Block<br/>fail-closed on DDB error]
    B -->|BREAK_GLASS| E[Bypass All Checks<br/>SNS Alert + Audit Log]
    C --> F[DynamoDB Tracking]
    D --> F
    E --> F
    F --> G[CloudWatch EMF Metrics]

Mode	Behavior on Check Failure	Behavior on DynamoDB Error
`DRY_RUN`	Log warning, allow action	Fail-open (allow)
`ENFORCE`	Block action, emit metric	Fail-closed (deny)
`BREAK_GLASS`	Skip all checks	SNS alert + audit log

Core implementation

from shared.guardrails import CapacityGuardrail, GuardrailMode

guardrail = CapacityGuardrail()  # Mode from GUARDRAIL_MODE env var

result = guardrail.check_and_execute(
    action_type="volume_grow",
    requested_gb=50.0,
    execute_fn=my_grow_function,
    volume_id="vol-abc123",
)

if result.allowed:
    print(f"Action executed: {result.action_id}")
else:
    print(f"Action denied: {result.reason}")
    # Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active

Three safety checks (ENFORCE mode)

Rate limit: Max 10 actions per day per action type
Daily cap: Max 500 GB cumulative expansion per day
Cooldown: 300-second minimum interval between actions

All thresholds are configurable via environment variables (GUARDRAIL_RATE_LIMIT, GUARDRAIL_DAILY_CAP_GB, GUARDRAIL_COOLDOWN_SECONDS).
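
The three checks can be sketched as a pure function over the day's tracking state. This is an illustration under assumptions, not the code in `shared.guardrails`: the names `DailyState` and `check_guardrails` and the order in which the checks run are hypothetical, while the defaults and reason strings follow the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DailyState:
    """Per-day tracking values for one action type."""
    action_count: int = 0
    daily_total_gb: float = 0.0
    last_action_epoch: Optional[float] = None


def check_guardrails(
    state: DailyState,
    requested_gb: float,
    now_epoch: float,
    rate_limit: int = 10,
    daily_cap_gb: float = 500.0,
    cooldown_seconds: int = 300,
) -> Optional[str]:
    """Return a denial reason, or None when the action is allowed."""
    if state.action_count >= rate_limit:
        return "rate_limit_exceeded"
    if state.daily_total_gb + requested_gb > daily_cap_gb:
        return "daily_cap_exceeded"
    if (
        state.last_action_epoch is not None
        and now_epoch - state.last_action_epoch < cooldown_seconds
    ):
        return "cooldown_active"
    return None


state = DailyState(action_count=3, daily_total_gb=480.0, last_action_epoch=1000.0)
print(check_guardrails(state, requested_gb=50.0, now_epoch=2000.0))
```

In ENFORCE mode a non-None reason would block the action; in DRY_RUN the same reason would only be logged.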

DynamoDB tracking schema

Attribute	Type	Description
`pk`	String	Action type (e.g., `volume_grow`)
`sk`	String	Date (`YYYY-MM-DD`)
`daily_total_gb`	Number	Cumulative GB expanded today
`action_count`	Number	Number of actions today
`last_action_ts`	String	ISO timestamp of last action
`actions`	List	Audit trail of all actions
`ttl`	Number	30-day auto-expiry

BREAK_GLASS production considerations

In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.

2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation

The problem

ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.

The solution

A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:

sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM->>Lambda: Step 1: createSecret
    Lambda->>SM: Generate new password, store as AWSPENDING

    SM->>Lambda: Step 2: setSecret
    Lambda->>ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP-->>Lambda: 200 OK

    SM->>Lambda: Step 3: testSecret
    Lambda->>ONTAP: GET /api/cluster (using new password)
    ONTAP-->>Lambda: 200 OK (cluster UUID returned)

    SM->>Lambda: Step 4: finishSecret
    Lambda->>SM: Promote AWSPENDING → AWSCURRENT

Key design decisions

VPC deployment: Lambda must run in a VPC that can reach the ONTAP management LIF (here, the same VPC)
90-day interval: Configurable via CloudFormation parameter
Validation: Step 3 (testSecret) verifies the new password works by calling the ONTAP cluster API
Rollback safety: If testSecret fails, the old password remains as AWSCURRENT
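
Secrets Manager invokes the rotation function once per step, passing `Step`, `SecretId`, and `ClientRequestToken` in the event. A minimal dispatcher for that protocol might look like the sketch below; `make_rotation_handler` and the step callables are hypothetical, and the ONTAP and Secrets Manager calls are left to the injected functions.

```python
from typing import Callable, Dict

StepFn = Callable[[str, str], None]  # (secret_id, client_request_token)

ROTATION_STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def make_rotation_handler(steps: Dict[str, StepFn]) -> Callable[[dict, object], None]:
    """Build a Lambda-style handler that routes each rotation step."""
    missing = [name for name in ROTATION_STEPS if name not in steps]
    if missing:
        raise ValueError(f"Missing step implementations: {missing}")

    def handler(event: dict, context: object) -> None:
        step = event["Step"]
        if step not in steps:
            raise ValueError(f"Unknown rotation step: {step}")
        steps[step](event["SecretId"], event["ClientRequestToken"])

    return handler


# Stub implementations that only record the calls they receive.
calls: list = []
stub_steps = {
    name: (lambda sid, tok, name=name: calls.append((name, sid, tok)))
    for name in ROTATION_STEPS
}
handler = make_rotation_handler(stub_steps)
for step_name in ROTATION_STEPS:
    handler({"Step": step_name, "SecretId": "example-secret", "ClientRequestToken": "tok-1"}, None)
print([c[0] for c in calls])
```

Raising from any step makes the invocation fail, which is what leaves the AWSCURRENT label on the previous version.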

Bugs discovered during live testing

Three bugs were found and fixed during the actual rotation execution:

AWSPENDING empty check: createSecret must handle the case where get_secret_value(VersionStage='AWSPENDING') raises ResourceNotFoundException
management_ip fallback: The Lambda must support both management_ip (new) and ontap_mgmt_ip (legacy) keys in the secret JSON
Cluster UUID validation: testSecret now validates the response contains a valid uuid field, not just HTTP 200
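
The second and third fixes are easy to show in isolation. The helpers below are a sketch with hypothetical names; they assume the secret is a JSON string and that the cluster API response body has been parsed into a dict.

```python
import json
import uuid


def resolve_management_ip(secret_string: str) -> str:
    """Prefer the new key, fall back to the legacy key."""
    secret = json.loads(secret_string)
    for key in ("management_ip", "ontap_mgmt_ip"):
        if secret.get(key):
            return secret[key]
    raise KeyError("secret has neither management_ip nor ontap_mgmt_ip")


def is_valid_cluster_response(status_code: int, body: dict) -> bool:
    """Require HTTP 200 and a well-formed uuid field, not just a 200."""
    if status_code != 200:
        return False
    try:
        uuid.UUID(str(body.get("uuid", "")))
    except ValueError:
        return False
    return True


print(resolve_management_ip('{"ontap_mgmt_ip": "10.0.0.5"}'))
print(is_valid_cluster_response(200, {"name": "cluster-without-uuid"}))
```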

Verification result

Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT

Operational note

Rotating fsxadmin affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default — ensure rotation Lambda's urllib3 or requests configuration handles certificate verification appropriately (see shared/ontap_client.py for the pattern used in this project).

For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing fsxadmin across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.

3. Synthetic Monitoring — CloudWatch Synthetics Canary

The problem

The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.

The solution

A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:

ONTAP Health Check: REST API call to the management endpoint (VPC-internal)
S3 Access Point Check: ListObjectsV2 against the S3AP alias

Critical finding: network-origin and endpoint configuration matter

During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.

This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS documents support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.

In this Phase 12 environment (Internet-origin S3 AP), the operational fix was to split monitoring into two paths:

Check	Observed requirement in this environment	Result
ONTAP REST API	VPC-internal access to management LIF	✅ Works
S3AP health check	Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy	⚠️ Timed out from the initial VPC Canary configuration

Solution: Split into two monitoring paths:

ONTAP health: VPC-internal Canary (confirmed working, 88ms response)
S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)

This is documented as a critical constraint in docs/guides/s3ap-fsxn-specification.md.

Canary runtime version lesson

The template initially specified syn-python-selenium-3.0, which was deprecated on 2026-02-03. Updated to syn-python-selenium-11.0. CloudWatch Synthetics runtimes are deprecated frequently — parameterize the version or keep defaults current.

AWS builder lesson: VPC placement is a design choice

A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is connected to a VPC, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.

4. Capacity Forecasting — Linear Regression with stdlib Only

The problem

Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.

The solution

A Lambda function running on a daily EventBridge schedule:

Fetches 30 days of FSx StorageUsed metrics from CloudWatch
Performs linear regression using only the Python standard library (zero external dependencies)
Publishes DaysUntilFull as a CloudWatch custom metric
Sends SNS alert when forecast drops below threshold (default: 30 days)

Linear regression implementation (stdlib only)

def linear_regression(data_points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares linear regression with no external dependencies; returns (slope, intercept)."""
    n = len(data_points)
    if n < 2:
        raise ValueError("Need at least 2 data points for regression")

    sum_x = sum_y = sum_xy = sum_x2 = 0.0
    for x, y in data_points:
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x

    denominator = n * sum_x2 - sum_x * sum_x
    if abs(denominator) < 1e-10:
        return (0.0, sum_y / n)

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return (slope, intercept)

Edge cases handled

Scenario	DaysUntilFull	Behavior
< 2 data points	-1	Insufficient data, no prediction
slope ≤ 0 (shrinking/flat)	-1	Never fills up
Already over capacity	0	Immediate alert
Very low usage (0.03%)	169,374	Normal — far future prediction
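
The edge cases in the table can be expressed as a small function on top of the regression slope. This is a sketch: the name `days_until_full`, the order of the checks, and rounding down to whole days are assumptions rather than the project's confirmed behavior.

```python
import math


def days_until_full(
    slope_gb_per_day: float,
    current_used_gb: float,
    total_capacity_gb: float,
    n_points: int,
) -> int:
    """Map a growth rate to a DaysUntilFull value, with -1 meaning no prediction."""
    if n_points < 2:
        return -1  # Insufficient data for a regression.
    if current_used_gb >= total_capacity_gb:
        return 0   # Already at or over capacity: alert immediately.
    if slope_gb_per_day <= 0:
        return -1  # Flat or shrinking usage never fills the volume.
    remaining_gb = total_capacity_gb - current_used_gb
    return math.floor(remaining_gb / slope_gb_per_day)


# 100 GB of headroom growing at 0.5 GB/day.
print(days_until_full(0.5, 924.0, 1024.0, n_points=30))
```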

Live verification

{
  "days_until_full": 169374,
  "current_usage_pct": 0.03,
  "total_capacity_gb": 1024.0,
  "growth_rate_gb_per_day": 0.006,
  "forecast_date": "2490-02-06T06:26:42Z"
}

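The test environment has 0.03% usage, so the prediction of 169,374 days is correct behavior: roughly 1,023.7 GB of free capacity divided by a growth rate of about 0.006 GB/day gives a horizon on the order of 170,000 days (about 464 years), which matches the forecast_date in the year 2490. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.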

This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat DaysUntilFull as an early-warning signal, not an exact prediction.

5. Data Lineage Tracking — DynamoDB with GSI

The problem

When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.

The solution

A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:

graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key<br/>SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index<br/>PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] -->|PK lookup| PK
    Q2[Query by UC + time range] -->|GSI query| GSI
    Q3[Query by execution ARN] -->|Scan + filter| PK

For high-volume environments, consider adding a dedicated GSI on step_functions_execution_arn. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.
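
The "UC + time range" pattern maps to a DynamoDB Query on the GSI. The sketch below only builds the request for the low-level client and sends nothing. The table and index names come from the diagram above; the function name and the sample time range are hypothetical.

```python
def build_uc_time_range_query(uc_id: str, start_iso: str, end_iso: str) -> dict:
    """Build Query keyword arguments for the uc_id-timestamp-index GSI."""
    return {
        "TableName": "fsxn-s3ap-data-lineage",
        "IndexName": "uc_id-timestamp-index",
        # ISO 8601 timestamps in one fixed format sort lexicographically,
        # so BETWEEN on the string sort key selects a time range.
        "KeyConditionExpression": (
            "uc_id = :uc AND processing_timestamp BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":uc": {"S": uc_id},
            ":start": {"S": start_iso},
            ":end": {"S": end_iso},
        },
    }


query = build_uc_time_range_query(
    "legal-compliance", "2026-05-16T00:00:00.000Z", "2026-05-16T23:59:59.999Z"
)
print(query["IndexName"])
```

With boto3 available, the dictionary could be passed as `boto3.client("dynamodb").query(**query)`. The execution-ARN lookup has no matching key, which is why it falls back to scan plus filter.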

Integration helper (opt-in)

from shared.lineage import LineageTracker, LineageRecord

tracker = LineageTracker()
record = LineageRecord(
    source_file_key="/vol1/legal/contracts/deal-001.pdf",
    processing_timestamp="2026-05-16T14:30:45.123Z",
    step_functions_execution_arn="arn:aws:states:...:execution:...",
    uc_id="legal-compliance",
    output_keys=["s3://output-bucket/legal/reports/deal-001-analysis.json"],
    status="success",
    duration_ms=4523,
)
lineage_id = tracker.record(record)

Design principles

Non-blocking: Write failures emit a warning log but never interrupt the main processing pipeline
TTL: 365-day auto-expiry via DynamoDB TTL (configurable via LINEAGE_TTL_DAYS environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)
Opt-in: UCs integrate by importing the helper — no mandatory coupling
PAY_PER_REQUEST: No capacity planning needed for variable workloads

Future: compliance-grade lineage (v2)

For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future LineageRecord v2:

Field	Purpose
`input_checksum`	SHA-256 of source file for integrity verification
`output_checksum`	SHA-256 of generated output
`fpolicy_sequence_number`	ONTAP-assigned sequence for ordering
`policy_version`	FPolicy policy configuration version
`uc_template_version`	UC CloudFormation template version
`guardrail_mode`	Active guardrail mode at processing time
`retention_profile`	Retention class for compliance tiering

For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.

6. Protobuf TCP Framing — Adaptive Reader

The problem

Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing read_fpolicy_message() assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.

The solution

An adaptive ProtobufFrameReader that supports three framing modes:

graph TD
    A[Incoming TCP Stream] --> B{FramingMode}
    B -->|AUTO_DETECT| C[Probe first 4 bytes]
    C -->|Valid uint32 length| D[LENGTH_PREFIXED]
    C -->|Otherwise| E[FRAMELESS]
    B -->|LENGTH_PREFIXED| D
    B -->|FRAMELESS| E
    D --> F[4-byte big-endian header → payload]
    E --> G[varint-delimited → payload]
    F --> H[Decoded Message]
    G --> H

Three modes

Mode	Wire Format	Use Case
`LENGTH_PREFIXED`	4-byte big-endian length + payload	XML mode (legacy)
`FRAMELESS`	varint-delimited protobuf	Protobuf mode (ONTAP 9.15.1+)
`AUTO_DETECT`	Probe first bytes, then lock mode	Unknown/mixed environments

Auto-detection heuristic

async def _auto_detect_and_read(self) -> bytes | None:
    """Probe first 4 bytes to determine framing mode."""
    peek = await self._reader.readexactly(4)
    candidate_length = struct.unpack("!I", peek)[0]

    if 0 < candidate_length <= self._max_message_size:
        # Valid length header → LENGTH_PREFIXED
        self._detected_mode = FramingMode.LENGTH_PREFIXED
        payload = await self._reader.readexactly(candidate_length)
        return payload
    else:
        # Not a valid length → FRAMELESS (varint-delimited)
        self._detected_mode = FramingMode.FRAMELESS
        self._buffer = peek
        return await self._read_varint_delimited()

Safety features

Max message size enforcement (default 1 MB): Prevents DoS via malformed length headers
FramingError exception: Structured error with offset and raw data for debugging
Graceful EOF handling: Returns None on connection close without raising
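
For readers unfamiliar with varint-delimited framing, the sketch below decodes the standard protobuf base-128 varint length prefix and splits a buffer into messages while enforcing a size limit. It illustrates the general convention only: whether ONTAP's protobuf mode uses exactly this framing is still unconfirmed, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def decode_varint(buf: bytes, max_bytes: int = 10) -> Optional[Tuple[int, int]]:
    """Decode a base-128 varint at the start of buf.

    Returns (value, bytes_consumed), or None if buf ends mid-varint.
    """
    value = 0
    for i, byte in enumerate(buf[:max_bytes]):
        value |= (byte & 0x7F) << (7 * i)  # low 7 bits carry the payload
        if not byte & 0x80:                # high bit clear: last byte
            return value, i + 1
    if len(buf) >= max_bytes:
        raise ValueError("varint longer than max_bytes")
    return None


def split_varint_delimited(
    buf: bytes, max_message_size: int = 1_048_576
) -> Tuple[List[bytes], bytes]:
    """Split buf into complete messages; return (messages, unconsumed_tail)."""
    messages: List[bytes] = []
    while True:
        header = decode_varint(buf)
        if header is None:
            break
        length, consumed = header
        if length > max_message_size:
            raise ValueError(f"declared length {length} exceeds limit")
        if len(buf) < consumed + length:
            break  # Payload not fully received yet.
        messages.append(buf[consumed:consumed + length])
        buf = buf[consumed + length:]
    return messages, buf


print(decode_varint(b"\xac\x02"))
print(split_varint_delimited(b"\x03abc\x02de\x05xy"))
```

The unconsumed tail is what a stream reader would keep in its buffer until more bytes arrive.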

Integration with existing FPolicy server

from shared.integrations.protobuf_integration import create_fpolicy_reader, read_fpolicy_message_v2

# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
reader = create_fpolicy_reader(stream)
message = await read_fpolicy_message_v2(reader or stream)

Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.

Phase 13 protobuf validation scope

The following questions will be confirmed with NetApp support during live wire validation:

Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)
Message boundary behavior under high throughput
Keep-alive behavior in protobuf mode vs XML mode
Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?
Mixed-mode migration path (XML → protobuf transition without event loss)
Maximum message size guidance from ONTAP side

7. SLO Definition — 4 Targets with CloudWatch Dashboard

The problem

Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.

The solution

Four SLO targets covering the critical path of the event-driven pipeline:

SLO	Metric	Target	SLO met when
Event Ingestion Latency	`EventIngestionLatency_ms`	P99 < 5,000 ms	LessThanThreshold
Processing Success Rate	`ProcessingSuccessRate_pct`	> 99.5%	GreaterThanThreshold
Reconnect Time	`FPolicyReconnectTime_sec`	< 30 sec	LessThanThreshold
Replay Completion Time	`ReplayCompletionTime_sec`	< 300 sec (5 min)	LessThanThreshold

For success rate, the CloudWatch Alarm fires when the metric drops below 99.5% (ComparisonOperator: LessThanThreshold), even though the SLO target is expressed as "> 99.5%".
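
The relationship between "SLO met when" and the alarm's comparison operator can be made explicit in code. This sketch uses hypothetical names (`SloTarget`, `EXAMPLE_TARGETS`) and is not the `shared.slo` module; the thresholds are taken from the table above.

```python
import operator
from dataclasses import dataclass

_MET_WHEN = {"LessThanThreshold": operator.lt, "GreaterThanThreshold": operator.gt}
# The alarm watches for the opposite direction of the SLO condition.
_ALARM_OPERATOR = {
    "LessThanThreshold": "GreaterThanThreshold",
    "GreaterThanThreshold": "LessThanThreshold",
}


@dataclass(frozen=True)
class SloTarget:
    metric: str
    threshold: float
    met_when: str

    def is_met(self, value: float) -> bool:
        return _MET_WHEN[self.met_when](value, self.threshold)

    def alarm_operator(self) -> str:
        return _ALARM_OPERATOR[self.met_when]


EXAMPLE_TARGETS = [
    SloTarget("EventIngestionLatency_ms", 5000.0, "LessThanThreshold"),
    SloTarget("ProcessingSuccessRate_pct", 99.5, "GreaterThanThreshold"),
    SloTarget("FPolicyReconnectTime_sec", 30.0, "LessThanThreshold"),
    SloTarget("ReplayCompletionTime_sec", 300.0, "LessThanThreshold"),
]

success_rate = EXAMPLE_TARGETS[1]
print(success_rate.is_met(99.7), success_rate.alarm_operator())
```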

CloudWatch Dashboard

The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):

from shared.slo import SLO_TARGETS, evaluate_slos, generate_dashboard_widgets

# Evaluate all SLOs programmatically
results = evaluate_slos(cloudwatch_client)
for r in results:
    status = "MET" if r.met else "VIOLATED"
    print(f"{r.slo_name}: {status} (value={r.value}, threshold={r.threshold})")

# Generate dashboard widget JSON for CloudFormation
widgets = generate_dashboard_widgets(region="ap-northeast-1")

Alarm-based violation detection

Each SLO has a corresponding CloudWatch Alarm:

Alarm Name	State	Evaluation
`fsxn-s3ap-slo-ingestion-latency`	OK	3 consecutive periods
`fsxn-s3ap-slo-success-rate`	OK	3 consecutive periods
`fsxn-s3ap-slo-reconnect-time`	OK	3 consecutive periods
`fsxn-s3ap-slo-replay-completion`	OK	3 consecutive periods

All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.

8. FPolicy Pipeline E2E Verification

The problem

Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.

The verification

sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS->>ONTAP: echo "test" > /mnt/fpolicy_vol/test.txt
    ONTAP->>FP: NOTI_REQ (FILE_CREATE event)
    FP->>FP: Parse event, extract metadata
    FP->>SQS: SendMessage (JSON payload)
    SQS-->>SQS: Message available for consumers

Timeline (actual observed)

Time	Event	Detail
T+0s	TCP connection test	ONTAP → Fargate IP (10.0.128.98:9898)
T+10s	Session established	NEGO_REQ → NEGO_RESP handshake
T+12s	KEEP_ALIVE starts	2-minute interval
T+30s	NFS file created	`echo "test" > /mnt/fpolicy_vol/test_fpolicy_event.txt`
T+31s	NOTI_REQ received	FPolicy server receives file creation event
T+32s	SQS delivery	Event sent to SQS queue (FPolicy_Q)

SQS message format

{
  "event_type": "FILE_CREATE",
  "svm_name": "FSxN_OnPre",
  "volume_name": "vol1",
  "file_path": "/vol1/test_fpolicy_event.txt",
  "client_ip": "10.0.128.98",
  "timestamp": "2026-05-16T08:45:32Z",
  "session_id": 1,
  "sequence_number": 1
}

IAM issue discovered and fixed

The ECS task role's SQS policy used a Resource ARN pattern arn:aws:sqs:...:fsxn-fpolicy-* that didn't match the actual queue name FPolicy_Q. Fix: use explicit ARN or * wildcard in the template.

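Lesson: A queue name that doesn't match the template's Resource pattern is not caught at deploy time; the mismatch only surfaces at runtime, where SQS deliveries can fail silently. Either parameterize the queue ARN or use a broader resource pattern.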
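A pre-deployment check can catch this kind of mismatch. The sketch below approximates IAM wildcard matching with Python's `fnmatchcase`; that is an assumption for illustration (IAM's own matcher is not identical to shell globbing), and the account ID is a placeholder.

```python
from fnmatch import fnmatchcase


def resource_matches(pattern: str, arn: str) -> bool:
    """Approximate IAM Resource wildcard matching with case-sensitive globbing."""
    return fnmatchcase(arn, pattern)


# Placeholder account ID; queue name and pattern are the ones from the incident above.
QUEUE_ARN = "arn:aws:sqs:ap-northeast-1:123456789012:FPolicy_Q"
OLD_PATTERN = "arn:aws:sqs:ap-northeast-1:123456789012:fsxn-fpolicy-*"

pattern_covers_queue = resource_matches(OLD_PATTERN, QUEUE_ARN)
explicit_arn_covers_queue = resource_matches(QUEUE_ARN, QUEUE_ARN)
```

Here `pattern_covers_queue` is False and `explicit_arn_covers_queue` is True, which is the situation the fix addresses.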

Event contract assumptions

The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:

Duplicate events can occur (especially during Persistent Store replay)
Delivery order is not guaranteed (confirmed in Section 9)
Consumers must be idempotent
file_path + timestamp + sequence_number serves as an idempotency key candidate
Replay events may arrive after newer events
Schema versioning should be introduced before multi-UC production rollout
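As a minimal sketch of these assumptions on the consumer side, the class below deduplicates on the candidate key (file_path + timestamp + sequence_number). The class name and the in-memory set are illustrative assumptions; a real consumer would need a durable store for seen keys.

```python
from typing import Callable


def idempotency_key(event: dict) -> tuple:
    """Build the candidate idempotency key from the SQS message fields."""
    return (event["file_path"], event["timestamp"], event["sequence_number"])


class IdempotentConsumer:
    """Invoke the handler at most once per idempotency key (in-memory sketch)."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set = set()

    def process(self, event: dict) -> bool:
        """Return True if the event was handled, False if it was a duplicate."""
        key = idempotency_key(event)
        if key in self._seen:
            return False
        self._handler(event)
        # Record the key only after the handler succeeds, so a failed attempt is retried.
        self._seen.add(key)
        return True
```

Recording the key after the handler returns keeps the at-least-once property: a crash mid-handler leads to a retry rather than a dropped event.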

9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios

The problem

Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.

Important prerequisite: FPolicy Persistent Store is available for asynchronous non-mandatory policies only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have only one Persistent Store, and the same store can be used by multiple policies within that SVM.

The test procedure

Stop Fargate task (ECS stop-task)
Create 5 files via NFS during downtime (replay-test-1.txt through replay-test-5.txt)
Wait for ECS service auto-recovery (new task launch)
Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)
Verify all 5 events arrive in SQS
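The last step can be automated by comparing the file names created during downtime with the file_path values received from SQS. This is a hypothetical helper, not part of the repository; it assumes each message follows the SQS message format shown in Section 8.

```python
def missing_files(expected_names: list, messages: list) -> list:
    """Return expected file names that do not appear in any received message."""
    delivered = {m["file_path"].rsplit("/", 1)[-1] for m in messages}
    return sorted(set(expected_names) - delivered)


expected = [f"replay-test-{n}.txt" for n in range(1, 6)]
# Arrival order does not matter for the completeness check.
received = [{"file_path": f"/vol1/replay-test-{n}.txt"} for n in (3, 1, 2, 5, 4)]
lost = missing_files(expected, received)
```

An empty `lost` list corresponds to the "Lost events: 0" outcome.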

Results

Metric	Value
Events generated during downtime	5
Events replayed to SQS	5
Lost events	0
Replay delivery order	3, 1, 2, 5, 4 (non-sequential)
Replay completion time	~30 seconds

Key observation: Out-of-order replay

Persistent Store replays events in a non-sequential order — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:

Idempotency: Deduplicate by file path + timestamp
Timestamp-based ordering: Sort by event timestamp, not arrival order
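A minimal sketch of both techniques together, assuming each event carries the file_path and ISO 8601 timestamp fields from the SQS message format. The function names are illustrative, and the file_path tie-breaker is an added assumption to keep the ordering deterministic when timestamps collide.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def reorder_events(events: list) -> list:
    """Drop duplicates by (file_path, timestamp), then sort by event time."""
    unique = {}
    for event in events:
        unique.setdefault((event["file_path"], event["timestamp"]), event)
    return sorted(
        unique.values(),
        key=lambda e: (parse_ts(e["timestamp"]), e["file_path"]),
    )
```

This only works on a buffered batch; a streaming consumer would additionally need a window or watermark to decide when late events can no longer arrive.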

20-file burst validation

Additionally, a 20-file burst test confirmed zero event loss under higher load:

Test	Files Created	Events Delivered	Loss
Replay (5 files)	5	5	0
Burst (20 files)	20	20	0

Phase 13 replay storm metrics

The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:

Metric	Purpose
Persistent Store volume usage before/after replay	Capacity planning for the store volume
Events queued vs events replayed	Completeness verification
Replay throughput (events/sec)	Performance baseline
Replay duration	SLO calibration
Out-of-order distance	Downstream buffer sizing
Duplicate events	Idempotency requirement validation
ONTAP EMS logs around disconnect/reconnect	Root cause correlation

Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.
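Two of the metrics above can be computed directly from the arrival sequence. The article does not define "out-of-order distance", so the sketch below assumes one possible definition: the largest gap between an event's arrival position and its creation position, ignoring duplicates.

```python
def out_of_order_distance(arrival_order: list) -> int:
    """Largest |arrival position - creation position| across unique events."""
    creation_rank = {seq: i for i, seq in enumerate(sorted(set(arrival_order)))}
    seen = set()
    max_distance = 0
    position = 0
    for seq in arrival_order:
        if seq in seen:
            continue  # duplicates are counted separately
        seen.add(seq)
        max_distance = max(max_distance, abs(position - creation_rank[seq]))
        position += 1
    return max_distance


def duplicate_count(arrival_order: list) -> int:
    """Number of deliveries beyond the first for each event."""
    return len(arrival_order) - len(set(arrival_order))
```

Applied to the observed replay order 3, 1, 2, 5, 4, this definition gives a distance of 2, which is the kind of number that would size a downstream reordering buffer.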

Operational framing: event durability as RPO/RTO

Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while ReplayCompletionTime_sec provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.

Phase 12 validation scope

Scope	Phase 12 Assumption	Production Consideration
SVM	Single SVM validation	Multi-SVM needs per-SVM policy and Persistent Store planning
Volume	Test volume	Production volumes should be grouped by UC/event profile
Protocol	NFS-based E2E test	NFSv3/NFSv4.1/SMB replay validation remains Phase 13
Event types	File create	Modify/delete/rename validation remains Phase 13
FPolicy mode	Async non-mandatory	Required for Persistent Store (NetApp docs)

10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests

The problem

Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need to explore a much larger, automatically generated input space.

The approach

Using Python's Hypothesis library, we defined 16 properties across the Phase 12 modules:

Property Group	Properties	Tests	Bugs Found
Protobuf Frame Reader	5 (round-trip, max size, EOF, multi-message, auto-detect)	18	1
Capacity Guardrails	4 (mode behavior, rate limit, daily cap, cooldown)	14	1
Data Lineage	3 (record/query round-trip, GSI consistency, TTL)	9	0
SLO Evaluation	2 (threshold comparison, no-data handling)	6	1
Capacity Forecast	2 (regression accuracy, edge cases)	6	0
Total	16	53	3

Bugs discovered

Protobuf reader: AUTO_DETECT mode failed when the first 4 bytes happened to form a valid-looking length that exceeded max_message_size. Fix: treat oversized candidate lengths as FRAMELESS indicator.
Guardrails: BREAK_GLASS mode didn't emit the GuardrailBypass metric when DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.
SLO evaluation: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), max(datapoints, key=lambda dp: dp["Timestamp"]) returned whichever tied datapoint came first, so the result depended on the order of the response. Fix: add secondary sort by value.
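The SLO evaluation fix can be sketched as a tuple key. This is an illustration rather than the repository's code: the "Average" statistic key and the choice to prefer the larger value on a tie are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional


def latest_datapoint(datapoints: list, stat: str = "Average") -> Optional[dict]:
    """Pick the newest datapoint; on a timestamp tie, prefer the larger value."""
    if not datapoints:
        return None
    return max(datapoints, key=lambda dp: (dp["Timestamp"], dp[stat]))


t = datetime(2026, 5, 16, 8, 45, tzinfo=timezone.utc)
low = {"Timestamp": t, "Average": 98.0}
high = {"Timestamp": t, "Average": 99.9}
# The result no longer depends on the order in which the datapoints arrive.
same_result = latest_datapoint([low, high]) == latest_datapoint([high, low])
```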

Example property test

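import asyncio

from hypothesis import given, settings, strategies as st

# ProtobufFrameReader, FramingMode, and the _make_* helpers are defined in the repository (not shown here).
@given(messages=st.lists(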
    st.binary(min_size=1, max_size=1000),
    min_size=1, max_size=10,
))
@settings(max_examples=200)
def test_length_prefixed_round_trip(self, messages: list[bytes]):
    """Property: LENGTH_PREFIXED encode → decode preserves all messages."""
    stream_data = _make_length_prefixed_stream(messages)
    reader = _make_stream_reader(stream_data)
    frame_reader = ProtobufFrameReader(
        reader=reader,
        mode=FramingMode.LENGTH_PREFIXED,
        max_message_size=max(len(m) for m in messages) + 1,
    )

    decoded = []
    for _ in range(len(messages)):
        msg = asyncio.run(frame_reader.read_message())
        assert msg is not None
        decoded.append(msg)

    assert decoded == messages  # Round-trip property
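For readers without the repository at hand, the round-trip property can be illustrated with a standalone codec. This is a simplified sketch, not the repository's ProtobufFrameReader: it assumes a 4-byte big-endian length prefix and works on an in-memory byte string rather than an asyncio stream.

```python
import struct


def encode_frames(messages: list) -> bytes:
    """Prefix each message with its length as a 4-byte big-endian integer."""
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)


def decode_frames(data: bytes, max_message_size: int) -> list:
    """Decode length-prefixed frames, rejecting oversized or truncated input."""
    messages = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        if length > max_message_size:
            raise ValueError("frame exceeds max_message_size")
        if len(data) - offset < length:
            raise ValueError("truncated frame body")
        messages.append(data[offset:offset + length])
        offset += length
    return messages
```

The property under test is then simply that `decode_frames(encode_frames(msgs), limit) == msgs` for any list of messages within the size limit.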

11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints

The critical finding

FSx for ONTAP S3 Access Points are not standard S3 endpoints. They use the FSx data plane, which has different network routing characteristics than standard S3.

In this pattern library, FSx for ONTAP S3 Access Points serve as an AWS service integration boundary: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.

Multi-layer authorization model

graph TD
    Client[S3 API Client] --> IAM{Layer 1: IAM Policy}
    IAM -->|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP -->|resource policy| FS{Layer 3: File System Identity}
    FS -->|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.->|❌ Denied| Block1[Access Denied]
    AP -.->|❌ Denied| Block2[Access Denied]
    FS -.->|❌ No permission| Block3[Access Denied]

AWS documents this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.

Correct IAM ARN format

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap/object/*"
    }
  ]
}

Common mistake: Using the S3AP alias (xxx-ext-s3alias) as a bucket ARN. The alias is only valid as the Bucket parameter in S3 API calls (for example, in boto3); IAM policies require the full access point ARN.
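To avoid hand-typing the two ARN shapes, they can be generated from the access point name. The helper names below are hypothetical; the ARN formats are the ones shown in the policy above, and the account ID in the example is a placeholder.

```python
def access_point_arns(region: str, account_id: str, name: str) -> tuple:
    """Return (access point ARN, object ARN pattern) for IAM policies."""
    base = f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"
    return base, f"{base}/object/*"


def read_only_statements(region: str, account_id: str, name: str) -> list:
    """Build the ListBucket and GetObject statements for one access point."""
    ap_arn, object_arn = access_point_arns(region, account_id, name)
    return [
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": ap_arn},
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": object_arn},
    ]


statements = read_only_statements("ap-northeast-1", "123456789012", "fsxn-eda-s3ap")
```

ListBucket is scoped to the access point itself, while GetObject is scoped to the `/object/*` pattern beneath it.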

VPC network constraint (environment-specific observation)

Access Pattern	Observed Result	Notes
VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint)	⚠️ Timeout in this config	Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment
Internet → S3 AP (NetworkOrigin=Internet)	✅	Routes correctly with valid IAM credentials
VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC)	Supported per AWS docs; not verified in Phase 12	Requires VPC-origin AP and matching endpoint policy
VPC Lambda → ONTAP REST API	✅	Direct management LIF access

Important: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS documents that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.

Architectural implication for this pattern: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:

Run outside VPC (with Internet access)
Use NAT Gateway for outbound routing
Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions

Write support and practical constraints

FSx for ONTAP S3 Access Points support PutObject, DeleteObject, multipart uploads (CreateMultipartUpload, UploadPart, CompleteMultipartUpload), and other write operations; they are not read-only. The access point compatibility table documents the full list of supported S3 API operations.

However, S3 Access Points are not full S3 buckets. Key constraints include:

Maximum upload size: 5 GB
Only FSX_ONTAP storage class
Only SSE-FSX encryption
No ACLs (except bucket-owner-full-control), no Object Versioning, no Object Lock, no presigned URLs

All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.

12. Cross-Project Feedback — Template Hardening

During Phase 12, the companion project fsxn-observability-integrations reviewed our CloudFormation templates and provided actionable feedback. All items were applied:

Security Group: SourceSecurityGroupId over CIDR

Before (broad):

SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 9898
    ToPort: 9898
    CidrIp: "10.0.0.0/8"

After (precise):

SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: !Ref FPolicyPort
    ToPort: !Ref FPolicyPort
    SourceSecurityGroupId: !Ref FsxnSvmSecurityGroupId
    Description: FPolicy TCP from FSxN SVM Security Group

This limits inbound traffic to only the FSxN SVM's security group rather than the entire VPC CIDR — a significant security improvement for production deployments.

ONTAP CLI: Deprecated `vserver` prefix

ONTAP 9.11+ deprecates the vserver prefix on FPolicy commands. Updated all templates and documentation (8 languages) to use the recommended format:

# Deprecated (still works for backward compatibility)
vserver fpolicy policy external-engine create -vserver FSxN_OnPre ...

# Recommended (ONTAP 9.11+)
fpolicy policy external-engine create -vserver FSxN_OnPre ...

KMS Decrypt: When it's needed (and when it's not)

Added documentation clarifying SQS encryption behavior:

SqsManagedSseEnabled: true → kms:Decrypt is NOT needed (transparent)
KmsMasterKeyId: alias/aws/sqs → kms:Decrypt IS needed

Our templates use SqsManagedSseEnabled: true, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.

EC2 AMI: Removed redundant Docker install

ECS-optimized AMIs ({{resolve:ssm:/aws/service/ecs/optimized-ami/...}}) already include Docker. Removed the unnecessary yum install -y docker from UserData scripts.

Cpu/Memory: String type is intentional

Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with AllowedValues provides better validation than Number type for this constrained parameter space.
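The same constraint can be checked outside CloudFormation, for example in a pre-deployment script. The table below is a partial sketch covering task sizes up to 4 vCPU (memory in MiB); larger sizes are omitted, and the current AWS documentation should be treated as the source of truth.

```python
# Valid Fargate memory values (MiB) per CPU unit value, kept as strings to
# mirror the String-typed template parameters. Partial table for illustration.
FARGATE_MEMORY_MIB = {
    "256": ["512", "1024", "2048"],
    "512": [str(m) for m in range(1024, 4096 + 1, 1024)],
    "1024": [str(m) for m in range(2048, 8192 + 1, 1024)],
    "2048": [str(m) for m in range(4096, 16384 + 1, 1024)],
    "4096": [str(m) for m in range(8192, 30720 + 1, 1024)],
}


def is_valid_combination(cpu: str, memory: str) -> bool:
    """Return True if the CPU/Memory pair is in the table above."""
    return memory in FARGATE_MEMORY_MIB.get(cpu, [])
```

Keeping the values as strings means the check accepts exactly what the template's AllowedValues would accept.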

13. What's Next — Phase 13 Outlook

Phase 12 completes the operational hardening layer. The pipeline now has the production hardening baseline:

✅ Capacity guardrails preventing runaway auto-scaling
✅ Automated secrets rotation on 90-day cycle
✅ Proactive capacity forecasting with daily predictions
✅ SLO-based observability with alarm-driven alerting
✅ Data lineage tracking for audit and debugging
✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios
✅ Property-based testing catching real bugs

Ownership boundary

Layer	Primary Owner	Examples
Shared event platform	Platform / storage team	FPolicy server, SQS queue, EventBridge bus, Persistent Store
ONTAP operations	Storage team	SVM, volume, FPolicy policy, Persistent Store capacity
Security operations	Security / platform team	Secrets rotation, BREAK_GLASS approval, IAM policies
Workload UC	Application / data team	Step Functions, UC routing rules, output destinations
Observability	Platform + workload teams	SLO dashboard, UC-specific alarms, runbooks

Production Readiness Matrix

Capability	Phase 12 Status	Remaining Work
Capacity Guardrails	Verified (DRY_RUN/ENFORCE/BREAK_GLASS)	Approval workflow optional
Secrets Rotation	4-step rotation verified	Ensure all clients read from Secrets Manager
SLO Dashboard	Deployed, 4 alarms active	Runbooks and alarm response automation in Phase 13
Persistent Store Replay	5-event + 20-event scenarios verified	1000+ replay storm testing
S3AP Monitoring	ONTAP health path verified	Split S3AP health check (VPC-external)
Protobuf Framing	Property/integration tested	Live ONTAP protobuf wire validation
Multi-account OAM	Stack deployed conditionally	Second-account validation
Production UC E2E	Pipeline verified to SQS delivery	Full TriggerMode=EVENT_DRIVEN UC flow
Cost Dashboard	Not yet deployed	Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation

Phase 13 candidates

Operational readiness:

Canary S3AP check separation: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)
SLO violation runbooks: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)
Replay storm testing: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior

Enterprise deployment:

Multi-account OAM validation: Deploy workload-account-oam-link.yaml in a second AWS account
Shared platform vs workload boundary: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)
Production UC end-to-end: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify the complete flow from NFS file creation through Step Functions execution to output generation

Protocol and cost:

Protobuf live wire validation: Confirm protobuf TCP framing with NetApp support and validate AUTO_DETECT mode against real ONTAP protobuf traffic
Cost optimization dashboard: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics

Decision trees and operational guides:

Decision trees: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)
NetApp Partner Delivery Checklist: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover

Cost model awareness

While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:

Category	Cost Type	Driver
FPolicy server (Fargate/EC2)	Fixed baseline	Always-on listener
NAT Gateway	Fixed + per-GB	Required if VPC Lambda needs Internet-origin S3AP access
CloudWatch Synthetics	Per-canary-run	5-minute interval = 8,640 runs/month
CloudWatch custom metrics + Logs	Per-metric + per-GB ingested	SLO metrics, FPolicy server logs
DynamoDB (lineage + guardrails)	Per-request (PAY_PER_REQUEST)	Event volume dependent
SQS / EventBridge	Per-message / per-event	Event volume dependent
Persistent Store volume	Per-GB provisioned	Sized for max queued events during downtime
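The canary run count in the table follows from simple arithmetic, sketched here so other intervals can be compared. The 30-day month is an assumption, and no pricing figures are included.

```python
def canary_runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of canary runs in a month at a fixed interval."""
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * days


runs_at_5_min = canary_runs_per_month(5)    # 288 runs/day * 30 days
runs_at_15_min = canary_runs_per_month(15)  # 96 runs/day * 30 days
```

Moving from a 5-minute to a 15-minute interval cuts the run count to a third, at the cost of slower detection.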

Design decision for new deployments: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).

NetworkOrigin decision table

Based on AWS documentation, the following decision criteria apply:

Choose VPC-origin when:

All consumers are Lambda/ECS/EC2 inside the same VPC
Private connectivity is mandatory (no internet-routed path allowed)
VPC endpoint policy is part of the security boundary
Network restriction is built-in (cannot be accidentally misconfigured)

Choose Internet-origin when:

External accounts or on-premises clients need access
Consumers are outside the bound VPC
Internet-routed access with IAM controls is acceptable
Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC

Factor	VPC-origin	Internet-origin
Network enforcement	Built-in explicit Deny for non-VPC traffic	Policy-based only
VPC endpoint required	Yes (Gateway or Interface in bound VPC)	Only if using `aws:SourceVpc` conditions
Multi-VPC access	Via Interface endpoint + peering/TGW to bound VPC	Via policy conditions
Change access scope	Must recreate access point	Update policy
On-premises access	Via Interface endpoint in bound VPC	Direct with IAM credentials
Cost implication	VPC endpoint (Gateway=free, Interface=hourly)	NAT Gateway if VPC Lambda needs access

Critical: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.

Phase 12 readiness by workload type

Workload	Phase 12 Ready?	Notes
Controlled PoC / single-account	✅ Ready	All core components verified
Low/moderate event volume (< 100 events/day)	✅ Ready	20-event burst validated
DRY_RUN guardrail validation	✅ Ready	Safe to deploy immediately
Secrets rotation validation	✅ Ready	4-step rotation verified
High-volume replay storm (1000+ events)	⏳ Phase 13	Throughput curve and store capacity not yet measured
Multi-account production	⏳ Phase 13	OAM link deployed but second-account validation pending
Strict SLO operations requiring runbooks	⏳ Phase 13	Dashboard deployed, runbooks not yet written
Live protobuf production mode	⏳ Phase 13	Wire validation with NetApp support pending
Full EVENT_DRIVEN UC end-to-end	⏳ Phase 13	Pipeline verified to SQS, Step Functions flow pending

Phase 13 runbook scope: first-response diagnostic bundle

For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:

# FPolicy status
fpolicy show -vserver <SVM> -fields policy-name,status
fpolicy policy external-engine show -vserver <SVM>
fpolicy persistent-store show -vserver <SVM>

# Connection and event state
fpolicy show-engine -vserver <SVM>
fpolicy show-passthrough-read-connection -vserver <SVM>

# EMS logs for FPolicy events
event log show -messagename *fpolicy*

Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.

Deployed Infrastructure

7 CloudFormation stacks deployed and verified:

Stack	Status	Purpose
`fsxn-phase12-guardrails-table`	CREATE_COMPLETE	DynamoDB tracking table
`fsxn-phase12-lineage-table`	CREATE_COMPLETE	Data lineage DynamoDB + GSI
`fsxn-phase12-slo-dashboard`	CREATE_COMPLETE	CloudWatch dashboard + 4 alarms
`fsxn-phase12-oam-link`	CREATE_COMPLETE	Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13)
`fsxn-phase12-capacity-forecast`	CREATE_COMPLETE	Lambda + EventBridge schedule
`fsxn-phase12-secrets-rotation`	CREATE_COMPLETE	VPC Lambda + rotation config
`fsxn-phase12-synthetic-monitoring`	CREATE_COMPLETE	Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13

Test Results Summary

Category	Count	Type	Result
Unit Tests	116	Local (CI-reproducible)	✅ All pass
Property Tests (Hypothesis)	53	Local (CI-reproducible)	✅ All pass
CloudFormation Deployments	7 stacks	AWS integration	✅ All CREATE_COMPLETE
Lambda Invocations	2 (forecast + rotation)	AWS integration	✅ Successful
FPolicy E2E	1 pipeline test	AWS manual verification	✅ Event delivered
Replay E2E	5 events	AWS manual verification	✅ Zero loss
20-file burst	20 events	AWS manual verification	✅ Zero loss
Bugs found (property testing)	3	Local (CI-reproducible)	✅ All fixed

NetApp-Specific Takeaways

For NetApp users and partners evaluating this pattern:

FPolicy Persistent Store works as the durability layer for asynchronous non-mandatory FPolicy policies (NetApp docs), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).
S3 Access Points for FSx for ONTAP are not standard S3 buckets: they support selected S3 API operations including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).
NetworkOrigin is a design-time decision. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.
ONTAP-common vs AWS-specific: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.
Operational readiness requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.

The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.

Conclusion

Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.

The property-based testing investment paid immediate dividends: 3 real bugs discovered in 53 tests that example-based testing missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.

With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10 · Phase 11

Event-Driven Ransomware Detection with ONTAP ARP + Datadog

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 09:16:54 +0000

TL;DR

ONTAP's Autonomous Ransomware Protection (ARP) detects encryption patterns at the storage layer. When ARP fires, an EMS event is pushed via webhook to API Gateway → Lambda → Datadog. In my validation environment, end-to-end latency was around 30 seconds. This post shows how to wire it up, what the alert looks like, and how to respond.

The Threat Model

Ransomware can encrypt hundreds or thousands of files per minute. Traditional detection (antivirus signatures, host-based EDR) often catches it only after significant damage is done.

What if your storage could detect the encryption pattern before the host-based tools react?

That's exactly what ONTAP Autonomous Ransomware Protection (ARP) does. It runs ML-based entropy analysis at the storage layer, detecting:

Sudden spikes in file entropy (encryption)
Mass file extension changes (.docx → .encrypted)
Abnormal write patterns inconsistent with normal workload behavior

When ARP detects an attack, it changes the volume state to attack-detected and fires an EMS event. Our job is to get that event to the security team in seconds, not hours.

The Detection Pipeline

In Part 2, we built the audit log pipeline and showed Datadog search queries for file access events. Now we turn those patterns into event-driven security alerting — starting with ONTAP's most powerful detection signal: Autonomous Ransomware Protection.

ONTAP ARP detects encryption behavior
    │
    ▼ EMS event: arw.volume.state (severity: alert)
ONTAP EMS Webhook (HTTPS POST)
    │
    ▼
API Gateway (REST endpoint)
    │
    ▼
Lambda (EMS handler)
    │
    ▼ normalize → format → ship
Datadog Logs API v2 (source:fsxn-ems)
    │
    ▼
Datadog Monitor → PagerDuty / Slack / Email

End-to-end latency: around 30 seconds in my validation environment (ap-northeast-1). Your latency will vary depending on ONTAP event delivery, API Gateway/Lambda behavior, Datadog ingest latency, and notification routing.

Compare this to the audit log path (Part 2), which depends on rotation interval + scheduler frequency. EMS webhooks are event-driven rather than scheduled, delivering alerts within seconds rather than minutes.

Production security note: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Before production use, review API Gateway authorization, source IP restrictions, WAF, resource policies, IAM authorization, or a Lambda authorizer. Use the repository’s Security Review Checklist and Webhook Security Guide for the production baseline.

Deploying the EMS Integration

The EMS Lambda is deployed alongside the FPolicy shipping Lambda in a single stack. Note that the FPolicy TCP listener itself remains a separate ECS Fargate-based path (as described in Part 1) because ONTAP FPolicy requires a persistent TCP connection.

aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
    DatadogSite=ap1.datadoghq.com \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1

What Gets Created

Resource	Purpose
EMS Lambda	Receives EMS webhooks, normalizes, ships to Datadog
FPolicy Lambda	Receives FPolicy events from SQS, ships to Datadog
API Gateway (from shared EMS webhook stack)	HTTPS endpoint for ONTAP EMS webhooks
IAM Roles	Least-privilege for each Lambda
CloudWatch Log Groups	Execution logs

Webhook Security

For production, do not expose an unauthenticated webhook endpoint. ONTAP EMS webhook destinations support HTTPS and mutual authentication options. Use HTTPS for the API Gateway endpoint, restrict access where possible, and consider validating a shared secret or header in the Lambda handler.
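A shared-secret check in the handler could look like the sketch below. The header name is hypothetical, and whether your ONTAP EMS destination can attach such a header is not confirmed here; treat this as one layer alongside API Gateway controls, not a replacement for them.

```python
import hmac


def is_authorized(
    event: dict,
    expected_secret: str,
    header_name: str = "x-ems-shared-secret",  # hypothetical header name
) -> bool:
    """Check a shared secret from the request headers in constant time."""
    headers = event.get("headers") or {}
    # Header names are case-insensitive, so normalize before comparing.
    supplied = next(
        (v for k, v in headers.items() if k.lower() == header_name.lower()),
        None,
    )
    if not supplied or not expected_secret:
        return False
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

`hmac.compare_digest` avoids leaking how many leading characters matched, which a plain `==` comparison can do through timing.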

ONTAP EMS Configuration

After deployment, configure ONTAP EMS to forward ARP-related events to the API Gateway endpoint. At minimum, include arw.volume.state and other arw.* events you want to monitor. Refer to the NetApp EMS webhook documentation for destination and filter configuration.

The EMS Lambda Handler

The handler receives an API Gateway proxy event containing the EMS webhook payload:

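import logging
from typing import Any

logger = logging.getLogger(__name__)

# get_api_key and the _-prefixed helpers are defined elsewhere in the handler module.
def lambda_handler(event: dict, context: Any) -> dict: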
    """Process EMS webhook from ONTAP via API Gateway."""
    api_key = get_api_key()
    request_id = _get_request_id(event)

    logger.info("EMS handler invoked: requestId=%s", request_id)

    # Extract EMS events from webhook body
    ems_events = _extract_ems_events(event)
    logger.info("Parsed %d EMS event(s)", len(ems_events))

    # Normalize to common schema
    normalized = _normalize_ems_events(ems_events)

    # Format for Datadog
    dd_logs = _format_for_datadog(normalized)

    # Ship to Datadog
    shipped = _ship_to_datadog(dd_logs, api_key)

    return _api_response(200, {
        "message": "EMS events processed",
        "total_events": len(ems_events),
        "shipped": shipped,
    })
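The handler calls _extract_ems_events(), which is not shown in this post. The following is a minimal sketch of what such a step might do, under the loudly stated assumption that the webhook body is JSON (a single object or a list of objects). The function name is hypothetical, and the real payload format should be confirmed against your ONTAP version and the repository code.

```python
import base64
import json


def extract_ems_events_sketch(event: dict) -> list:
    """Pull EMS event dicts out of an API Gateway proxy event body."""
    body = event.get("body") or ""
    # API Gateway may deliver the body base64-encoded.
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    if not body.strip():
        return []
    payload = json.loads(body)
    if isinstance(payload, dict):
        return [payload]
    if isinstance(payload, list):
        return [item for item in payload if isinstance(item, dict)]
    return []
```

Returning an empty list for an empty or non-object body lets the rest of the handler run unchanged and report zero events.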

EMS Event Normalization

ONTAP EMS events arrive with fields like messageName, severity, node, svmName, parameters. The handler normalizes them:

import json
from datetime import datetime, timezone

def _normalize_ems_events(events: list[dict]) -> list[dict]:
    """Normalize raw EMS events to internal schema."""
    normalized = []
    for event in events:
        normalized.append({
            "event_name": event.get("messageName", "unknown"),
            "severity": event.get("severity", "info"),
            "source_node": event.get("node", ""),
            "svm": event.get("svmName", ""),
            "message": event.get("message", json.dumps(event)),
            "parameters": event.get("parameters", {}),
            "timestamp": event.get("time", datetime.now(timezone.utc).isoformat()),
        })
    return normalized

Datadog Formatting (source:fsxn-ems)

def _format_for_datadog(events: list[dict]) -> list[dict]:
    """Format normalized EMS events for Datadog Logs API v2."""
    dd_logs = []
    for event in events:
        dd_logs.append({
            "ddsource": "fsxn-ems",
            "ddtags": f"source:fsxn-ems,service:{DD_SERVICE},env:{DD_ENV}",
            "hostname": event["source_node"],
            "service": DD_SERVICE,
            "message": event["message"],
            "date": event["timestamp"],
            "attributes": {
                "event_name": event["event_name"],
                "severity": event["severity"],
                "source_node": event["source_node"],
                "svm": event["svm"],
                "parameters": event["parameters"],
            },
        })
    return dd_logs

ARP Event Payload (Normalized by Lambda)

ONTAP EMS webhooks deliver event notifications to the API Gateway endpoint. The Lambda's _extract_ems_events() function parses the incoming API Gateway proxy event body, then _normalize_ems_events() produces the following internal schema:

{
  "event_name": "arw.volume.state",
  "severity": "alert",
  "source_node": "fsxn-node-01",
  "svm": "svm-prod-01",
  "timestamp": "2026-05-17T01:04:22Z",
  "message": "Anti-ransomware: Volume vol_data state changed to attack-detected",
  "parameters": {
    "volume_name": "vol_data",
    "state": "attack-detected"
  }
}

In Datadog, this arrives as:

source:fsxn-ems
host:fsxn-node-01
service:fsxn-ontap
@attributes.event_name:arw.volume.state
@attributes.severity:alert
@attributes.svm:svm-prod-01
@attributes.parameters.volume_name:vol_data
@attributes.parameters.state:attack-detected
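
The body of _extract_ems_events() is not shown in this post, so the following is only a minimal sketch of the parsing step, under the assumption that the webhook body is a JSON object or a JSON array of objects. The function name extract_ems_events and the handling of isBase64Encoded are illustrative, not the repository's code.

```python
import base64
import json


def extract_ems_events(apigw_event: dict) -> list[dict]:
    """Return EMS event dicts found in an API Gateway proxy event body."""
    body = apigw_event.get("body") or ""
    # API Gateway proxy events flag base64-encoded bodies with isBase64Encoded.
    if apigw_event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    if not body.strip():
        return []
    payload = json.loads(body)
    # Assumption: the body is either one event object or a JSON array of them.
    if isinstance(payload, dict):
        return [payload]
    if isinstance(payload, list):
        return [item for item in payload if isinstance(item, dict)]
    return []


sample = {"body": json.dumps({"messageName": "arw.volume.state", "severity": "alert"})}
events = extract_ems_events(sample)
```

Returning a list in every case keeps the downstream normalization loop identical for single-event and batched deliveries.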

Setting Up the Datadog Monitor

Create a Monitor that triggers whenever ARP reports the attack-detected state:

Monitor Configuration

Log Explorer search query:

source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected

Datadog Monitor API JSON:

{
  "name": "🚨 FSx for ONTAP: Ransomware Detected (ARP)",
  "type": "log alert",
  "query": "logs(\"source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected\").index(\"*\").rollup(\"count\").last(\"5m\") > 0",
  "message": "🚨 ONTAP Autonomous Ransomware Protection detected suspicious activity.\n\n**Volume**: {{attributes.parameters.volume_name}}\n**SVM**: {{attributes.svm}}\n**Node**: {{host}}\n**Time**: {{date}}\n\n## Recommended Actions\n1. Verify the ARP event in ONTAP and Datadog.\n2. Check FPolicy/audit logs for user/client IP correlation.\n3. Follow the approved storage incident response runbook for snapshot, access restriction, or recovery actions.\n\n@pagerduty @slack-security-alerts",
  "options": {
    "thresholds": { "critical": 0 },
    "notify_no_data": false,
    "evaluation_delay": 0
  }
}

What This Monitor Does

Triggers on: Any arw.volume.state event with state:attack-detected
Threshold: Critical when count > 0 in a 5-minute window
Notification: PagerDuty + Slack with volume name, SVM, and response steps
No-data handling: Disabled (absence of ARP events is normal)

Adjust template variables ({{attributes.*}}, {{host}}, {{date}}) based on how your Datadog site renders log attributes in monitor notifications. Test with a simulated event before relying on production alerts.
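
If you prefer to create the Monitor from a script rather than the UI, a sketch like the one below builds the request for Datadog's Monitors API (POST /api/v1/monitor, which needs both an API key and an application key). The helper name build_monitor_request and the shortened monitor message are assumptions for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_monitor_request(
    monitor: dict, site: str, api_key: str, app_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Datadog Monitors API."""
    return urllib.request.Request(
        url=f"https://api.{site}/api/v1/monitor",
        data=json.dumps(monitor).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )


monitor_definition = {
    "name": "FSx for ONTAP: Ransomware Detected (ARP)",
    "type": "log alert",
    "query": (
        'logs("source:fsxn-ems @attributes.event_name:arw.volume.state '
        '@attributes.parameters.state:attack-detected")'
        '.index("*").rollup("count").last("5m") > 0'
    ),
    "message": "ARP detected suspicious activity.",  # shortened placeholder text
    "options": {"thresholds": {"critical": 0}, "notify_no_data": False},
}
request = build_monitor_request(
    monitor_definition, "ap1.datadoghq.com", "<api-key>", "<app-key>"
)
# To actually create the monitor: urllib.request.urlopen(request, timeout=10)
```

Keeping the monitor definition in version control alongside the CloudFormation template makes it easier to review threshold changes.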

FPolicy: The Complementary Signal

While ARP detects the encryption pattern, FPolicy provides the file-level detail. Together they answer:

Question	Source
Is ransomware active?	ARP (EMS)
Which files are affected?	FPolicy
Who is doing it?	FPolicy (`user` field)
From where?	FPolicy (`client_ip` field)
What operations?	FPolicy (`operation`: create, write, rename, delete)

FPolicy Event in Datadog

source:fsxn-fpolicy
@attributes.operation:create
@attributes.file_path:/vol/data/finance/confidential_report.xlsx
@attributes.user:suspicious_user@corp.local
@attributes.client_ip:10.0.1.55
@attributes.protocol:cifs

Correlation Query

After an ARP alert, investigate with FPolicy data:

source:fsxn-fpolicy @attributes.svm:svm-prod-01 @attributes.operation:(create OR write OR rename)

This shows all file modifications on the affected SVM, helping identify the responsible user and client.
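
The same correlation can be done offline on exported FPolicy events. The sketch below counts modifying operations per (user, client_ip) pair on one SVM. It assumes each event is a flat dict with the svm, operation, user, and client_ip fields shown above; the function name and sample data are hypothetical.

```python
from collections import Counter

MODIFYING_OPERATIONS = {"create", "write", "rename"}


def summarize_modifiers(
    fpolicy_events: list[dict], svm: str
) -> list[tuple[tuple[str, str], int]]:
    """Count modifying operations per (user, client_ip) on one SVM, busiest first."""
    counts: Counter = Counter()
    for event in fpolicy_events:
        if event.get("svm") != svm:
            continue
        if event.get("operation") not in MODIFYING_OPERATIONS:
            continue
        key = (event.get("user", "unknown"), event.get("client_ip", "unknown"))
        counts[key] += 1
    return counts.most_common()


sample_events = [
    {"svm": "svm-prod-01", "operation": "rename", "user": "suspicious_user@corp.local", "client_ip": "10.0.1.55"},
    {"svm": "svm-prod-01", "operation": "write", "user": "suspicious_user@corp.local", "client_ip": "10.0.1.55"},
    {"svm": "svm-prod-01", "operation": "create", "user": "admin@corp.local", "client_ip": "10.0.1.50"},
    {"svm": "svm-prod-01", "operation": "read", "user": "admin@corp.local", "client_ip": "10.0.1.50"},
    {"svm": "svm-dev-01", "operation": "write", "user": "dev@corp.local", "client_ip": "10.0.2.10"},
]
top_modifiers = summarize_modifiers(sample_events, "svm-prod-01")
```

A high count from a single user and client IP is a starting point for the investigation, not proof of malicious activity.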

Incident Response Workflow

1. ARP fires → EMS webhook → Datadog alert (around 30 seconds)
     │
2. Responder receives PagerDuty/Slack notification
     │
3. Verify in Datadog and ONTAP:
   - source:fsxn-ems → confirm ARP event details
   - source:fsxn-fpolicy → identify user, IP, affected files
   - ONTAP: security anti-ransomware volume show
     │
4. Correlate and assess:
   - Is this a true positive or legitimate bulk operation?
   - What is the blast radius (volumes, files, users)?
     │
5. Containment (only after verification, per approved runbook):
   - Create snapshot (preserve recovery point)
   - Restrict volume access if confirmed malicious
   - Review ARP suspect list
     │
6. Recovery:
   - Restore from snapshot (pre-attack state)
   - Re-enable access after containment
   - Update audit policies if gaps found

Important: ARP alerts are high-confidence signals, but false positives can occur (e.g., legitimate backup encryption, bulk file operations). Always verify before applying disruptive containment actions such as restricting volume access. Follow your organization's incident response process.

For a more detailed role-based runbook, see the repository's ARP Incident Response Guide.

Beyond ARP: Other EMS Use Cases

The same EMS webhook pipeline handles other critical ONTAP events:

EMS Event	Severity	Use Case
`arw.volume.state`	alert	Ransomware detection
`wafl.quota.softlimit.exceeded`	warning	Capacity planning
`wafl.quota.hardlimit.exceeded`	error	Immediate capacity action
`cf.fsm.takeover`	alert	HA failover notification
`sms.vol.full`	error	Volume full — data at risk
`net.linkDown`	warning	Network connectivity issue

All arrive in Datadog as source:fsxn-ems with the event name in @attributes.event_name, enabling targeted Monitors for each scenario. For the full cross-vendor field mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.

Validation Results

This integration was validated end-to-end:

Test	Result	Latency
ARP event → Datadog	✅ Arrived	~30 seconds
Quota exceeded → Datadog	✅ Arrived	~30 seconds
FPolicy file create → Datadog	✅ Arrived (via SQS → Lambda path)	~30 seconds
Lambda error handling	✅ DLQ capture	—
API key from Secrets Manager	✅ Cached	—

Validation performed in ap-northeast-1 with the deployed fsxn-datadog-ems-fpolicy stack.

Design Considerations for Security Teams

Webhook security: Use HTTPS for EMS webhook delivery. Do not expose an unauthenticated API Gateway endpoint in production. Validate a shared secret, header, or mTLS identity where possible.

Detection latency: EMS webhooks are event-driven. ARP detection itself depends on ONTAP's ML model — it typically fires within seconds of detecting the pattern, not after a fixed interval. End-to-end latency from ARP detection to Datadog visibility depends on webhook delivery, Lambda processing, and Datadog ingest.

False positives: ARP can trigger on legitimate bulk encryption operations (e.g., backup software encrypting files). Design your response workflow to include a verification step before disruptive actions like restricting volume access.

Coverage: ARP behavior depends on your ONTAP version, volume type, and whether ARP/AI is available. Older NAS FlexVol configurations may start in learning mode before active detection, while newer ONTAP versions (9.16.1+ with ARP/AI) can become active immediately for supported volumes. Always verify security anti-ransomware volume show before relying on alerts.

Audit trail: The EMS event in Datadog serves as the detection timestamp for incident timelines. FPolicy events provide the forensic detail. Together they form a complete audit trail from detection to response.

Cost profile: EMS events are usually low-volume and alert-oriented, while FPolicy can be high-volume depending on policy scope. Treat their Datadog ingest and alerting cost profiles separately.

Try It Yourself

If you want the shortest path to a first successful ARP alert test, see the repository's minimum quick start.

The following simulated event exercises the Lambda normalization and Datadog shipping path. Your actual ONTAP EMS webhook payload may differ depending on EMS webhook configuration, so validate with a real EMS event before production use.

# Deploy EMS + FPolicy integration
aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn=<your-secret-arn> \
    DatadogSite=ap1.datadoghq.com \
  --capabilities CAPABILITY_NAMED_IAM

# Create a test event file
cat > arp-test-event.json <<EOF
{
  "body": "{\"messageName\":\"arw.volume.state\",\"severity\":\"alert\",\"node\":\"fsxn-node-01\",\"svmName\":\"svm-prod-01\",\"time\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"message\":\"Anti-ransomware: Volume vol_data state changed to attack-detected\",\"parameters\":{\"volume_name\":\"vol_data\",\"state\":\"attack-detected\"}}",
  "requestContext": {"requestId": "test"}
}
EOF

# Invoke Lambda with the test event
aws lambda invoke \
  --function-name fsxn-datadog-ems-fpolicy-ems \
  --payload file://arp-test-event.json \
  --cli-binary-format raw-in-base64-out \
  --region ap-northeast-1 \
  arp-test-output.json

# Check Datadog: source:fsxn-ems @attributes.event_name:arw.volume.state

What's Next

This completes the Datadog series:

Part 1: Architecture and project introduction
Part 2: Audit log pipeline implementation
Part 3: Event-driven ransomware detection (this post)

Coming up next in the series:

Splunk: Replacing EC2 + Universal Forwarder with Lambda + HEC
OpenTelemetry: The vendor-neutral escape hatch
Grafana Cloud: Loki Push API with label cardinality guidance

Each will follow the same pattern: deploy, validate, document the gotchas.

Have questions about ARP detection or the EMS pipeline? Drop a comment below.

Previous: Part 2 — Shipping FSx for ONTAP Logs to Datadog, The Serverless Way

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

Shipping FSx for ONTAP Logs to Datadog — The Serverless Way

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 09:16:31 +0000

TL;DR

Deploy a CloudFormation stack, configure ONTAP audit logging, and see structured file access events in Datadog Log Explorer within minutes — no EC2, no NFS mounts, no agents. This post walks through the full implementation: CloudFormation template, Lambda handler code, Datadog field mapping, and operational validation.

What We're Building

In Part 1, I introduced the architecture: FSx for ONTAP audit volume → S3 Access Point → EventBridge Scheduler → Lambda → Datadog. Now let's build it.

By the end of this post, you'll have:

A deployed CloudFormation stack with Lambda, Scheduler, DLQ, and alarms
ONTAP audit events flowing into Datadog Log Explorer
Structured attributes (@attributes.svm, @attributes.user, @attributes.operation, @attributes.path, @attributes.client_ip, @attributes.result) ready for search, filtering, and Datadog facet creation
An operational CloudWatch dashboard monitoring pipeline health

Prerequisites

Before deploying, you need:

FSx for ONTAP file system with an SVM configured for audit logging
FSx for ONTAP S3 Access Point attached to the audit volume
Datadog account (free trial works) with an API Key
API Key in Secrets Manager:

aws secretsmanager create-secret \
  --name fsxn-datadog-api-key \
  --secret-string '{"api_key":"<your-dd-api-key>"}' \
  --region ap-northeast-1

ONTAP audit logging enabled:

# Time-based rotation for quick validation
vserver audit create -vserver <svm-name> -destination /audit_log \
  -events file-ops \
  -format evtx \
  -rotate-schedule-minute 0,5,10,15,20,25,30,35,40,45,50,55
vserver audit enable -vserver <svm-name>

For quick validation, use time-based rotation. If you only use -rotate-size, low-volume environments may not produce rotated audit files within the expected validation window. Adjust the -events list based on what you want to audit.

Important: Enabling vserver audit is only one part of file access auditing. Make sure the target SMB folders have SACLs configured, or NFSv4 ACL audit flags are set for NFS workloads. Otherwise, the audit pipeline may be healthy but no file access events will be generated.

For detailed ONTAP-side setup, including audit volume sizing, SACL/NFSv4 ACL examples, and source health checks, see the repository's ONTAP Audit Setup Guide and Operational Guide.

Verify how audit files appear via S3 API (to set AuditLogPrefix correctly):

aws s3api list-objects-v2 \
  --bucket <fsx-s3-access-point-arn-or-alias> \
  --max-keys 10 \
  --region ap-northeast-1

Set AuditLogPrefix to match the key prefix you see. If the access point is attached directly to the audit volume root, this may be empty.

Note: /audit_log is the ONTAP namespace path. The S3 object key prefix can differ depending on the access point attachment, so always verify with list-objects-v2.

The CloudFormation Stack

The Datadog integration deploys as a single self-contained stack:

aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
    DatadogSite=ap1.datadoghq.com \
    AuditLogPrefix=<prefix-from-list-objects-v2> \
    ScheduleRate="rate(5 minutes)" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1

What Gets Created

Resource	Purpose
Lambda Function	Reads audit logs from S3 AP, parses EVTX/XML, ships to Datadog
EventBridge Scheduler	Invokes Lambda every 5 minutes
Scheduler IAM Role	Allows Scheduler to invoke Lambda
Lambda Execution Role	S3 AP read, Secrets Manager read, CloudWatch Logs, DLQ send permissions
Dead Letter Queue (SQS)	Captures failed events for replay
CloudWatch Alarms (3)	Errors, throttles, DLQ depth
CloudWatch Dashboard	Operational health: errors, duration, invocations, DLQ
CloudWatch Log Group	Lambda execution logs (30-day retention)

Key Parameters

Parameter	Required	Description
`FsxS3AccessPointArn`	✅	FSx for ONTAP S3 Access Point ARN
`DatadogApiKeySecretArn`	✅	Secrets Manager ARN for the API key
`DatadogSite`	❌	Datadog site (default: `ap1.datadoghq.com`)
`ScheduleRate`	❌	Processing frequency (default: `rate(5 minutes)`)
`AuditLogPrefix`	❌	Object key prefix as seen via S3 API. Leave empty if audit files appear at the access point root.
`VpcEnabled`	❌	Enable VPC config — requires NAT Gateway

The Lambda Handler

The handler follows a straightforward flow:

Scheduled invocation
  → List objects from FSx for ONTAP S3 AP (via S3 ListObjectsV2)
  → Filter by checkpoint (skip already-processed files)
  → For each new file:
      → Read via S3 GetObject
      → Detect format (EVTX magic bytes or XML declaration)
      → Parse into normalized events
      → Format for Datadog Logs API v2
      → Batch (≤5MB, ≤1000 items per request)
      → Ship with exponential backoff (max 3 attempts)
  → Update checkpoint
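
The format detection step in the flow above can be sketched as follows. EVTX files begin with the 8-byte signature ElfFile followed by a NUL byte, and the XML branch looks for an XML declaration as described. The function name and the "unknown" fallback are assumptions, not the repository's implementation.

```python
EVTX_MAGIC = b"ElfFile\x00"
UTF8_BOM = b"\xef\xbb\xbf"


def detect_audit_format(data: bytes) -> str:
    """Classify an audit file as 'evtx', 'xml', or 'unknown' from its first bytes."""
    if data.startswith(EVTX_MAGIC):
        return "evtx"
    head = data[:64]
    # Tolerate a UTF-8 byte order mark and leading whitespace before the declaration.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"
```

Only the first bytes are inspected, so the check stays cheap even for large rotated files.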

Datadog API Limits

The Datadog Logs API v2 enforces the following per-request limits (docs):

Maximum payload size (uncompressed): 5MB
Maximum size for a single log: 1MB (larger logs are truncated, not rejected)
Maximum array size: 1000 entries

The shipper batches conservatively below these limits.
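
To illustrate what batching below these limits can look like, here is a minimal sketch. The 4.5MB safety margin and the function name create_batches are assumptions; the repository's own _create_batches may differ.

```python
import json
from collections.abc import Iterator

MAX_BATCH_BYTES = 4_500_000  # assumed safety margin below the 5MB payload limit
MAX_BATCH_ITEMS = 1000       # maximum array size per request


def create_batches(logs: list[dict]) -> Iterator[list[dict]]:
    """Yield lists of logs that stay under the item and byte limits."""
    batch: list[dict] = []
    batch_bytes = 2  # the enclosing "[" and "]"
    for log in logs:
        # +2 covers the ", " separator that json.dumps emits between items.
        size = len(json.dumps(log).encode("utf-8")) + 2
        too_many = len(batch) >= MAX_BATCH_ITEMS
        too_big = batch_bytes + size > MAX_BATCH_BYTES
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 2
        batch.append(log)
        batch_bytes += size
    if batch:
        yield batch
```

A single log larger than the byte budget still ends up in a batch of its own, so oversized records need separate handling if they can occur in your environment.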

Core Shipping Logic

def _ship_to_datadog(logs: list[dict], api_key: str) -> int:
    """Ship normalized logs to Datadog Logs Intake API v2.

    If any batch fails after retries, raise an exception so the Lambda
    invocation is treated as failed and the checkpoint is not advanced.
    """
    shipped = 0
    failed_batches = 0

    for batch in _create_batches(logs):
        if _send_batch(batch, api_key):
            shipped += len(batch)
        else:
            failed_batches += 1

    if failed_batches:
        raise RuntimeError(f"{failed_batches} batch(es) failed after retries")

    return shipped

Checkpoint Semantics

The checkpoint is advanced only after all batches for an audit log file are successfully delivered to Datadog. If any batch fails after retries, the Lambda invocation fails (raises an exception) and the checkpoint is not updated.

This makes the pipeline at-least-once: the same audit file may be retried on the next scheduled invocation, so downstream queries should tolerate duplicate events. For production, consider adding a deterministic event ID derived from the audit file key and event record offset to support deduplication where your observability platform supports it.
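
A deterministic event ID of the kind mentioned above could be derived like this. It is a sketch only: the function name, the NUL separator, and the 32-character truncation are my assumptions, and record_offset stands for whatever stable position your parser exposes for each record.

```python
import hashlib


def event_id(file_key: str, record_offset: int) -> str:
    """Derive a stable ID from the audit file key and the record's position in it."""
    # The NUL separator keeps ("a1", 2) and ("a", 12) from hashing the same string.
    raw = f"{file_key}\x00{record_offset}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


first = event_id("audit_log/audit_svm_20260517.evtx", 0)
retry = event_id("audit_log/audit_svm_20260517.evtx", 0)
```

Because the ID depends only on the file key and the record position, a retried file produces the same IDs, which is what makes deduplication possible downstream.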

Because EventBridge Scheduler invokes Lambda asynchronously, a failed invocation (unhandled exception) triggers Lambda's built-in retry behavior (up to 2 retries by default). After all retries are exhausted, the event payload is sent to the configured DLQ.

Retry with Exponential Backoff

def _send_batch(batch: list[dict], api_key: str) -> bool:
    """Send a single batch with retry on 429/5xx, up to MAX_RETRIES attempts."""
    for attempt in range(MAX_RETRIES):
        response = http.request(
            "POST",
            DATADOG_LOGS_URL,
            body=json.dumps(batch).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "DD-API-KEY": api_key,
            },
        )
        if response.status < 300:
            return True
        if response.status == 429 or response.status >= 500:
            time.sleep(2 ** attempt + random.uniform(0, 1))  # jitter
            continue
        # Client error (4xx) — don't retry
        return False
    return False

The implementation uses exponential backoff with jitter (2^attempt + random) to avoid synchronized retries when multiple Lambda invocations hit vendor-side throttling simultaneously. Note that MAX_RETRIES in the code represents the total number of attempts, not retries after an initial attempt.

API Key Caching

The API key is fetched from Secrets Manager once per Lambda execution context (cold start) and cached in a module-level variable. This avoids per-invocation Secrets Manager calls:

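_api_key_cache: str | None = None

def get_api_key() -> str:
    """Return the Datadog API key, fetched once per execution context."""
    global _api_key_cache
    if _api_key_cache:
        return _api_key_cache
    raw = secrets_client.get_secret_value(SecretId=API_KEY_SECRET_ARN)["SecretString"]
    try:
        secret = json.loads(raw)
    except json.JSONDecodeError:
        secret = None  # plain-string secret rather than a JSON document
    if isinstance(secret, dict):
        # Accept either key name; fall back to the raw string if neither is set.
        _api_key_cache = secret.get("api_key") or secret.get("dd_api_key") or raw
    else:
        _api_key_cache = raw
    return _api_key_cache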

Datadog Field Mapping

Every audit event arrives in Datadog with structured attributes. The Lambda sends these via the Datadog Logs API v2 payload fields (ddsource, hostname, service, message) and custom attributes nested under attributes:

Datadog Log Explorer	Payload Field	ONTAP Source	Example
`source`	`ddsource`	Configured	`fsxn`
`service`	`service`	Configured	`fsxn-ontap`
`host`	`hostname`	SVM name	`svm-prod-01`
`@attributes.svm`	`attributes.svm`	SVMName / Computer	`svm-prod-01`
`@attributes.user`	`attributes.user`	UserName / SubjectUserName	`admin@corp.local`
`@attributes.client_ip`	`attributes.client_ip`	ClientIP / IpAddress	`10.0.1.50`
`@attributes.operation`	`attributes.operation`	Operation / ObjectType	`ReadData`
`@attributes.path`	`attributes.path`	ObjectName	`/vol/data/reports/q4.xlsx`
`@attributes.result`	`attributes.result`	Result / Keywords	`Success`
`@attributes.event_type`	`attributes.event_type`	EventID	`4663`
`@attributes._pipeline.processed_at`	`attributes._pipeline.processed_at`	Lambda timestamp	`2026-05-17T01:30:00Z`
`@attributes._pipeline.source_file`	`attributes._pipeline.source_file`	S3 object key	`audit_log/audit_svm_20260517.evtx`

Set DatadogSite to your Datadog site, such as datadoghq.com (US1), datadoghq.eu (EU1), or ap1.datadoghq.com (AP1/Tokyo). The site determines the API endpoint.
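
How the site maps to the Logs intake endpoint can be sketched as follows, using Datadog's documented http-intake.logs.<site> host pattern. The function name and the input validation are assumptions; how the repository builds DATADOG_LOGS_URL is not shown in this post.

```python
def logs_intake_url(site: str) -> str:
    """Build the Logs API v2 intake URL for a Datadog site such as ap1.datadoghq.com."""
    site = site.strip().lower()
    # Expect a bare site name, not a full URL.
    if not site or "/" in site or site.startswith("http"):
        raise ValueError(f"expected a bare Datadog site name, got {site!r}")
    return f"https://http-intake.logs.{site}/api/v2/logs"
```

Sending to the wrong site is a common cause of rejected API keys, because keys are scoped to the site where the organization lives.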

For the full cross-vendor mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.

Datadog Search Queries

# All FSx for ONTAP audit events
source:fsxn

# Failed access attempts
source:fsxn @attributes.result:Failure

# Specific user activity
source:fsxn @attributes.user:"admin@corp.local"

# Delete operations on sensitive paths (wildcards only work outside double quotes)
source:fsxn @attributes.operation:delete @attributes.path:/vol/data/confidential/*

# Pipeline processing metadata
source:fsxn @attributes._pipeline.source_file:*

In Part 3, we'll turn these queries into Datadog Monitors for ARP ransomware detection and suspicious file activity alerting.

Investigation Query Starters

When investigating an incident, start with these patterns:

Question	Search query	Then group by
What did this user do?	`source:fsxn @attributes.user:"suspect@corp.local"`	`@attributes.operation` or `@attributes.path`
Who accessed this file?	`source:fsxn @attributes.path:"/vol/data/secret.pdf"`	`@attributes.user`
Which clients generated failures?	`source:fsxn @attributes.result:Failure`	`@attributes.client_ip`
Where are deletes concentrated?	`source:fsxn @attributes.operation:delete`	`@attributes.path` or a path prefix
What happened on this SVM in the last hour?	`source:fsxn @attributes.svm:svm-prod-01`	`@attributes.operation`

For high-volume environments, avoid grouping by full file path unless needed. Consider deriving a lower-cardinality field such as a path prefix or data area classification.
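
One way to derive such a lower-cardinality field is to keep only the first few path components at normalization time. This is a sketch; the function name and the default depth of 3 are assumptions you would tune to your volume layout.

```python
def path_prefix(path: str, depth: int = 3) -> str:
    """Keep only the first `depth` components of a file path."""
    parts = [part for part in path.split("/") if part]
    prefix = "/".join(parts[:depth])
    # Preserve the leading slash of absolute paths.
    return f"/{prefix}" if path.startswith("/") else prefix


prefix = path_prefix("/vol/data/reports/q4.xlsx")
```

With the default depth, /vol/data/reports/q4.xlsx and every other file under /vol/data/reports share one value, which keeps group-by results readable.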

Operational Validation

For a structured PoC sign-off, use the repository’s PoC Success Criteria document. It defines minimum success, operational success, and production-readiness gates across audit logs, EMS, FPolicy, and multi-backend patterns.

Quick Validation (5–10 minutes)

With a 5-minute audit rotation and 5-minute Scheduler interval, the first events typically appear within a few minutes, but allow up to 10 minutes depending on timing.

Before waiting for logs, generate a test file operation on the audited SMB/NFS share — such as creating and deleting a small test file — to ensure ONTAP produces an audit event.

# 0. Get stack outputs (log group name, DLQ URL, etc.)
aws cloudformation describe-stacks \
  --stack-name fsxn-datadog-integration \
  --query 'Stacks[0].Outputs' \
  --region ap-northeast-1

# 1. Confirm Scheduler is invoking Lambda
aws logs filter-log-events \
  --log-group-name <LambdaLogGroupName from outputs> \
  --start-time $(python3 -c "import time; print(int((time.time()-300)*1000))") \
  --region ap-northeast-1

# 2. Confirm DLQ is empty
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names All \
  --query 'Attributes.ApproximateNumberOfMessages'

# 3. Search in Datadog
#    source:fsxn

CloudWatch Dashboard

The stack includes a pre-built dashboard (fsxn-datadog-integration-health) with:

Lambda Errors & Throttles
Lambda Duration (avg/max)
Lambda Invocations
DLQ Depth

For production, consider publishing custom metrics such as files processed, events shipped, batch failures, and checkpoint lag to gain deeper pipeline observability beyond Lambda-level metrics.

What to Watch For

Symptom	Likely Cause	Fix
No logs in Datadog	Scheduler not running, or no new audit files	Check CloudWatch Logs for Lambda invocations
Logs arrive but fields are empty	EVTX/XML parsing issue	Check `@attributes.event_type` — if "unknown", parser needs tuning
DLQ messages appearing	Datadog API rejection	Check API key validity, site configuration, timestamp age
Lambda timeout	S3 AP access issue (VPC Gateway EP?)	Verify NAT Gateway or deploy Lambda outside VPC

Troubleshooting

Old Timestamps May Not Appear in Log Explorer

The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. If your audit files are rotated or processed too late, older events may not appear as expected in Log Explorer.

Fix: Use a time-based ONTAP audit rotation schedule and a Scheduler frequency that keeps processing well within the 18-hour window.

Gzip Compression Issue (AP1 Site)

During E2E validation, gzip-compressed payloads were accepted (HTTP 202) but not indexed on the AP1 site. The ENABLE_GZIP parameter defaults to false for this reason.

S3 Access Point Timeout in VPC

If Lambda is in a VPC with only an S3 Gateway Endpoint, reads from FSx for ONTAP S3 Access Points will time out. Add a NAT Gateway or deploy Lambda outside the VPC.

Day-2 Operations

DLQ Replay

This stack uses an SQS queue as the Lambda asynchronous invocation DLQ. Because the DLQ is attached to Lambda (not an SQS source queue), sqs start-message-move-task cannot redrive messages automatically.

For replay, inspect the DLQ message, identify the failed invocation payload, and re-invoke Lambda manually:

# Inspect failed messages
aws sqs receive-message \
  --queue-url <dlq-url> \
  --max-number-of-messages 1 \
  --attribute-names All \
  --message-attribute-names All

After fixing the root cause (e.g., expired API key, Datadog site misconfiguration), re-run the scheduled processor:

aws lambda invoke \
  --function-name <lambda-function-name> \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  --region ap-northeast-1 \
  replay-output.json

In this pattern, replay usually means re-running the scheduled processor after fixing the root cause. Because the checkpoint is not advanced on failed delivery, the same audit file remains eligible for processing on the next invocation. This does not re-submit the DLQ message itself — it re-runs the processor so files whose checkpoints were not advanced can be picked up again.

For production, consider adding a dedicated replay Lambda that reads DLQ messages, validates the payload, and re-submits failed processing requests in a controlled way.

Checkpoint Reset (Reprocess All Files)

⚠️ Warning: Resetting the checkpoint causes previously processed audit files to be eligible for reprocessing. This can generate duplicate logs in Datadog. Use only for controlled replay or testing.

aws dynamodb delete-item \
  --table-name fsxn-observability-audit-checkpoint \
  --key '{"svm_name": {"S": "svm-prod-01"}, "file_key": {"S": "LATEST"}}'

Teardown

aws cloudformation delete-stack \
  --stack-name fsxn-datadog-integration \
  --region ap-northeast-1

Deleting the stack does not affect ONTAP audit logging or data on the FSx for ONTAP volume.

Cost Estimate

For a typical deployment (1 SVM, 100MB audit logs/day, 5-minute schedule):

Component	Monthly Cost
Lambda (288 invocations/day × 5s avg)	~$0.50
EventBridge Scheduler	~$0.01
DynamoDB (checkpoint)	~$0.01
Secrets Manager	~$0.40
CloudWatch Logs (30-day)	~$1.00
NAT Gateway (if VPC)	Region-dependent hourly + per-GB
Total (no VPC)	~$2/month
Total (with VPC/NAT)	~$30–50+/month depending on Region

Cost numbers are illustrative and assume a 5-minute schedule, a 5-second average runtime, and 100MB/day of audit logs. NAT Gateway pricing is regional and includes hourly charges plus per-GB data processing charges. Check the AWS Pricing Calculator for your target Region.
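
The Lambda line can be rechecked with simple arithmetic. In the sketch below, the schedule and runtime come from the estimate above, while the memory size and unit prices are my assumptions and should be replaced with current values for your Region.

```python
INVOCATIONS_PER_DAY = 24 * 60 // 5  # one run every 5 minutes
DAYS_PER_MONTH = 30
AVG_DURATION_SECONDS = 5

# Assumptions (not from the estimate above): memory size and unit prices.
MEMORY_GB = 0.5
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20

invocations = INVOCATIONS_PER_DAY * DAYS_PER_MONTH
gb_seconds = invocations * AVG_DURATION_SECONDS * MEMORY_GB
monthly_cost = (
    gb_seconds * PRICE_PER_GB_SECOND
    + invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
)
print(f"{invocations} invocations, {gb_seconds:.0f} GB-seconds, ~${monthly_cost:.2f}/month")
```

The result scales linearly with memory size and average duration, so a larger function or slower parsing moves the number proportionally.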

Important: Datadog ingest and retention costs are not included in this AWS-side estimate and can become the dominant cost driver for high-volume audit policies, especially when read auditing is enabled.

Evidence retention: This pipeline optimizes search and alerting via normalized events in Datadog. If you need audit evidence retention for compliance, design raw EVTX/XML retention separately on the audit volume or in an archive path.

Cost control: For high-volume environments, consider a tiered strategy: send security-relevant operations such as deletes, permission changes, and failed access to indexed logs; reduce, archive, or exclude noisy read events only if your audit and compliance requirements allow it.

Compare this to an always-on EC2 collector instance, plus EBS, patching labor, and agent licensing.

What's Next

In Part 3, we'll add event-driven security alerting:

ONTAP Autonomous Ransomware Protection (ARP) detection
EMS webhook → API Gateway → Lambda → Datadog
Datadog Monitor configuration for instant alerts
Incident response workflow

Datadog is the first E2E-verified integration in this pattern library; the same structure will be used for the remaining vendor integrations as they are validated.

Questions about the Datadog integration? Drop a comment below.

Previous: Part 1 — Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
Next: Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog

Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 09:14:20 +0000

TL;DR

FSx for ONTAP file access audit logs are usually consumed through EC2-based patterns — mounted audit volumes and agent-based forwarders such as Splunk Universal Forwarder. This series explores an EC2-free alternative: configure ONTAP to write audit logs to an audit volume, expose that volume through an FSx for ONTAP S3 Access Point, use EventBridge Scheduler to invoke Lambda, and ship normalized events to observability platforms such as Datadog, Splunk, New Relic, Grafana Cloud, Elastic, and OpenTelemetry-compatible backends.

What This Post Covers

This post introduces the architecture and the open-source pattern library. It does not yet cover:

Full Datadog deployment walkthrough (Part 2)
Vendor-specific field mappings
Cost/performance benchmarking
ARP + EMS webhook + Datadog alerting (Part 3)
FPolicy binary protocol internals (future post)

The Problem Nobody Talks About

You're running Amazon FSx for NetApp ONTAP. You've enabled file access auditing because compliance requires it — or because you genuinely want to know who's accessing what on your file shares.

But where do those audit logs go?

If you followed the official AWS blog post, you likely ended up with EC2-based collectors: syslog-ng for cluster/admin audit forwarding, and a mounted audit volume plus Splunk Universal Forwarder for file access audit logs. It works. But now you have:

EC2 instances to patch and maintain
NFS mounts to the audit volume
syslog-ng configuration for admin audit forwarding
Splunk Universal Forwarder configuration for file access logs
A single point of failure unless you build your own HA pattern
Vendor lock-in to Splunk's agent-based model

What if you could replace that EC2-based collector pattern with managed services — Lambda reads audit logs via S3 APIs, no NFS mount required — and ship to any observability platform?

That's the goal of this project.

Important Distinction: Two Types of ONTAP Audit

Before diving in, a clarification. FSx for ONTAP has two distinct audit mechanisms:

Cluster/admin activity audit logs — Administrative operations (CLI/API commands). These are forwarded via syslog to a log destination, as described in the AWS blog.
File access audit logs — SMB/NFS file operations (open, read, write, delete, permission changes). These are recorded based on ONTAP audit policies and SACLs/NFSv4 ACLs and stored on an audit volume inside the SVM, in EVTX or XML format depending on your ONTAP audit configuration.

In this series, "audit logs" refers to file access audit logs (type 2). The cluster admin audit forwarding via syslog is a separate concern.

The EC2-Free Alternative

I'm building an open-source pattern library that targets 9 observability vendors using Lambda, EventBridge Scheduler, and ECS Fargate — eliminating the need for self-managed EC2 instances.

This is EC2-free, not necessarily Lambda-only:

The audit-log and EMS paths are Lambda patterns (scheduled and event-driven respectively).
The FPolicy path uses ECS Fargate because ONTAP FPolicy requires a persistent TCP listener.

How Audit Logs Flow

ONTAP's file access auditing writes rotated audit log files to a configured destination path inside the SVM. In this project, that destination is an audit volume exposed through an FSx for ONTAP S3 Access Point. Lambda does not mount NFS or SMB; it reads the rotated audit log files through S3 APIs.

Because this pattern does not rely on S3 ObjectCreated events from FSx for ONTAP S3 Access Points, the audit processor is invoked on a schedule and uses checkpointing to process only newly rotated log files.

FSx for ONTAP audit configuration (`vserver audit`)
    │
    ▼ audit logs written to /audit volume
Audit volume exposed via FSx for ONTAP S3 Access Point
    │
    ▼ EventBridge Scheduler (periodic invocation)
Lambda audit processor (Python 3.12)
    │
    ▼ parse EVTX/XML → normalize → vendor API
Datadog / Splunk / New Relic / Grafana / Elastic / ...

The key shift from the EC2 pattern: Lambda does not mount the audit volume over NFS or SMB. It reads rotated ONTAP audit log files through an FSx for ONTAP S3 Access Point using S3 APIs, while the data itself remains on the FSx for ONTAP file system.

A note on FSx for ONTAP S3 Access Points: FSx for ONTAP S3 Access Points let applications use S3 APIs to access data that still resides on FSx for ONTAP volumes. They are excellent as a serverless access boundary, but they are not the same as standard S3 buckets. In particular, you should not rely on S3 ObjectCreated notifications from an FSx for ONTAP S3 Access Point. Instead, this project uses EventBridge Scheduler plus checkpointing to discover and process newly rotated audit log files.
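
The scheduled-plus-checkpoint discovery described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the `AuditObject` type, the function names, and the choice to model the checkpoint as a set of already-processed keys are all assumptions. In a real Lambda, the listing would come from `ListObjectsV2` against the access point and the checkpoint would live in durable storage.

```python
from datetime import datetime, timezone
from typing import Iterable, NamedTuple


class AuditObject(NamedTuple):
    """Hypothetical view of one listed audit log file."""
    key: str                 # object key from the listing
    last_modified: datetime  # LastModified from the listing


def select_unprocessed(listed: Iterable[AuditObject],
                       processed_keys: set[str]) -> list[AuditObject]:
    """Return files not yet recorded in the checkpoint, oldest first."""
    fresh = [obj for obj in listed if obj.key not in processed_keys]
    return sorted(fresh, key=lambda obj: (obj.last_modified, obj.key))


def advance_checkpoint(processed_keys: set[str],
                       done: Iterable[AuditObject]) -> set[str]:
    """Return a new checkpoint that also covers the files just shipped."""
    return processed_keys | {obj.key for obj in done}
```

Each scheduled invocation would list the audit prefix, call `select_unprocessed`, ship the result, and only then persist the advanced checkpoint, so a failed run is retried on the next schedule tick rather than silently skipped.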

Three Event Sources, One Architecture

FSx for ONTAP generates observability data through three distinct channels:

1. File Access Audit Logs (FSx for ONTAP S3 AP)

Depending on your ONTAP audit configuration and SACL/NFSv4 ACL settings, file operations such as create, delete, read, write, and permission changes can be recorded as ONTAP audit logs in EVTX or XML format.

Delivery: ONTAP writes rotated audit log files to an audit volume inside the SVM
Access path: Lambda reads those files through an FSx for ONTAP S3 Access Point
Trigger: EventBridge Scheduler invokes Lambda periodically; Lambda uses checkpointing to process newly rotated files
Compute: Lambda (scheduled, pay-per-invocation)
Latency: Near-real-time rather than sub-second streaming. End-to-end latency depends on your ONTAP audit log rotation interval and the EventBridge Scheduler frequency.
Use case: Compliance auditing, access pattern analysis, data governance

2. EMS (Event Management System) Webhooks

ONTAP's built-in event system can push critical alerts via HTTP webhooks. This includes:

Autonomous Ransomware Protection (ARP) alerts — ONTAP detects encryption patterns and fires an event
Quota threshold violations
Hardware failures
Replication issues
Delivery: ONTAP pushes HTTPS webhook to API Gateway
Trigger: API Gateway invocation (event-driven)
Compute: Lambda (behind API Gateway)
Use case: Security alerting, operational monitoring

3. FPolicy (File Policy) Events

FPolicy intercepts file operations at the protocol level (CIFS/NFS) and forwards them in real time via a proprietary TCP protocol. Unlike the other two sources, FPolicy requires a persistent TCP listener, which is why this path uses ECS Fargate rather than Lambda.

Delivery: ONTAP connects to Fargate task via TCP:9898
Trigger: Fargate receives FPolicy events → enqueues to SQS → Lambda processes
Compute: ECS Fargate (TCP listener) + Lambda (vendor shipping)
Use case: File activity monitoring, DLP, suspicious behavior detection

Note: The FPolicy path is the one exception to the "pure Lambda" model. ONTAP's FPolicy protocol is a proprietary binary format over TCP — it cannot be received by API Gateway or Lambda directly. Fargate handles the protocol translation, then hands off to Lambda via SQS for the vendor-specific shipping. It's still EC2-free, but not entirely serverless in the strictest sense.

The Architecture

Each event source feeds into the same delivery pattern:

┌─────────────────────────────────────────────────────────────────┐
│                    FSx for ONTAP                                │
├──────────────┬──────────────────────┬───────────────────────────┤
│ File Access  │   EMS Webhook        │   FPolicy (TCP:9898)      │
│ Audit Logs   │                      │                           │
└──────┬───────┴──────────┬───────────┴───────────┬───────────────┘
       │                  │                       │
       ▼                  ▼                       ▼
  FSx S3 AP +        API Gateway            ECS Fargate
  Scheduler               │                       │
       │                  ▼                       ▼
       ▼             Lambda (EMS)           SQS → Lambda
  Lambda (parser)         │                       │
       │                  │                       │
       └──────────────────┼───────────────────────┘
                          ▼
              Observability Vendor API
              (Datadog, Splunk, New Relic, ...)

Each integration packages the parser and vendor shipper together in a single Lambda, but the pattern is the same: normalize ONTAP events, then send them to the vendor API. Swap the integration Lambda, and you switch vendors. Vendor-specific Lambdas are optimized for quick adoption and native API behavior, while the OpenTelemetry integration provides a vendor-neutral path for organizations standardizing on OTLP.

The Gotcha That Cost Me a Day

Here's something that isn't immediately obvious from the documentation:

In my validation, a Lambda function placed in a VPC with only an S3 Gateway Endpoint could not read from the FSx for ONTAP S3 Access Point and timed out. Adding NAT Gateway egress resolved the issue.

This gotcha matters because this project intentionally reads audit logs through FSx for ONTAP S3 Access Points rather than mounting the audit volume over NFS/SMB from an EC2 instance.

Tested with:

Lambda in private subnets (ap-northeast-1)
FSx for ONTAP S3 Access Point attached to an FSx volume
S3 Gateway VPC Endpoint only
No NAT Gateway
Failure mode: timeout (no response, not AccessDenied)

Your options:

Lambda Placement	FSx for ONTAP S3 AP Access	Recommendation
Outside VPC	✅ Works	Simplest for read-only access
VPC + NAT Gateway	✅ Works	Production recommended
VPC + S3 Gateway EP only	❌ Timeout	Not recommended based on this validation

This is based on my validation environment (ap-northeast-1). Always test the network path in your own account and Region, as AWS may update this behavior.

Target Vendors

The project targets 9 observability platforms. Datadog is fully verified end-to-end (the subject of Parts 2 and 3 of this series). The remaining vendors have initial implementations that I'll be verifying and writing about in upcoming posts:

Vendor	Delivery Method	Status
Datadog	Logs API v2	✅ E2E verified
Splunk	HEC (HTTP Event Collector)	🧪 Implementation ready, verification planned
New Relic	Log API v1	🧪 Implementation ready, verification planned
Grafana Cloud	Loki Push API	🧪 Implementation ready, verification planned
Elastic	Bulk API	🧪 Implementation ready, verification planned
Dynatrace	Log Ingest API v2	🧪 Implementation ready, verification planned
Sumo Logic	HTTP Source	🧪 Implementation ready, verification planned
Honeycomb	Events Batch API	🧪 Implementation ready, verification planned
OpenTelemetry	OTLP/HTTP (vendor-neutral)	🧪 Implementation ready, verification planned

Status definitions:

✅ E2E verified — Deployed and validated with real FSx for ONTAP audit logs
🧪 Implementation ready — Code and CloudFormation available; E2E validation pending
🚧 Planned — Design exists; implementation pending

Each vendor integration is designed as a self-contained CloudFormation stack with its own Lambda, IAM roles, DLQ, and CloudWatch alarms. As I verify each one, I'll publish a dedicated article with the results and any vendor-specific gotchas I encounter.

What's in the Repo

The project is structured for easy adoption:

fsxn-observability-integrations/
├── integrations/
│   ├── datadog/           # ✅ Verified: Lambda + CFn + tests + docs
│   ├── splunk-serverless/ # 🧪 Implementation ready
│   ├── new-relic/         # 🧪 Implementation ready
│   ├── grafana/           # 🧪 Implementation ready
│   ├── elastic/           # 🧪 Implementation ready
│   ├── dynatrace/         # 🧪 Implementation ready
│   ├── sumo-logic/        # 🧪 Implementation ready
│   ├── honeycomb/         # 🧪 Implementation ready
│   └── otel-collector/    # 🧪 Implementation ready
├── shared/
│   ├── lambda-layers/     # Reusable log parser (EVTX/XML) + S3 AP reader
│   ├── templates/         # Prerequisites CFn (EventBridge Scheduler, IAM)
│   └── scripts/           # Deploy + test utilities
└── docs/                  # Bilingual (EN/JA) documentation

The shared infrastructure (EventBridge Scheduler, log parser layer, IAM roles) is vendor-agnostic and already proven through the Datadog verification. Each vendor directory follows the same structure, so once you understand one, you understand them all. Each stack is designed to include DLQ, CloudWatch alarms, and operational visibility out of the box; the Datadog stack also includes the verified CloudWatch operational dashboard used during E2E validation.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

If you've been following my FSx for ONTAP S3 Access Points series, this project builds directly on those foundations:

FSx for ONTAP S3 Access Points as a Serverless Automation Boundary — Where this journey started: using S3 APs as the bridge between ONTAP and serverless
Production-Ready FPolicy Event Pipeline Across 17 UCs — Phase 11 — The FPolicy pipeline that feeds into this observability project
Near-Real-Time Processing, ML Inference, and Observability — Phase 3 — Early architecture patterns that evolved into this multi-vendor approach

This observability integrations project is the natural next step: taking those serverless patterns and applying them specifically to audit log shipping across multiple vendors.

Design Considerations

Based on early feedback, here are key points for different audiences:

Design philosophy: The goal is not just to remove EC2. The goal is to move undifferentiated collector operations into managed services, make failures observable and replayable, and keep the integration layer small enough for customers to operate themselves.

Where this pattern matters: This pattern is especially useful for enterprise file workloads where auditability matters but EC2-based collectors add operational overhead: departmental file shares, interface directories and adjacent file shares for enterprise applications such as SAP, Oracle, or SQL Server, VDI/EUC home directories, engineering and design repositories, regulated file repositories, and ransomware investigation workflows.

Non-intrusive by design: This pipeline observes audit logs after ONTAP records them; it does not sit in the application data path. NFS/SMB access patterns are unchanged. No application code changes are required.

Telemetry ownership: This pattern treats ONTAP as the authoritative source of file activity telemetry, while AWS managed services provide the event processing and delivery layer.

Compliance note: This pattern helps centralize and analyze audit events, but retention, immutability, and regulatory controls should be designed according to your organization's compliance requirements. This is an audit log delivery pattern, not a compliance certification. For audit evidence, consider separately how long raw EVTX/XML files should be retained on the audit volume or archived outside the observability pipeline.

Audit policy dependency: The quality and volume of events depend heavily on your ONTAP audit policy, SACLs, NFSv4 ACLs, and rotation interval. Enabling read auditing on high-traffic volumes can produce significant log volume — design your audit policy carefully.

Cost variables: The biggest cost factors are audit event volume, log rotation frequency, EventBridge Scheduler frequency, Lambda runtime, NAT Gateway usage (if Lambda is in VPC), and vendor ingest pricing. Compared to the EC2 pattern, you trade always-on instance cost for pay-per-invocation compute and vendor-ingest-driven cost.

Multi-account deployment: This pattern can be deployed per workload account or centralized into a logging/security account, depending on your organization's landing zone design.

Reliability: The stack includes DLQ for failed events, CloudWatch alarms for error/throttle detection, and checkpointing to avoid reprocessing already completed audit log files. Delivery to external vendor APIs should be treated as at-least-once; DLQ messages can be replayed after resolving the root cause.

What's Coming Next

This is Part 1 of a series. In the upcoming posts, I'll deep-dive into:

Part 2: Implementing the Datadog integration end-to-end — from CloudFormation to seeing logs in the Datadog Log Explorer
Part 3: Event-driven ransomware detection using ONTAP's Autonomous Ransomware Protection (ARP) + EMS webhooks + Datadog alerting

Beyond this Datadog series, I'll be verifying and writing about each vendor integration as I go:

Replacing the EC2-based Splunk pattern with Lambda + HEC
OpenTelemetry as the vendor-neutral escape hatch
Grafana Cloud + Loki for the open-source stack
And more — each with its own E2E verification and lessons learned

The goal is to build a comprehensive, battle-tested pattern library where you can pick your vendor and deploy with confidence. Follow along as I work through each one.

Try It Yourself

The Datadog integration is fully verified and ready to deploy. You'll need:

An FSx for ONTAP file system with audit logging enabled
An FSx for ONTAP S3 Access Point attached to the audit volume

git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
cd fsxn-observability-integrations

# Deploy Datadog integration
# (FsxS3AccessPointArn = your FSx for ONTAP S3 Access Point ARN)
aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=<your-fsx-s3-ap-arn> \
    DatadogApiKeySecretArn=<your-secret-arn> \
  --capabilities CAPABILITY_NAMED_IAM

This stack deploys the scheduled Lambda processor, IAM permissions for reading from the FSx for ONTAP S3 Access Point, checkpoint storage, DLQ, CloudWatch alarms, and the Datadog shipping logic. The processor keeps track of already-processed audit log files so each scheduled invocation only ships newly rotated logs.

After deployment, you should see:

EventBridge Scheduler invoking the Lambda processor on your configured interval
Checkpoint storage updated after processing rotated audit logs
Parsed FSx for ONTAP audit events arriving in Datadog Logs (source:fsxn)
CloudWatch alarms and DLQ ready for operational visibility

Full setup guide in the repo's Prerequisites doc.

If you are starting from the repository today, begin with:

Choose Your Path
Recommended first 30 minutes
Try with Sample Data
PoC Success Criteria
Production Readiness Levels

Have questions or want to see a specific vendor integration verified next? Drop a comment below — it'll help me prioritize the series.

Next up: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way

Production-Ready FPolicy Event Pipeline Across 17 UCs — FSx for ONTAP S3 Access Points, Phase 11

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Fri, 15 May 2026 07:52:56 +0000

TL;DR

Phase 11 is the production-integration phase: the Phase 10 FPolicy event-ingestion pipeline is now connected to all 17 use-case (UC) templates, with operational guardrails for persistence, deduplication, observability, and future migration to native S3 Access Point (S3AP) notifications.

This is Phase 11 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10, Phase 11 delivers:

TriggerMode across all 17 UCs: Every UC template now supports POLLING / EVENT_DRIVEN / HYBRID switching via a single CloudFormation parameter
UC-specific EventBridge dispatch rules: File path prefix + extension filters route FPolicy events to the correct UC's Step Functions
Protobuf format evaluation: Real-world test on ONTAP 9.17.1P6 — confirmed format switching works, discovered TCP framing difference
Cross-Account Observability: OAM Sink + Dashboard + SNS + X-Ray deployed and verified
Persistent Store: Configured on ONTAP via REST API — closing the tested Fargate restart event-loss window at the configuration layer
Idempotency Store: DynamoDB table + checker Lambda for HYBRID mode deduplication
FR-2 migration path: Three-phase design for transitioning to S3AP native notifications when available (FR-2 refers to the feature-request track for native bucket-notification-style support on FSx ONTAP S3 Access Points)
Production adoption guidance: Rollout/rollback, governance, security guardrails, event payload sensitivity, file-readiness patterns, operational alarms, and Persistent Store sizing

The 17 UCs span compliance, financial document processing (IDP), manufacturing analytics, healthcare imaging, media/VFX, genomics, logistics, retail, autonomous driving, semiconductor EDA, energy/seismic, education/research, defense/satellite, government archives, smart-city geospatial, insurance claims, and construction BIM.

In short: Phase 10 built the shared event-ingestion pipeline. Phase 11 wires it into every UC, adds the operational infrastructure for production (Persistent Store, Idempotency, Observability), and documents the forward migration path. Tests: 435 passed, 3 skipped.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

1. TriggerMode: Three-Mode Integration Across All 17 UCs

The problem

Phase 10 introduced TriggerMode as a reference implementation in UC1 (legal-compliance). The remaining 16 UCs still only supported polling. Operators needed a uniform way to switch any UC between polling, event-driven, and hybrid modes without template surgery.

The solution

Every UC template now includes:

Parameters:
  TriggerMode:
    Type: String
    Default: "POLLING"
    AllowedValues: ["POLLING", "EVENT_DRIVEN", "HYBRID"]

  FPolicyEventBusName:
    Type: String
    Default: "fsxn-fpolicy-events"

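Conditions:
  # Base conditions derived from the TriggerMode parameter
  IsPolling: !Equals [!Ref TriggerMode, "POLLING"]
  IsEventDriven: !Equals [!Ref TriggerMode, "EVENT_DRIVEN"]
  IsHybrid: !Equals [!Ref TriggerMode, "HYBRID"]
  # Composite conditions referenced by resources
  IsPollingOrHybrid:
    !Or [!Condition IsPolling, !Condition IsHybrid]
  IsEventDrivenOrHybrid:
    !Or [!Condition IsEventDriven, !Condition IsHybrid]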

The EventBridge Scheduler and its IAM role use Condition: IsPollingOrHybrid. The FPolicy EventBridge Rule and its IAM role use Condition: IsEventDrivenOrHybrid. Default POLLING means zero impact on existing deployments — the parameter is purely additive.
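
To make the effect of the two conditions concrete, here is a small Python sketch of which resource groups exist in each mode. The function and dictionary key names are illustrative assumptions. Only the mode names and the two composite conditions come from the templates.

```python
from enum import Enum


class TriggerMode(str, Enum):
    POLLING = "POLLING"
    EVENT_DRIVEN = "EVENT_DRIVEN"
    HYBRID = "HYBRID"


def created_resources(mode: TriggerMode) -> dict[str, bool]:
    """Mirror the IsPollingOrHybrid / IsEventDrivenOrHybrid conditions."""
    is_polling_or_hybrid = mode in (TriggerMode.POLLING, TriggerMode.HYBRID)
    is_event_driven_or_hybrid = mode in (TriggerMode.EVENT_DRIVEN, TriggerMode.HYBRID)
    return {
        "scheduler_and_role": is_polling_or_hybrid,          # EventBridge Scheduler + IAM role
        "fpolicy_rule_and_role": is_event_driven_or_hybrid,  # FPolicy EventBridge Rule + IAM role
    }


for mode in TriggerMode:
    print(mode.value, created_resources(mode))
```

With the default `POLLING`, only the scheduler group is created, which is why existing deployments are unaffected.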

Validation

CloudFormation validate-template: 17/17 PASS
cfn_yaml parse: 17/17 PASS
SchedulerRole + Schedule condition alignment: 14/14 verified
Test suite: 435 passed, 3 skipped, 0 failed

2. UC-Specific EventBridge Dispatch Rules

Architecture

EventBridge Custom Bus (fsxn-fpolicy-events)
  │
  ├── UC1 Rule: prefix=/legal/ OR suffix=.pdf,.docx,.xlsx
  │     → ComplianceStateMachine
  │
  ├── UC2 Rule: prefix=/finance/ OR suffix=.pdf,.tiff,.png,.jpg
  │     → IdpStateMachine
  │
  ├── UC3 Rule: prefix=/manufacturing/ OR suffix=.csv,.json,.parquet
  │     → ManufacturingStateMachine
  │
  │   ... (13 more UCs)
  │
  └── UC17 Rule: prefix=/smartcity/ OR suffix=.geojson,.shp,.tiff,.las
        → DiscoveryFunction (Lambda)

Note: Multiple rules can match the same event; EventBridge fan-out is expected behavior. See the Live E2E verification below.

As the number of UCs grows, routing should be treated as configuration data and used to generate both EventBridge rules and routing tests to prevent drift. The routing definitions documented in docs/guides/fpolicy-uc-routing.md are treated as the source of truth, and scripts/add_eventbridge_rules.py keeps generated EventBridge rules aligned with that routing model.

Each UC's EventBridge Rule filters on:

detail.file_path: prefix (directory) and suffix (extension) matchers
detail.operation_type: create, write, rename, delete (UC-specific subset)

EventBridge evaluates prefix and suffix within the same array as OR — a file matching any prefix or any suffix triggers the rule. The relationship between operation_type and file_path is AND — both must match.
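
The OR-within-a-field, AND-across-fields behavior is easy to emulate locally before deploying rules. The sketch below is a simplified matcher written for illustration. It is not EventBridge itself and supports only exact, `prefix`, and `suffix` matchers. The three patterns follow the prefixes and suffixes shown in the diagram above, while the `operation_type` lists are placeholder assumptions because the real per-UC subsets live in the routing guide.

```python
from typing import Any

# Patterns in the shape EventBridge uses for content filtering.
# operation_type values are hypothetical placeholders.
RULES: dict[str, dict[str, Any]] = {
    "legal-compliance": {"detail": {
        "file_path": [{"prefix": "/legal/"}, {"suffix": ".pdf"},
                      {"suffix": ".docx"}, {"suffix": ".xlsx"}],
        "operation_type": ["create", "write"]}},
    "financial-idp": {"detail": {
        "file_path": [{"prefix": "/finance/"}, {"suffix": ".pdf"},
                      {"suffix": ".tiff"}, {"suffix": ".png"}, {"suffix": ".jpg"}],
        "operation_type": ["create", "write"]}},
    "manufacturing": {"detail": {
        "file_path": [{"prefix": "/manufacturing/"}, {"suffix": ".csv"},
                      {"suffix": ".json"}, {"suffix": ".parquet"}],
        "operation_type": ["create", "write"]}},
}


def _matches_value(matcher: Any, value: str) -> bool:
    """Evaluate one matcher from a pattern array against a string value."""
    if isinstance(matcher, dict):
        if "prefix" in matcher:
            return value.startswith(matcher["prefix"])
        if "suffix" in matcher:
            return value.endswith(matcher["suffix"])
        return False
    return matcher == value


def rule_matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Every field must match (AND); any matcher in a field's array may match (OR)."""
    detail = event.get("detail", {})
    for field, matchers in pattern["detail"].items():
        if field not in detail:
            return False
        if not any(_matches_value(m, detail[field]) for m in matchers):
            return False
    return True


def matched_rules(event: dict[str, Any]) -> list[str]:
    return [name for name, pattern in RULES.items() if rule_matches(pattern, event)]
```

Feeding a `create` event for `/legal/audit/report.pdf` through `matched_rules` returns both `legal-compliance` and `financial-idp`, because the `.pdf` suffix alone satisfies the second rule's `file_path` array.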

Fan-out behavior

When multiple rules match the same event, EventBridge delivers to all matching targets. This is by design — a .json file in /manufacturing/sensors/ could trigger both UC3 (manufacturing) and UC11 (autonomous-driving) if both monitor .json files. Prefix design should minimize unintended fan-out.

Live E2E verification

We verified dispatch routing by sending test events directly to the custom bus via aws events put-events:

Test Event	file_path	Matched Rules	Result
`verify-legal-01`	`/legal/audit/report.pdf`	legal-compliance ✅ + financial-idp ✅	Fan-out: 2 rules matched
`verify-finance-01`	`/finance/contracts/deal.tiff`	financial-idp ✅	1 rule matched
`verify-mfg-01`	`/manufacturing/iot/sensor-001.json`	manufacturing ✅	1 rule matched
`verify-nomatch-01`	`/random/path/file.xyz`	None	Correctly dropped

Key finding: /legal/audit/report.pdf matched two rules — the legal-compliance rule (prefix /legal/) AND the financial-idp rule (suffix .pdf). This confirms the OR evaluation within the file_path array and demonstrates fan-out behavior in practice.

Recommendation: The main routing lesson is simple: use path prefix as the primary ownership boundary, and treat suffix filters as supplementary hints. Generic suffixes such as .pdf, .json, and .csv are useful for discovery, but they can create intentional or accidental fan-out across UCs. For strict single-UC routing, rely on prefix alone.

Routing documentation

Full routing table with all 17 UCs, their prefixes, suffixes, and target operations is documented in docs/guides/fpolicy-uc-routing.md.

3. Protobuf Format Evaluation

Background

ONTAP 9.15.1+ supports protobuf as an alternative to XML for FPolicy notifications. The theoretical benefits are significant: ~35% message size reduction and faster parsing (with C extensions).

Implementation

Phase 11 delivers a complete protobuf implementation:

Wire-format parser (shared/fpolicy-server/protobuf_parser.py): Pure Python decoder with zero external dependencies. No protobuf package installation required.
Proto schema (shared/fpolicy-server/proto/fpolicy_notification.proto): 14-field FileOperationNotification message definition.
Auto-detection: is_protobuf_format() distinguishes XML from protobuf by inspecting the first byte.
FPolicy Server integration: FPOLICY_FORMAT environment variable switches between xml and protobuf.
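
For readers unfamiliar with what a dependency-free wire-format decoder involves, here is a minimal sketch of generic protobuf field decoding in pure Python. It is not the repository's `protobuf_parser.py` and knows nothing about the `FileOperationNotification` schema. It only splits a buffer into `(field_number, wire_type, value)` tuples according to the standard protobuf encoding, and the function names are my own.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def decode_fields(buf: bytes) -> list[tuple[int, int, object]]:
    """Split a protobuf message into (field_number, wire_type, value) tuples."""
    fields: list[tuple[int, int, object]] = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                       # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                     # length-delimited (string, bytes, nested)
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited field")
            value = buf[pos:pos + length]
            pos += length
        elif wire_type in (1, 5):                # fixed 64-bit / fixed 32-bit
            size = 8 if wire_type == 1 else 4
            if pos + size > len(buf):
                raise ValueError("truncated fixed-width field")
            value = buf[pos:pos + size]
            pos += size
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields
```

A schema-aware parser would then map field numbers to names and decode length-delimited values as UTF-8 strings or nested messages.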

Benchmark results (1000 events)

Metric	XML (regex)	protobuf (pure Python)
Message size (avg)	220 bytes	144 bytes
Size reduction	—	34.6%
Parse time (1000 events)	0.15 ms	0.32 ms
Parse speedup	1.0x (baseline)	0.47x

The pure Python protobuf parser is slower than Python's C-optimized regex engine. The real benefit is message size reduction — 34.6% fewer bytes through SQS means lower costs and bandwidth. With the C-compiled protobuf library, parsing speed is expected to improve significantly, but this should be re-benchmarked after the protobuf TCP framing layer is implemented.

Real-world test: TCP framing discovery

We switched the ONTAP FPolicy engine format to protobuf via REST API:

PATCH /api/protocols/fpolicy/{svm}/engines/fpolicy_aws_engine
Body: {"format": "protobuf"}

Result: ONTAP immediately sent protobuf NEGO_REQ messages. However, the FPolicy server logged:

[WARNING] Invalid message length: 53554736

Analysis: The value 53554736 (0x03312E30) is protobuf field data being misinterpreted as the 4-byte frame length. This reveals that protobuf mode uses different TCP framing than XML mode:

XML mode: a double-quote character + 4-byte big-endian length + a second double-quote character + payload
protobuf mode: Different framing (possibly raw protobuf without the quote-delimited wrapper)
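
The misread can be reproduced offline. The sketch below shows a reader for the quote-delimited XML framing and then unpacks the four bytes that correspond to the logged value. The function name and the `max_len` sanity limit are assumptions for illustration, not the FPolicy server's actual code.

```python
import struct


def read_xml_frame(buf: bytes, max_len: int = 1_000_000) -> tuple[bytes, bytes]:
    """Parse one frame laid out as '"' + 4-byte big-endian length + '"' + payload.

    Returns (payload, remaining_bytes). max_len is a hypothetical sanity limit.
    """
    if len(buf) < 6 or buf[0:1] != b'"' or buf[5:6] != b'"':
        raise ValueError("missing quote-delimited length header")
    (length,) = struct.unpack(">I", buf[1:5])
    if length > max_len:
        raise ValueError(f"Invalid message length: {length}")
    if len(buf) < 6 + length:
        raise ValueError("incomplete frame")
    return buf[6:6 + length], buf[6 + length:]


# The logged value 53554736 is what the bytes 03 31 2E 30 look like
# when read as a big-endian unsigned 32-bit integer.
suspect = bytes.fromhex("03312e30")
misread_length = struct.unpack(">I", suspect)[0]
print(misread_length, suspect[1:])
```

The last three bytes are the ASCII characters `1.0`, preceded by a byte with value 3. That would be consistent with a short length-prefixed string inside a protobuf message, though this reading is a guess until the framing is confirmed.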

Conclusion: The protobuf field-level parser is validated by the Phase 11 unit tests, and the size-reduction benefit is real. However, the live ONTAP test showed that protobuf mode does not use the same TCP framing path as XML mode. Per NetApp documentation, when the engine format is set to protobuf, "notification messages are encoded in binary form using Google Protobuf" and the FPolicy server must support protobuf deserialization. Phase 12 will focus on confirming the protobuf wire framing with NetApp and adapting the transport reader accordingly.

4. Cross-Account Observability

Deployed resources

Resource	Purpose
OAM Sink	Receives metrics/traces from workload accounts
CloudWatch Dashboard	Lambda Invocations/Errors, Step Functions Executions, Processing Latency
SNS Topic (KMS-encrypted)	Aggregated alerts from all accounts
X-Ray Group	Cross-account trace filtering
IAM MetricDeliveryRole	Workload accounts assume this to push metrics
IAM TroubleshootingRole	Read-only access for cross-account debugging
Log Group	Aggregated log destination

Single-account limitation

OAM Links cannot be created within the same account (AWS design constraint). The deployment was verified as a single-account simulation per the requirements. A workload-account-oam-link.yaml template is provided for multi-account environments.

Template fix: LogDestination

During deployment, AWS::Logs::Destination failed because it requires a Kinesis Data Stream as target, not a Log Group. This clarified that a CloudWatch Logs destination is not a generic alias for another log group; it is a cross-account subscription destination backed by a supported streaming target such as Kinesis Data Streams or Kinesis Data Firehose. The template was fixed to use Log Group + IAM Role directly, with Kinesis Firehose as an optional future addition for high-volume cross-account log aggregation.

5. Persistent Store: Closing the Restart Event-Loss Window

The problem

With is-mandatory=false, ONTAP drops FPolicy notifications when no server is connected. During Fargate task restarts (~30 seconds), events are lost.

The solution

ONTAP 9.14.1+ Persistent Store queues file access events on the SVM during server disconnection for asynchronous non-mandatory policies. When the external server reconnects, queued events can be replayed. Note that synchronous policies and asynchronous mandatory policies are not supported — Persistent Store is specifically designed for the asynchronous non-mandatory configuration used in this pattern.

Configuration (via Lambda → ONTAP REST API)

Step 1: Create volume (1GB, unix security style)
  POST /api/storage/volumes → 202 Accepted (3s)

Step 2: Create Persistent Store
  POST /api/protocols/fpolicy/{svm}/persistent-stores → 201 Created
  Body: {"name": "fpolicy_aws_store", "volume": "fpolicy_persistent_store"}

Step 3: Attach to policy (disable → attach → re-enable)
  PATCH /api/protocols/fpolicy/{svm}/policies/fpolicy_aws
  Body: {"persistent_store": "fpolicy_aws_store"}

Verification

GET /api/protocols/fpolicy/{svm}/policies/fpolicy_aws?fields=persistent_store,enabled
→ {"enabled": true, "persistent_store": "fpolicy_aws_store"}

ECS task stop → restart test confirmed ONTAP reconnects to the new task within seconds. With Persistent Store configured, events generated during the tested ~30-second Fargate restart window are expected to be queued by ONTAP and replayed after reconnection. Phase 12 will validate this with real NFS/SMB file operations end to end, including verification of replay ordering and completeness under sustained write load.

IP Updater Lambda extension

The IP Updater Lambda was extended with a generic ONTAP API access capability (action: ontap_api). This enables remote ONTAP configuration without a bastion host:

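# AWS CLI v2 expects base64 for --payload by default; the flag below lets it accept raw JSON
aws lambda invoke --function-name fsxn-fpolicy-ip-updater \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action": "ontap_api", "method": "GET", "path": "/api/protocols/fpolicy/{svm}/persistent-stores"}' \
  /tmp/result.json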

6. HYBRID Mode Idempotency

The problem

In HYBRID mode, both the EventBridge Scheduler (polling) and the FPolicy EventBridge Rule (event-driven) can trigger processing for the same file. Without deduplication, the same file gets processed twice.

The solution

A DynamoDB-based Idempotency Store with TTL:

Table: fsxn-s3ap-idempotency-store
  pk: "{uc_name}#{file_path}"
  sk: "{operation_type}#{timestamp_bucket}"
  ttl: current_time + 7 days

The timestamp_bucket rounds timestamps to 5-minute windows. Two events for the same file within the same 5-minute window are considered duplicates.
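
As a rough illustration of how such a key could be derived, the sketch below builds the `pk`, `sk`, and `ttl` attributes for one event. The function names are hypothetical, and it assumes that "rounding" means flooring to the start of the window in UTC, formatted to the minute. The checker Lambda in the repository may differ in those details.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timestamp_bucket(ts: datetime, window_minutes: int = 5) -> str:
    """Floor a timezone-aware timestamp to the start of its window (UTC)."""
    window_seconds = window_minutes * 60
    epoch = int(ts.timestamp())
    floored = epoch - (epoch % window_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")


def idempotency_item(uc_name: str, file_path: str, operation_type: str,
                     ts: datetime, window_minutes: int = 5, ttl_days: int = 7,
                     now: Optional[datetime] = None) -> dict:
    """Build the item attributes described above for one event."""
    now = now or datetime.now(timezone.utc)
    return {
        "pk": f"{uc_name}#{file_path}",
        "sk": f"{operation_type}#{timestamp_bucket(ts, window_minutes)}",
        "ttl": int((now + timedelta(days=ttl_days)).timestamp()),  # epoch seconds
    }
```

Two events for the same UC, path, and operation that fall in the same window produce identical `pk` and `sk` values, which is what lets a conditional write reject the second one.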

Step Functions integration

The Idempotency Checker runs as the first step in any UC's Step Functions workflow:

{
  "StartAt": "IdempotencyCheck",
  "States": {
    "IdempotencyCheck": {
      "Type": "Task",
      "Resource": "${IdempotencyCheckerFunction.Arn}",
      "Next": "CheckDuplicate"
    },
    "CheckDuplicate": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.idempotency.is_duplicate",
        "BooleanEquals": true,
        "Next": "SkipDuplicate"
      }],
      "Default": "ProcessEvent"
    }
  }
}

Race conditions are handled via DynamoDB conditional writes (attribute_not_exists(pk)). If two executions race, only one succeeds — the other gets ConditionalCheckFailedException and skips.
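
The claim-or-skip logic can be modeled without AWS at all. The sketch below uses an in-memory stand-in for the table, with a lock playing the role of DynamoDB's atomic conditional write. The class and function names are hypothetical. On the real table, the equivalent operation is a `PutItem` with the condition `attribute_not_exists(pk)`, and the output shape matches the `$.idempotency.is_duplicate` path used by the Choice state above.

```python
import threading


class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class InMemoryIdempotencyStore:
    """Test double: put succeeds only if no item with the same (pk, sk) exists."""

    def __init__(self) -> None:
        self._items: dict[tuple[str, str], dict] = {}
        self._lock = threading.Lock()  # models the atomicity of a conditional write

    def put_if_absent(self, pk: str, sk: str) -> None:
        with self._lock:
            if (pk, sk) in self._items:
                raise ConditionalCheckFailed(f"{pk} / {sk}")
            self._items[(pk, sk)] = {"pk": pk, "sk": sk}


def check_idempotency(store: InMemoryIdempotencyStore, pk: str, sk: str) -> dict:
    """Return the fragment the Choice state inspects."""
    try:
        store.put_if_absent(pk, sk)
    except ConditionalCheckFailed:
        return {"idempotency": {"is_duplicate": True}}
    return {"idempotency": {"is_duplicate": False}}
```

Whichever execution writes first proceeds to processing; the other sees `is_duplicate: true` and takes the skip branch.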

Tuning considerations

The 5-minute bucket is intentionally conservative for HYBRID-mode deduplication. UCs that require multiple legitimate updates to the same file within a short interval can tune the bucket size via the DEDUP_WINDOW_MINUTES environment variable, or include an additional event attribute (such as file size or ONTAP event sequence information) in the sort key to distinguish genuinely distinct events from duplicates.

Live E2E verification

We verified the deduplication mechanism directly against the deployed DynamoDB table:

1st PutItem (pk=legal-compliance#/legal/audit/report.pdf, sk=create#2026-05-15T10:35):
  → Success (new record created)

2nd PutItem (same key, condition: attribute_not_exists(pk)):
  → ConditionalCheckFailedException ✅ (duplicate detected)

This proves the table-level duplicate rejection mechanism used by HYBRID mode. When the Idempotency Checker is the first Step Functions task, the second execution can be rejected before downstream processing starts.

7. FR-2 Migration Path

If/when native S3AP notifications become available through the FR-2 track, the migration is designed to be parameter-change-only for UCs that do not depend on FPolicy-only fields:

Phase	TriggerMode	FPolicy	S3AP Notifications
A (parallel)	HYBRID	Active	Active
B (cutover)	EVENT_DRIVEN	Disabled	Active
C (cleanup)	EVENT_DRIVEN	Removed	Active

Schema compatibility challenges

FPolicy field	S3AP equivalent	Gap
`user_name`	N/A	S3AP may not include NTFS user info
`operation_type: rename`	N/A	S3 events don't have rename
`protocol`	Always "s3"	Loss of NFS/SMB distinction

UCs that depend on user_name (permission-aware scenarios) may need to maintain FPolicy even after FR-2 GA.

Full migration path documented in docs/guides/fr2-migration-path.md.

8. Test Results

Category	Count	Result
Existing tests (Phase 1-10)	391	All PASS ✅
protobuf parser tests	18	All PASS ✅
Idempotency checker tests	10	All PASS ✅
FPolicy engine tests	16	All PASS ✅
Skipped (handler refactored)	3	Expected ⏭️
Total	435 + 3 skipped	All PASS

CloudFormation validation

Method	Result
cfn_yaml parse (all 17 UCs)	17/17 PASS
`aws cloudformation validate-template`	17/17 PASS
shared templates (observability, idempotency, OAM link)	4/4 PASS

9. Deployed AWS Resources

Stack	Resources	Status
`fsxn-shared-observability`	OAM Sink, Dashboard, SNS, X-Ray Group, IAM Roles	✅
`fsxn-idempotency-store`	DynamoDB (PAY_PER_REQUEST, TTL, PITR)	✅
`fsxn-fpolicy-routing`	EventBridge Bus, Bridge Lambda, Idempotency Table	✅
`fsxn-fp-srv`	ECS Fargate Cluster, FPolicy Server Service	✅
`fsxn-fpolicy-ingestion`	SQS Queue, DLQ, IP Updater Lambda	✅

ONTAP resources

Resource	Status
FPolicy policy `fpolicy_aws`	Enabled, persistent_store attached
Persistent Store `fpolicy_aws_store`	Active (1GB volume)
Engine format	XML (protobuf tested, reverted due to framing)

Post-deployment health check (2026-05-15)

Component	Status	Detail
FPolicy Server (ECS Fargate)	✅ Running	ONTAP connecting every 10s
SQS Ingestion Queue	✅ Empty (0/0/0)	No stuck messages
FPolicy Policy	✅ Enabled	persistent_store + engine attached
DynamoDB Idempotency	✅ Active	TTL enabled, PITR on
SNS Alerts	⚠️ PendingConfirmation	Email subscription awaiting confirmation
EventBridge Custom Bus	✅ Operational	Dispatch routing verified via put-events

10. Deployment Learnings

Issue	Root Cause	Fix
`validate-template` fails for autonomous-driving	Template exceeds 51,200 byte inline limit	Use S3 URL for validation; added CI job
`AWS::Logs::Destination` creation fails	Requires Kinesis target, not Log Group	Removed LogDestination, use Log Group directly
OAM Link same-account error	AWS design: links only work cross-account	Documented; provided workload-account template
SchedulerRole created in EVENT_DRIVEN mode	Missing Condition on SchedulerRole	Added `Condition: IsPollingOrHybrid` to 14 templates
protobuf messages rejected as invalid length	Different TCP framing in protobuf mode	Documented; XML mode maintained for stability
`test_fpolicy_engine` import errors	Handler refactored to IP Updater	Added missing exports; skipped 3 legacy tests
Persistent Store `autoflush_enabled` rejected	Parameter name not supported in REST API	Removed; ONTAP uses defaults
Policy modification while enabled	ONTAP rejects PATCH on enabled policy	Disable → modify → re-enable sequence
`.pdf` suffix causes multi-UC fan-out	EventBridge OR evaluation within array	Document: use prefix as primary filter
EventBridge → CloudWatch Logs delivery fails	Missing resource policy on log group	Added `logs:PutLogEvents` permission for events.amazonaws.com

11. Production Adoption Guidance

Recommended rollout model

TriggerMode is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption. A detailed guide with rollback criteria, UC classification, and CloudFormation behavior matrix is available in docs/guides/triggermode-rollout.md. The summary:

Start with POLLING for all UCs to preserve existing behavior.
Enable the shared FPolicy ingestion pipeline and validate EventBridge routing with put-events.
Move one low-risk UC to HYBRID and observe duplicate rate, Step Functions success rate, and SQS backlog.
Move latency-sensitive UCs to EVENT_DRIVEN after routing and idempotency validation.
Keep compliance-sensitive UCs in HYBRID until Persistent Store replay is validated end to end.

Rollback: At any stage, reverting TriggerMode to the previous value via CloudFormation stack update restores the CloudFormation-managed resources for the prior mode. Operators should wait for stack update completion and verify scheduler/rule state, SQS backlog, and Step Functions executions before declaring rollback complete. The sequence is always EVENT_DRIVEN → HYBRID → POLLING (never skip HYBRID when rolling back from EVENT_DRIVEN in production).
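The rollback ordering above can be expressed as a small pre-deployment check. This is a minimal sketch rather than code from the repository: the names `ROLLOUT_ORDER`, `next_rollback_mode`, and `is_valid_rollback` are hypothetical, and it only encodes the "never skip HYBRID" rule for rollbacks.

```python
from typing import Optional

# Rollout order described in the text; rollback walks it in reverse.
ROLLOUT_ORDER = ["POLLING", "HYBRID", "EVENT_DRIVEN"]


def next_rollback_mode(current: str) -> Optional[str]:
    """Return the mode one step back, or None when already at POLLING."""
    idx = ROLLOUT_ORDER.index(current)  # raises ValueError for unknown modes
    return ROLLOUT_ORDER[idx - 1] if idx > 0 else None


def is_valid_rollback(current: str, target: str) -> bool:
    """A rollback is valid only if it moves exactly one step back."""
    return next_rollback_mode(current) == target
```

A pipeline could run a check like this before submitting the stack update, so that a direct EVENT_DRIVEN to POLLING change is rejected before CloudFormation is called.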

Security guardrails for ONTAP API automation

The ontap_api action is intended for controlled operations automation, not as an unrestricted ONTAP proxy. The handler implementation (shared/lambdas/fpolicy_engine/handler.py) enforces:

Path allowlist: Only /api/protocols/fpolicy/, /api/storage/volumes, /api/storage/aggregates, and /api/cluster/jobs/ are permitted. All other paths return HTTP 403.
DELETE method restriction: Disabled by default. Requires explicit ONTAP_API_ALLOW_DELETE=true environment variable to enable.
Log redaction: Only method and path are logged — request bodies containing credentials are never written to CloudWatch Logs.
Structured audit log: Each invocation emits a structured log line with method, path, status, correlation_id, and request timestamp. Caller identity can be correlated via CloudTrail Lambda Invoke events without logging sensitive request/response bodies.
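The first two guardrails can be sketched as a single authorization function. This is an illustration under stated assumptions, not the handler's actual code: the function name `authorize` is hypothetical, prefix matching is assumed for the allowlist, and returning 403 for a blocked DELETE is also an assumption (the text specifies 403 only for disallowed paths).

```python
import os

# Allowlisted path prefixes taken from the list above.
ALLOWED_PREFIXES = (
    "/api/protocols/fpolicy/",
    "/api/storage/volumes",
    "/api/storage/aggregates",
    "/api/cluster/jobs/",
)


def authorize(method: str, path: str, env=os.environ) -> int:
    """Return 200 if the request may proceed, otherwise 403."""
    if not path.startswith(ALLOWED_PREFIXES):
        return 403
    # DELETE stays disabled unless the opt-in variable is exactly "true".
    if method.upper() == "DELETE" and env.get("ONTAP_API_ALLOW_DELETE") != "true":
        return 403
    return 200
```

A production check would likely also normalize the path first, so that sequences such as `..` cannot escape an allowlisted prefix.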

Production deployments should additionally restrict Lambda invoke permissions to deployment automation roles only, and store ONTAP credentials in Secrets Manager with rotation planning.

Pass correlation_id in the event payload to trace ONTAP API operations across deployment automation, Lambda logs, and operational runbooks.

MSP and multi-customer naming

For MSP or multi-customer deployments, parameterize shared resource names with CustomerId, EnvironmentName, and Region to avoid cross-tenant naming collisions — for example: {customer}-{env}-fsxn-fpolicy-events and {customer}-{env}-s3ap-idempotency-store. Full naming guidance with CloudFormation examples is in docs/guides/triggermode-rollout.md.

TriggerMode governance

For enterprise rollout, treat TriggerMode as a governed operational control. Changes from POLLING to HYBRID or EVENT_DRIVEN should be reviewed with routing test results, idempotency validation, alarm readiness, and rollback owner assignment. Track TriggerMode changes through your change management process (Change Manager, GitOps PR, or deployment pipeline logs) — not just CloudFormation stack events.

Event payload sensitivity

For public-sector or regulated workloads, file paths and FPolicy metadata should be treated as potentially sensitive data. In regulated environments, metadata is data — file paths, user names, and protocol context should be classified before being forwarded to cross-account observability systems. Production deployments should define which event fields are logged, masked, hashed, or excluded before forwarding to cross-account observability or long-term audit storage. A data classification guide is available in docs/guides/data-classification.md.
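One way to make the "logged, masked, hashed, or excluded" decision explicit is a per-field policy applied before forwarding. The sketch below is hypothetical: the field names, the `FIELD_POLICY` table, and the `apply_policy` function are assumptions for illustration, not part of the repository or the classification guide.

```python
import hashlib

# Hypothetical per-field policy: "log", "mask", "hash", or "exclude".
FIELD_POLICY = {
    "operation": "log",
    "file_path": "hash",
    "user_name": "mask",
    "client_ip": "exclude",
}


def apply_policy(event: dict, policy: dict = FIELD_POLICY) -> dict:
    """Return a copy of the event that is safe to forward."""
    out = {}
    for key, value in event.items():
        action = policy.get(key, "exclude")  # unclassified fields are dropped
        if action == "log":
            out[key] = value
        elif action == "mask":
            out[key] = "***"
        elif action == "hash":
            out[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        # "exclude": omit the field entirely
    return out
```

Dropping unclassified fields by default keeps new event attributes from leaking before someone has classified them. An unsalted SHA-256 of a predictable path can be guessed, so a keyed hash (for example HMAC) may be more appropriate in regulated environments.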

For regulated workloads, duplicate suppression should not mean audit disappearance; skipped duplicate events should still be recorded with correlation IDs and deduplication decisions. See docs/guides/compliance-audit-ledger.md for the audit ledger design.

File readiness for event-driven pipelines

For large files, an FPolicy create or write event may arrive before the file write is complete — particularly with NFSv3 which lacks close semantics. UCs that process large analytics, imaging, geospatial, or EDA files should combine event-driven triggering with a readiness strategy:

Rename-based commit: Write to a temporary path, rename to final path on completion. Process only rename events.
Marker file: Write a .done or _SUCCESS marker after the primary file is complete. Trigger on marker creation.
Size-stability check: Poll file size at N-second intervals; start processing when size is stable across two consecutive checks.
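The size-stability check can be sketched as follows. This is a minimal illustration, not the FPolicy server's implementation: `wait_until_stable` and its parameters are hypothetical names, and treating a zero-byte file as "not yet ready" is an added assumption.

```python
import time
from typing import Callable


def wait_until_stable(
    get_size: Callable[[], int],
    interval_sec: float = 5.0,
    max_checks: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once two consecutive polls report the same non-zero size.

    Returns False if the size never stabilizes within max_checks polls.
    """
    previous = get_size()
    for _ in range(max_checks):
        sleep(interval_sec)
        current = get_size()
        if current == previous and current > 0:
            return True
        previous = current
    return False
```

In practice `get_size` could be something like `lambda: os.stat(path).st_size` against the NFS mount. Injecting `sleep` keeps the function testable without real delays.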

The existing WRITE_COMPLETE_DELAY_SEC (default 5s) in the FPolicy server provides a basic delay, but is insufficient for multi-GB files. A fixed delay should be treated as a fallback, not a correctness guarantee. The new UC checklist (docs/guides/new-uc-checklist.md) includes file readiness as a required design decision for large-file UCs.

Recommended operational alarms

A ready-to-deploy CloudFormation template (shared/cfn/recommended-alarms.yaml) defines the following alarms. Severity labels are examples and should be mapped to each organization's incident classification model.

Metric	Condition	Severity
SQS `ApproximateAgeOfOldestMessage`	> 300 seconds for 5 minutes	SEV2
SQS DLQ `ApproximateNumberOfMessagesVisible`	> 0	SEV2
Step Functions `ExecutionsFailed`	> 0 for critical production UCs	SEV2
ECS `RunningTaskCount` < `DesiredTaskCount`	for > 60 seconds	SEV1
DynamoDB `ThrottledRequests`	> 0	SEV3

The ECS desired-vs-running alarm may require Container Insights, metric math, or a custom service health metric depending on how ECS service metrics are emitted in the target account. For high-volume batch UCs, failure-rate-based alarms may be less noisy than absolute failure-count alarms.

Deploy as a standalone monitoring stack or integrate into each UC template's EnableCloudWatchAlarms section.

Initial SLO candidates

While formal SLO definition is a Phase 12 deliverable, the following targets serve as initial operational guidance:

99% of events delivered to SQS within 60 seconds under normal load
FPolicy server reconnect within 60 seconds after ECS task replacement
SQS backlog recovered within 5 minutes after planned maintenance
Step Functions start latency under 2 minutes for EVENT_DRIVEN UCs

Persistent Store sizing

For environments requiring Persistent Store, size the volume based on expected outage duration:

required_size = event_rate_per_sec × max_outage_duration_sec × avg_event_size_bytes × safety_factor

Example: 100 events/sec × 300 s outage × 500 bytes × 2.0 safety factor = 30 MB (15 MB of raw event data, doubled by the safety factor). The 1 GB volume configured in Phase 11 provides room for roughly 2 million 500-byte event records before applying operational safety margin; with a 2.0 safety factor, treat the practical planning capacity as closer to 1 million events. High-frequency environments (1000+ events/sec) should increase the volume size proportionally and validate replay rate after reconnection.
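The sizing formula and the example numbers can be reproduced with a few lines of Python. This is a sketch for planning arithmetic only: the function names are hypothetical, 1 GB is taken as 10^9 bytes, and file system overhead on the Persistent Store volume is ignored.

```python
def required_store_bytes(
    event_rate_per_sec: float,
    max_outage_sec: float,
    avg_event_size_bytes: float,
    safety_factor: float = 2.0,
) -> float:
    """required_size = rate x outage duration x event size x safety factor."""
    return event_rate_per_sec * max_outage_sec * avg_event_size_bytes * safety_factor


def event_capacity(
    volume_bytes: float, avg_event_size_bytes: float, safety_factor: float = 1.0
) -> int:
    """Number of events a volume can hold at the given safety factor."""
    return int(volume_bytes / (avg_event_size_bytes * safety_factor))


print(required_store_bytes(100, 300, 500) / 1e6)    # 30.0 (MB)
print(event_capacity(1e9, 500))                     # 2000000
print(event_capacity(1e9, 500, safety_factor=2.0))  # 1000000
```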

Full sizing table with scenario-based estimates is available in docs/event-driven/fpolicy-persistent-store.md.

12. Next Phase Outlook

Phase 11 completes the event-driven integration layer. Remaining work for Phase 12:

Protocol and replay validation

protobuf TCP framing: Consult NetApp support on protobuf wire format; adapt read_fpolicy_message() for frameless protobuf
Persistent Store replay E2E validation: NFS/SMB file creation during Fargate restart → verify that queued events are replayed and delivered to SQS without loss
Replay storm testing: Generate events during FPolicy server downtime, reconnect, measure replay duration, SQS ingestion rate, Step Functions concurrency, and whether downstream throttling occurs

Scale and operations

High-load testing: 1000+ events/sec stress test with Fargate scaling
SLO definition: Define event ingestion latency, processing success rate, reconnect time, and replay completion time targets
Multi-account OAM Link: Deploy workload-account-oam-link.yaml in a second account

Production rollout

Production UC deployment: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify end-to-end file operation → Step Functions execution

Already verified in Phase 11 (no longer Phase 12 candidates):

✅ EventBridge dispatch routing (put-events → rule matching → CloudWatch Logs)
✅ Idempotency Store deduplication (conditional write rejection)
✅ Persistent Store configuration (ONTAP REST API)
✅ ECS task restart + ONTAP reconnection

Who should care about Phase 11?

Platform teams can now switch any UC between polling and event-driven with a single parameter change — no template surgery required
Operations / SRE teams get Cross-Account Observability with a pre-built dashboard, recommended alarm thresholds, and a rollout/rollback model
Compliance teams get Persistent Store support to close the tested Fargate restart event-loss window, with full replay validation planned for Phase 12
Security teams get documented guardrails for the ONTAP API automation path, including allowlist, audit recommendations, and event payload sensitivity guidance
Architecture teams get a documented FR-2 migration path — if/when native S3AP notifications become available, the transition is a parameter change for compatible UCs
Data engineering teams get file-readiness guidance for large-file analytics pipelines where event arrival precedes write completion
MSPs and partners get cross-account templates, tenant-aware naming guidance, and a standardized TriggerMode control for multi-customer deployments
Performance engineers get protobuf evaluation data (34.6% size reduction) and a clear path to enabling it once TCP framing is resolved
DevOps teams get CI-integrated template validation (cfn_yaml + validate-template) catching issues before deployment

Conclusion

Phase 11 transforms the FPolicy event-driven pipeline from a single-UC reference implementation into a production-ready, 17-UC integrated system. TriggerMode is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption, enabling platform teams to move individual UCs through POLLING → HYBRID → EVENT_DRIVEN at their own pace.

UC-specific EventBridge rules handle routing complexity through path-prefix ownership boundaries, while the Idempotency Store prevents duplicate processing in HYBRID mode. Persistent Store closes the known Fargate restart event-loss window at the ONTAP configuration layer, while Phase 12 will validate replay completeness with real NFS/SMB file operations.

The protobuf evaluation yielded a valuable real-world finding: ONTAP uses different TCP framing for protobuf messages than for XML. The field-level parser is validated against test fixtures, but the transport layer needs adaptation — a focused Phase 12 task requiring NetApp consultation rather than a blocker.

With 435 passing tests, 17 validated templates, 5 deployed CloudFormation stacks, production adoption guidance (rollout model, governance, security guardrails, event payload sensitivity, file readiness, alarm thresholds, Persistent Store sizing), and comprehensive documentation, Phase 11 delivers the operational maturity needed for enterprise-grade event-driven file workflows on FSx for ONTAP.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10

Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Fri, 15 May 2026 03:51:55 +0000

What This Post Covers

This is a companion article to the FSx for ONTAP S3 Access Points Serverless Patterns series. That series focuses on serverless patterns for FSx for ONTAP S3 Access Points across industries; this post covers the v4.2 release of the Agentic Access-Aware RAG system, a permission-aware RAG application built on FSx for ONTAP and Amazon Bedrock. The system is production-grade in the sense of CI coverage, permission filtering, guardrails, and deployment parameterization, although some v4.2 features still have follow-up E2E items listed in What's Next.

The v4.2 release adds five features that address real-world enterprise needs: intelligent model routing for cost optimization, SFTP-based document ingestion for partners who can't use web UIs, automatic KB synchronization, operational guardrails for FSx ONTAP automation, and voice-based interaction via WebRTC.

1. Smart Routing Model Expansion

The Problem

Enterprise RAG workloads have wildly different complexity levels. A simple "What's the office address?" query doesn't need the same model as "Analyze the Q4 financial report across all subsidiaries and identify cost reduction opportunities." Routing everything through a single model either wastes money or delivers poor quality.

The Solution: 3-Tier Automatic Routing

The default routing tiers are configured for the model set currently enabled in this deployment:

Simple (greetings, factual lookups) → Claude Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0)
Complex (analysis, comparison, summarization) → Claude 3.5 Sonnet v2 (anthropic.claude-3-5-sonnet-20241022-v2:0)
Full-context (multi-document reasoning, financial analysis) → Claude Opus 4 (anthropic.claude-opus-4-20250514-v1:0)

The exact model IDs are deployment parameters (lightweightModelId, powerfulModelId, heavyModelId), so teams can update to newer Sonnet/Opus releases without changing the routing logic.

┌─────────────────────────────────────────────────────┐
│                  User Query                          │
└──────────────────────┬──────────────────────────────┘
                       │
              ┌────────▼────────┐
              │  Complexity     │
              │  Classifier     │
              └───┬────┬────┬───┘
                  │    │    │
         Simple   │    │    │  Full-context
                  ▼    ▼    ▼
        ┌──────┐ ┌──────┐ ┌──────┐
        │Haiku │ │Sonnet│ │ Opus │
        │ 4.5  │ │3.5 v2│ │  4   │
        └──────┘ └──────┘ └──────┘

The cost labels below are illustrative per-query estimates for typical RAG prompts (~1K input tokens, ~500 output tokens) in this deployment, not fixed model prices. Actual cost depends on input/output tokens, prompt caching, region, and inference configuration.

Tier	Illustrative per-query cost
Haiku 4.5	~$0.001
Sonnet 3.5 v2	~$0.01
Opus 4	~$0.10
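To see how the illustrative figures combine, the sketch below computes a blended per-query cost for a traffic mix. The mix (70% simple, 25% complex, 5% full-context) is invented purely for the arithmetic and is not measured data from this deployment; `blended_cost` is a hypothetical helper.

```python
# Illustrative per-query costs from the table above (USD).
COST = {"simple": 0.001, "complex": 0.01, "full-context": 0.10}


def blended_cost(mix: dict) -> float:
    """Weighted average cost per query for a {tier: share} traffic mix."""
    total = sum(mix.values())
    return sum(COST[tier] * share for tier, share in mix.items()) / total


# Hypothetical traffic mix, for illustration only.
mix = {"simple": 70, "complex": 25, "full-context": 5}
print(round(blended_cost(mix), 4))  # about 0.0082, versus 0.10 if every query used Opus
```

The same function can be fed the observed route distribution once routing metrics are available, which gives a rough savings-versus-baseline figure.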

Additionally, GPT-5.5 can be exposed as a manual selection option when OpenAI models on Amazon Bedrock are enabled for the account. In this deployment, the manual route is parameterized as openai.gpt-5-5, but teams should verify the exact model ID, Region availability, inference profile, and preview access status in their own AWS account.

If the selected model is unavailable or throttled, the router falls back to the next configured tier and emits a RoutingFallback metric.

Implementation

The classifier analyzes query characteristics — keyword count, presence of analytical terms, document references, context size — and routes to the appropriate tier:

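// complexity-classifier.ts
// extractFeatures() and the ClassificationResult type are not shown here.
export function classifyQuery(
  query: string, contextSize: number, threshold: number
): ClassificationResult {
  const features = extractFeatures(query);

  // Greetings and very short queries go to the lightweight tier.
  if (features.isGreeting || features.wordCount < 5) {
    return { classification: 'simple', confidence: 0.9 };
  }
  // Analytical terms or a context size above the threshold go to the heavy tier.
  if (features.hasAnalyticalTerms || contextSize > threshold) {
    return { classification: 'full-context', confidence: 0.8 };
  }
  // Everything else defaults to the middle tier.
  return { classification: 'complex', confidence: 0.7 };
}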

CloudWatch EMF metrics track routing decisions, enabling cost analysis and route distribution monitoring:

Namespace: SmartRouting
Metrics: RoutingCount
Dimensions: RoutingTier (simple | complex | full-context | manual)
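For readers unfamiliar with CloudWatch Embedded Metric Format, a record carrying the namespace, metric, and dimension above looks roughly like this. The sketch is in Python for illustration and `routing_metric_record` is a hypothetical helper; it is not the router's own code.

```python
import json
import time


def routing_metric_record(tier: str, count: int = 1) -> str:
    """Build one EMF log line for a routing decision."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": "SmartRouting",
                    "Dimensions": [["RoutingTier"]],
                    "Metrics": [{"Name": "RoutingCount", "Unit": "Count"}],
                }
            ],
        },
        # Dimension and metric values live at the top level of the record.
        "RoutingTier": tier,
        "RoutingCount": count,
    }
    return json.dumps(record)


print(routing_metric_record("simple"))
```

When a Lambda function writes a line like this to stdout, CloudWatch Logs extracts the metric without a separate PutMetricData call.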

2. Transfer Family FSx ONTAP Ingestion

The Problem

Many enterprise partners — law firms, auditors, regulatory bodies — exchange documents via SFTP. They won't adopt a web UI. But their documents still need to flow into the RAG knowledge base with proper permission metadata.

Prerequisites and Limits

This pattern assumes:

FSx for ONTAP is running ONTAP 9.17.1 or later
The FSx file system and S3 Access Point are in the same AWS Region
The same AWS account owns the file system and access point
Transfer Family file operations follow the FSx S3 Access Point compatibility limits, including the 5 GB upload limit and unsupported rename/append operations

The Solution: SFTP → S3 Access Point → Bedrock KB

This feature bridges AWS Transfer Family with the existing permission-aware RAG pipeline. The architecture aligns with the approach described in the AWS Storage Blog — internal users access data via SMB/NFS, while external partners use SFTP, all reading/writing to the same FSx for ONTAP file system through S3 Access Points.

┌──────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Partner │     │ Transfer Family │     │ FSx ONTAP        │
│  (SFTP)  │────▶│ SFTP Server     │────▶│ S3 Access Point  │
└──────────┘     └─────────────────┘     └────────┬─────────┘
                                                   │
                                    ┌──────────────▼──────────────┐
                                    │  EventBridge Scheduler      │
                                    │  (5-min polling)            │
                                    └──────────────┬──────────────┘
                                                   │
                              ┌─────────────────────▼─────────────────────┐
                              │         Ingestion Trigger Lambda           │
                              │  • ListObjectsV2 → detect changes         │
                              │  • Invoke Metadata Generator (async)       │
                              │  • StartIngestionJob (deduplicated)        │
                              └─────────────────────┬─────────────────────┘
                                                    │
                    ┌───────────────────────────────┬┘
                    ▼                               ▼
        ┌───────────────────┐          ┌────────────────────┐
        │ Metadata Generator│          │ Bedrock KB         │
        │ (.metadata.json)  │          │ StartIngestionJob  │
        └───────────────────┘          └────────────────────┘

This remains a polling-based sync path; an event-based CloudTrail/EventBridge mode is listed in What's Next.

Key Design Decisions

1. HomeDirectoryMappings uses S3 AP Alias, not ARN

The Transfer Family documentation explains that FSx-backed Transfer Family access uses S3 Access Point aliases, but the failure mode is not obvious: using the full ARN in HomeDirectoryMappings.Target produced cryptic access-denied errors in my deployment.

// Correct: use alias (e.g., "my-ap-ext-s3alias")
homeDirectoryMappings: [{
  entry: '/',
  target: `/${s3AccessPointAlias}/uploads/${userName}`,
}]

2. Deduplication via IN_PROGRESS check

Before triggering StartIngestionJob, the Lambda checks if a job is already running:

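from typing import Optional

def should_trigger_ingestion(has_changes: bool, current_job_status: Optional[str]) -> bool:
    """Start a new ingestion job only when files changed and no job is already running."""
    if not has_changes:
        return False
    if current_job_status == 'IN_PROGRESS':
        return False
    return True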

3. Permission metadata auto-generation and trust boundary

When a new file is detected without a corresponding .metadata.json, the Metadata Generator Lambda creates one based on the SFTP user's permission mapping in DynamoDB:

{
  "allowed_sids": ["S-1-5-21-xxx-1001"],
  "allowed_uids": ["1001"],
  "allowed_gids": ["1001"],
  "source": "transfer-family",
  "uploaded_by": "partner-a",
  "uploaded_at": "2026-05-14T10:30:00Z"
}

The SFTP user does not supply permission metadata directly. The Metadata Generator derives it from an administrator-managed DynamoDB mapping and writes .metadata.json using a service role. Partner upload roles are scoped to their home directory (/uploads/{userName}/*).

Security note: The SFTP user's IAM role includes an explicit Deny statement for s3:PutObject and s3:DeleteObject on *.metadata.json keys within their home directory. This prevents partners from overwriting permission metadata generated by the service role.

The generated .metadata.json is then consumed by the existing permission-filtering RAG pipeline.

CDK Deployment

npx cdk deploy --all \
  -c enableTransferFamily=true \
  -c s3AccessPointArn="arn:aws:s3:ap-northeast-1:ACCOUNT:accesspoint/my-ap" \
  -c transferFamilyS3ApAlias="my-ap-ext-s3alias"

3. KB Auto-Sync

The Problem

Documents on FSx for ONTAP change continuously — new files added, existing files updated. Without automatic synchronization, the Bedrock Knowledge Base becomes stale.

The Solution

A lightweight Lambda (Python 3.12) polls the S3 Access Point every 5 minutes, compares against a DynamoDB inventory, and triggers StartIngestionJob only when changes are detected. The inventory is updated after StartIngestionJob is accepted (i.e., a job_id is returned); a future enhancement will move this to a pending/commit model so that ingestion jobs that fail after starting do not hide changes from the next scan. The current flow:

# Scan → Diff → Start job → Update inventory (on job accepted)
current_files = scan_s3_access_point(s3_ap_arn)
previous = get_inventory(table)
diff = compute_diff(current_files, previous)

if diff.has_changes:
    job_id = trigger_ingestion_if_needed(kb_id, ds_id, diff)
    if job_id:
        # Inventory updated after StartIngestionJob is accepted.
        # Future: move to pending/commit model keyed on job SUCCEEDED.
        update_inventory(table, current_files, previous, job_id)
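The diff step in the flow above could look like the following. This is a hypothetical sketch, not the repository's `compute_diff`: it assumes both the scan result and the inventory are `{key: etag}` dictionaries and uses the ETag as the change signal. Deletions are deliberately ignored here.

```python
from dataclasses import dataclass, field


@dataclass
class Diff:
    new: list = field(default_factory=list)
    changed: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)

    @property
    def has_changes(self) -> bool:
        # Only additions and updates count as changes in this sketch.
        return bool(self.new or self.changed)


def compute_diff(current: dict, previous: dict) -> Diff:
    """Classify every key in the current scan against the stored inventory."""
    diff = Diff()
    for key, etag in current.items():
        if key not in previous:
            diff.new.append(key)
        elif previous[key] != etag:
            diff.changed.append(key)
        else:
            diff.unchanged.append(key)
    return diff
```

Each scanned key lands in exactly one bucket, which is the kind of invariant a property-based test can assert for arbitrary inputs.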

Enable with a single context parameter:

npx cdk deploy --all -c enableKbAutoSync=true

4. Capacity Guardrails

The Problem

The FSx ONTAP operations automation (volume resize, snapshot management) can be dangerous if triggered too frequently — especially during incidents where monitoring alerts cascade.

The Solution

A guardrails module enforces:

Per-action rate limit: Max N executions per action per time window
Daily cap: Maximum total operations per day
Cooldown: Minimum interval between consecutive executions of the same action

@with_guardrails(action_name="volume_resize", max_per_hour=3, daily_cap=10, cooldown_seconds=300)
def resize_volume(volume_id: str, new_size_gb: int):
    # Only executes if guardrails pass
    ...

State is tracked in DynamoDB with TTL-based cleanup. The update_item call uses a ConditionExpression (attribute_not_exists(action_count) OR action_count < :max_actions) to prevent concurrent requests from bypassing the daily cap. Concurrent resize requests can still succeed while capacity remains under the configured cap, but the conditional update prevents them from collectively exceeding it. CloudWatch metrics expose guardrail rejections for operational visibility.
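The three checks can be illustrated with an in-memory version of the decorator. This is a simplified sketch and not the module's implementation: it keeps state in a process-local dictionary instead of DynamoDB, applies the daily cap per action over a rolling 24-hour window, and the `clock` parameter and `GuardrailRejected` exception are hypothetical additions for testability.

```python
import time
from collections import defaultdict
from functools import wraps


class GuardrailRejected(Exception):
    """Raised when a call would violate a configured limit."""


# action_name -> timestamps of accepted calls (in-memory stand-in for DynamoDB)
_history = defaultdict(list)


def with_guardrails(action_name, max_per_hour, daily_cap, cooldown_seconds,
                    clock=time.time):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            # Keep only calls from the last 24 hours.
            calls = [t for t in _history[action_name] if now - t < 86400]
            if calls and now - calls[-1] < cooldown_seconds:
                raise GuardrailRejected("cooldown")
            if sum(1 for t in calls if now - t < 3600) >= max_per_hour:
                raise GuardrailRejected("hourly limit")
            if len(calls) >= daily_cap:
                raise GuardrailRejected("daily cap")
            calls.append(now)
            _history[action_name] = calls
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Unlike the DynamoDB conditional update, this in-memory version is not safe across concurrent Lambda invocations; it only shows the order and meaning of the checks.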

5. Voice Chat WebRTC (Phase 2)

The Problem

Knowledge workers often want to ask questions hands-free — during meetings, while reviewing physical documents, or when multitasking.

The Solution

A Strategy pattern implementation supporting both REST-based (Phase 1) and WebRTC-based (Phase 2) voice interaction:

interface VoiceSessionStrategy {
  connect(): Promise<void>;
  disconnect(): Promise<void>;
  sendAudio(data: ArrayBuffer): Promise<void>;
  onTranscript(callback: (text: string) => void): void;
}

Phase 2 uses:

Amazon Kinesis Video Streams Signaling Channel for WebRTC negotiation
Pipecat Voice Agent on Bedrock AgentCore Runtime for speech-to-text-to-RAG-to-speech
Automatic fallback: If WebRTC connection fails, seamlessly falls back to REST-based voice

Phase 2 implements the client/server strategy and fallback behavior; full AgentCore Runtime deployment automation remains in What's Next.

The WebRTC path is implemented behind the existing voice strategy interface, but production deployments should add authentication, rate limiting, CORS tightening, sanitized logging, and input validation around the signaling and session launch APIs — as noted in the Pipecat AgentCore WebRTC KVS example.

Testing Strategy

All features are backed by comprehensive tests:

Category	Framework	Tests
CDK Assertion	Jest + aws-cdk-lib/assertions	42
Python Lambda Unit	pytest + moto	85
Property-Based	Hypothesis (Python)	6
Property-Based	fast-check (TypeScript)	12
Voice WebRTC	Jest	61
Smart Routing	Jest + fast-check	64

The Hypothesis property-based tests verify invariants like:

Change detection correctly classifies new/changed/unchanged files for any input combination
Ingestion deduplication logic is correct for all (changes × job_status) combinations
Metadata JSON always conforms to the required schema regardless of input permissions

Security & Portability

Before publishing, we ensured:

No hardcoded AWS account IDs in any public source file
Parameterized ECR repository name (ecrRepositoryName CDK prop)
Parameterized REGION in all shell scripts (${AWS_REGION:-ap-northeast-1})
Masked screenshots — AWS account IDs in console screenshots are covered
.gitignore coverage — cdk.context.json, cdk.out/, .env, .hypothesis/ all excluded

What's Next

AgentCore Runtime deployment for the Pipecat Voice Agent (currently requires CLI — CloudFormation support pending)
CloudTrail/EventBridge mode for Transfer Family ingestion (near-real-time event-based detection instead of 5-minute polling)
End-to-end SFTP upload test with actual SSH keys and partner simulation

End-to-End Architecture Flow

┌──────────────┐     ┌─────────────────┐     ┌──────────────────────────┐
│ External     │     │ Transfer Family │     │ FSx for ONTAP            │
│ Partner      │────▶│ SFTP Server     │────▶│ S3 Access Point          │
│ (SFTP)       │     └─────────────────┘     │ (data stays on FSxN)     │
└──────────────┘                              └────────────┬─────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Metadata Generator Lambda   │
                                            │ (admin-managed permissions) │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ KB Auto-Sync / Ingestion    │
                                            │ Trigger Lambda              │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Amazon Bedrock              │
                                            │ Knowledge Base              │
                                            └──────────────┬──────────────┘
                                                           │
┌──────────────┐     ┌─────────────────┐     ┌────────────▼─────────────┐
│ End User     │────▶│ Smart Routing   │────▶│ Permission-Aware RAG     │
│ (Chat/Voice) │     │ (Haiku/Sonnet/  │     │ (fail-closed: missing    │
└──────────────┘     │  Opus)          │     │  metadata = excluded)    │
                     └─────────────────┘     └──────────────────────────┘

The RAG retrieval path is designed to fail closed: if permission metadata is missing, malformed, or unverifiable for a document, that document is excluded from retrieval results rather than exposed broadly. This fail-closed behavior is the core safety boundary of the permission-aware RAG design: a document without trusted metadata is treated as not retrievable.
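A fail-closed filter of this kind can be sketched in a few lines. The function below is an illustration under assumptions, not the retrieval layer's actual code: `is_retrievable` is a hypothetical name, the required keys follow the .metadata.json example earlier in the post, and only SID matching is shown (UID/GID matching and metadata trust verification are omitted).

```python
from typing import Optional

REQUIRED_KEYS = ("allowed_sids", "allowed_uids", "allowed_gids")


def is_retrievable(metadata: Optional[dict], user_sids: set) -> bool:
    """Fail closed: missing or malformed metadata excludes the document."""
    if not isinstance(metadata, dict):
        return False
    for key in REQUIRED_KEYS:
        value = metadata.get(key)
        # Each permission field must be a list of strings.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            return False
    # The document is visible only if the user holds at least one allowed SID.
    return bool(user_sids & set(metadata["allowed_sids"]))
```

Every early return yields False, so any unexpected metadata shape results in exclusion rather than exposure.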

Known Limitations

v4.2 is production-oriented, but a few items remain follow-up work:

KB Auto-Sync currently updates inventory when StartIngestionJob is accepted rather than when the job reaches SUCCEEDED. Failed ingestion jobs may mask unprocessed changes until the pending/commit model is implemented.
Transfer Family ingestion is implemented and unit-tested; full partner-style E2E validation with SSH keys is still planned. The current auto-sync path focuses on detecting additions and updates — delete reconciliation is follow-up work.
AgentCore Runtime deployment automation is not yet CloudFormation-based; the Pipecat Voice Agent requires CLI/SDK deployment.
Voice sessions require production policies for authentication, rate limiting, transcript retention, and sanitized logging before production rollout.
Smart Routing emits routing metrics, but monthly cost dashboards, budget enforcement, and savings-vs-baseline reporting are follow-up work.
Fail-closed enforcement happens in the retrieval filtering layer: documents without valid, trusted permission metadata are excluded before the model receives context. Audit events for retrieval decisions (DocumentSuppressedByPermission) are candidates for the next release.

Manual high-cost or preview model selection (GPT-5.5) should be governed by application-level authorization and audited separately from automatic routing. The networking model — public Transfer Family endpoint vs VPC-hosted endpoint, partner IP allowlists, and private DNS requirements — should be selected per customer environment.

Who Should Care About v4.2?

AI platform teams get model routing that balances quality and cost without manual intervention.
Security teams get administrator-derived permission metadata and explicit IAM protection against metadata overwrite.
Data teams get automatic KB synchronization from FSx for ONTAP through S3 Access Points.
Partners and SIs get an SFTP-to-RAG ingestion path for customers who exchange documents with external organizations.
Operations teams get guardrails for FSx ONTAP automation actions with conditional write protection.
Application teams get a WebRTC voice strategy with REST fallback.

Conclusion

v4.2 moves the permission-aware RAG system from a secure document Q&A application toward an enterprise ingestion and interaction platform.

Smart Routing reduces model cost without removing access to stronger models. Transfer Family ingestion lets partners keep using SFTP while documents land directly on FSx for ONTAP through S3 Access Points. KB Auto-Sync keeps Bedrock Knowledge Bases fresh, Capacity Guardrails make ONTAP automation safer, and WebRTC Voice Chat opens a lower-friction interaction path.

The common theme is the same as the FSx for ONTAP S3 Access Points pattern series: keep enterprise file data on FSx for ONTAP, expose it safely through S3-compatible access paths, and automate around it with serverless and managed AWS services.

Resources

GitHub: FSx-for-ONTAP-Agentic-Access-Aware-RAG
Release: v4.2.0
Related series: FSx for ONTAP S3 Access Points Serverless Patterns
AWS Blog: Secure SFTP file sharing with AWS Transfer Family, Amazon FSx for NetApp ONTAP, and S3 Access Points
AWS Docs: Access your FSx for NetApp ONTAP file systems with Transfer Family

Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Thu, 14 May 2026 12:23:40 +0000

What This Post Covers

1. Smart Routing Model Expansion

The Problem

The Solution: 3-Tier Automatic Routing

The default routing tiers are configured for the model set currently enabled in this deployment:

Simple (greetings, factual lookups) → Claude Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0)
Complex (analysis, comparison, summarization) → Claude 3.5 Sonnet v2 (anthropic.claude-3-5-sonnet-20241022-v2:0)
Full-context (multi-document reasoning, financial analysis) → Claude Opus 4 (anthropic.claude-opus-4-0-20250514-v1:0)

The exact model IDs are deployment parameters (lightweightModelId, powerfulModelId, heavyModelId), so teams can update to newer Sonnet/Opus releases without changing the routing logic.

┌─────────────────────────────────────────────────────┐
│                  User Query                          │
└──────────────────────┬──────────────────────────────┘
                       │
              ┌────────▼────────┐
              │  Complexity     │
              │  Classifier     │
              └───┬────┬────┬───┘
                  │    │    │
         Simple   │    │    │  Full-context
                  ▼    ▼    ▼
        ┌──────┐ ┌──────┐ ┌──────┐
        │Haiku │ │Sonnet│ │ Opus │
        │ 4.5  │ │3.5 v2│ │  4   │
        └──────┘ └──────┘ └──────┘

Tier	Illustrative per-query cost
Haiku 4.5	~$0.001
Sonnet 3.5 v2	~$0.01
Opus 4	~$0.10

If the selected model is unavailable or throttled, the router falls back to the next configured tier and emits a RoutingFallback metric.

Implementation

The classifier analyzes query characteristics — keyword count, presence of analytical terms, document references, context size — and routes to the appropriate tier:

// complexity-classifier.ts
export function classifyQuery(
  query: string, contextSize: number, threshold: number
): ClassificationResult {
  const features = extractFeatures(query);

  if (features.isGreeting || features.wordCount < 5) 
    return { classification: 'simple', confidence: 0.9 };
  if (features.hasAnalyticalTerms || contextSize > threshold) 
    return { classification: 'full-context', confidence: 0.8 };
  return { classification: 'complex', confidence: 0.7 };
}

CloudWatch EMF metrics track routing decisions, enabling cost analysis and route distribution monitoring:

Namespace: SmartRouting
Metrics: RoutingCount
Dimensions: RoutingTier (simple | complex | full-context | manual)
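
As a rough illustration, the sketch below builds one CloudWatch Embedded Metric Format (EMF) log line for a single routing decision. The helper name build_routing_emf is hypothetical; only the namespace, metric name, and dimension name come from the list above, and the remaining structure follows the public EMF format.

```python
import json
import time
from typing import Optional

ALLOWED_TIERS = {"simple", "complex", "full-context", "manual"}

def build_routing_emf(tier: str, timestamp_ms: Optional[int] = None) -> str:
    """Return one EMF log line recording a single routing decision.

    Hypothetical helper: not taken from the repository.
    """
    if tier not in ALLOWED_TIERS:
        raise ValueError(f"unknown routing tier: {tier}")
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": "SmartRouting",
                "Dimensions": [["RoutingTier"]],
                "Metrics": [{"Name": "RoutingCount", "Unit": "Count"}],
            }],
        },
        "RoutingTier": tier,  # dimension value
        "RoutingCount": 1,    # metric value
    }
    return json.dumps(record)
```

In a Lambda function, printing this line to stdout is typically enough for CloudWatch to extract the metric.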

2. Transfer Family FSx ONTAP Ingestion

The Problem

Prerequisites and Limits

This pattern assumes:

FSx for ONTAP is running ONTAP 9.17.1 or later
The FSx file system and S3 Access Point are in the same AWS Region
The same AWS account owns the file system and access point
Transfer Family file operations follow the FSx S3 Access Point compatibility limits, including the 5 GB upload limit and unsupported rename/append operations

The Solution: SFTP → S3 Access Point → Bedrock KB

┌──────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Partner │     │ Transfer Family │     │ FSx ONTAP        │
│  (SFTP)  │────▶│ SFTP Server     │────▶│ S3 Access Point  │
└──────────┘     └─────────────────┘     └────────┬─────────┘
                                                   │
                                    ┌──────────────▼──────────────┐
                                    │  EventBridge Scheduler      │
                                    │  (5-min polling)            │
                                    └──────────────┬──────────────┘
                                                   │
                              ┌─────────────────────▼─────────────────────┐
                              │         Ingestion Trigger Lambda           │
                              │  • ListObjectsV2 → detect changes         │
                              │  • Invoke Metadata Generator (async)       │
                              │  • StartIngestionJob (deduplicated)        │
                              └─────────────────────┬─────────────────────┘
                                                    │
                    ┌───────────────────────────────┬┘
                    ▼                               ▼
        ┌───────────────────┐          ┌────────────────────┐
        │ Metadata Generator│          │ Bedrock KB         │
        │ (.metadata.json)  │          │ StartIngestionJob  │
        └───────────────────┘          └────────────────────┘

This remains a polling-based sync path; an event-based CloudTrail/EventBridge mode is listed in What's Next.

Key Design Decisions

1. HomeDirectoryMappings uses S3 AP Alias, not ARN

// Correct: use alias (e.g., "my-ap-ext-s3alias")
homeDirectoryMappings: [{
  entry: '/',
  target: `/${s3AccessPointAlias}/uploads/${userName}`,
}]

2. Deduplication via IN_PROGRESS check

Before triggering StartIngestionJob, the Lambda checks if a job is already running:

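from typing import Optional

def should_trigger_ingestion(has_changes: bool, current_job_status: Optional[str]) -> bool:
    # Nothing changed since the last scan: no ingestion needed.
    if not has_changes:
        return False
    # A job is already running: skip to avoid a duplicate StartIngestionJob call.
    if current_job_status == 'IN_PROGRESS':
        return False
    return True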

3. Permission metadata auto-generation and trust boundary

When a new file is detected without a corresponding .metadata.json, the Metadata Generator Lambda creates one based on the SFTP user's permission mapping in DynamoDB:

{
  "allowed_sids": ["S-1-5-21-xxx-1001"],
  "allowed_uids": ["1001"],
  "allowed_gids": ["1001"],
  "source": "transfer-family",
  "uploaded_by": "partner-a",
  "uploaded_at": "2026-05-14T10:30:00Z"
}

Security note: The SFTP user's IAM role includes an explicit Deny statement for s3:PutObject and s3:DeleteObject on *.metadata.json keys within their home directory. This prevents partners from overwriting permission metadata generated by the service role.

The generated metadata is consumed by the existing permission-filtering RAG pipeline.
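
The explicit Deny described in the security note could look roughly like the statement built below. This is a sketch, not the repository's policy: the helper name metadata_deny_statement and the Sid are hypothetical, and the uploads/<user> prefix is assumed from the HomeDirectoryMappings example.

```python
def metadata_deny_statement(access_point_arn: str, user_name: str) -> dict:
    """Build an explicit Deny for metadata sidecar files in one user's home.

    Hypothetical helper; the real policy may be scoped differently.
    """
    return {
        "Sid": "DenyPermissionMetadataWrites",
        "Effect": "Deny",
        "Action": ["s3:PutObject", "s3:DeleteObject"],
        # Access point object ARNs take the form <access-point-arn>/object/<key>.
        "Resource": (
            f"{access_point_arn}/object/uploads/{user_name}/*.metadata.json"
        ),
    }
```

Because an explicit Deny overrides any Allow in IAM evaluation, the partner role can keep broad write access to its home directory while the sidecar files stay writable only by the service role.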

CDK Deployment

npx cdk deploy --all \
  -c enableTransferFamily=true \
  -c s3AccessPointArn="arn:aws:s3:ap-northeast-1:ACCOUNT:accesspoint/my-ap" \
  -c transferFamilyS3ApAlias="my-ap-ext-s3alias"

3. KB Auto-Sync

The Problem

Documents on FSx for ONTAP change continuously — new files added, existing files updated. Without automatic synchronization, the Bedrock Knowledge Base becomes stale.

The Solution

# Scan → Diff → Start job → Update inventory (on job accepted)
current_files = scan_s3_access_point(s3_ap_arn)
previous = get_inventory(table)
diff = compute_diff(current_files, previous)

if diff.has_changes:
    job_id = trigger_ingestion_if_needed(kb_id, ds_id, diff)
    if job_id:
        # Inventory updated after StartIngestionJob is accepted.
        # Future: move to pending/commit model keyed on job SUCCEEDED.
        update_inventory(table, current_files, previous, job_id)

Enable with a single context parameter:

npx cdk deploy --all -c enableKbAutoSync=true
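
The compute_diff step in the snippet above is not shown in the article. One plausible shape, assuming both the current listing and the stored inventory are dicts mapping object key to a fingerprint such as ETag or LastModified (an assumption, not the repository's data model), is:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Diff:
    new: List[str] = field(default_factory=list)
    changed: List[str] = field(default_factory=list)
    unchanged: List[str] = field(default_factory=list)

    @property
    def has_changes(self) -> bool:
        return bool(self.new or self.changed)

def compute_diff(current: Dict[str, str], previous: Dict[str, str]) -> Diff:
    """Classify keys as new, changed, or unchanged.

    Keys present only in `previous` (deletions) are ignored here, which
    mirrors the additions-and-updates focus of the current auto-sync path.
    """
    diff = Diff()
    for key, fingerprint in sorted(current.items()):
        if key not in previous:
            diff.new.append(key)
        elif previous[key] != fingerprint:
            diff.changed.append(key)
        else:
            diff.unchanged.append(key)
    return diff
```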

4. Capacity Guardrails

The Problem

The FSx ONTAP operations automation (volume resize, snapshot management) can be dangerous if triggered too frequently — especially during incidents where monitoring alerts cascade.

The Solution

A guardrails module that enforces:

Per-action rate limit: Max N executions per action per time window
Daily cap: Maximum total operations per day
Cooldown: Minimum interval between consecutive executions of the same action

@with_guardrails(action_name="volume_resize", max_per_hour=3, daily_cap=10, cooldown_seconds=300)
def resize_volume(volume_id: str, new_size_gb: int):
    # Only executes if guardrails pass
    ...
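
To make the three checks concrete, here is a minimal in-memory sketch of such a decorator. It is an illustration under stated assumptions, not the repository's module: the GuardrailViolation exception and the clock parameter are hypothetical, the daily cap is applied per action over a rolling 24-hour window, and state lives in process memory, so a Lambda deployment would need a shared store instead.

```python
import functools
import time
from collections import defaultdict, deque

class GuardrailViolation(RuntimeError):
    """Raised when a guarded action is blocked (hypothetical name)."""

# action_name -> timestamps of executions that were allowed to run
_history = defaultdict(deque)

def with_guardrails(action_name, max_per_hour, daily_cap, cooldown_seconds,
                    clock=time.time):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            runs = _history[action_name]
            # Forget executions older than 24 hours.
            while runs and now - runs[0] >= 86400:
                runs.popleft()
            if runs and now - runs[-1] < cooldown_seconds:
                raise GuardrailViolation(f"{action_name}: cooldown active")
            if sum(1 for t in runs if now - t < 3600) >= max_per_hour:
                raise GuardrailViolation(f"{action_name}: hourly limit reached")
            if len(runs) >= daily_cap:
                raise GuardrailViolation(f"{action_name}: daily cap reached")
            runs.append(now)  # record only executions that pass all checks
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Injecting the clock keeps the guardrail logic testable without sleeping.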

5. Voice Chat WebRTC (Phase 2)

The Problem

Knowledge workers often want to ask questions hands-free — during meetings, while reviewing physical documents, or when multitasking.

The Solution

A Strategy pattern implementation supporting both REST-based (Phase 1) and WebRTC-based (Phase 2) voice interaction:

interface VoiceSessionStrategy {
  connect(): Promise<void>;
  disconnect(): Promise<void>;
  sendAudio(data: ArrayBuffer): Promise<void>;
  onTranscript(callback: (text: string) => void): void;
}

Phase 2 uses:

Amazon Kinesis Video Streams Signaling Channel for WebRTC negotiation
Pipecat Voice Agent on Bedrock AgentCore Runtime for speech-to-text-to-RAG-to-speech
Automatic fallback: If WebRTC connection fails, seamlessly falls back to REST-based voice

Phase 2 implements the client/server strategy and fallback behavior; full AgentCore Runtime deployment automation remains in What's Next.

Testing Strategy

The v4.2 features are covered by the following test suites:

Category	Framework	Tests
CDK Assertion	Jest + aws-cdk-lib/assertions	42
Python Lambda Unit	pytest + moto	85
Property-Based	Hypothesis (Python)	6
Property-Based	fast-check (TypeScript)	12
Voice WebRTC	Jest	61
Smart Routing	Jest + fast-check	64

The Hypothesis property-based tests verify invariants like:

Change detection correctly classifies new/changed/unchanged files for any input combination
Ingestion deduplication logic is correct for all (changes × job_status) combinations
Metadata JSON always conforms to the required schema regardless of input permissions

Security & Portability

Before publishing, we ensured:

No hardcoded AWS account IDs in any public source file
Parameterized ECR repository name (ecrRepositoryName CDK prop)
Parameterized REGION in all shell scripts (${AWS_REGION:-ap-northeast-1})
Masked screenshots — AWS account IDs in console screenshots are covered
.gitignore coverage — cdk.context.json, cdk.out/, .env, .hypothesis/ all excluded

What's Next

AgentCore Runtime deployment for the Pipecat Voice Agent (currently requires CLI — CloudFormation support pending)
CloudTrail/EventBridge mode for Transfer Family ingestion (near-real-time event-based detection instead of 5-minute polling)
End-to-end SFTP upload test with actual SSH keys and partner simulation

End-to-End Architecture Flow

┌──────────────┐     ┌─────────────────┐     ┌──────────────────────────┐
│ External     │     │ Transfer Family │     │ FSx for ONTAP            │
│ Partner      │────▶│ SFTP Server     │────▶│ S3 Access Point          │
│ (SFTP)       │     └─────────────────┘     │ (data stays on FSxN)     │
└──────────────┘                              └────────────┬─────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Metadata Generator Lambda   │
                                            │ (admin-managed permissions) │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ KB Auto-Sync / Ingestion    │
                                            │ Trigger Lambda              │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Amazon Bedrock              │
                                            │ Knowledge Base              │
                                            └──────────────┬──────────────┘
                                                           │
┌──────────────┐     ┌─────────────────┐     ┌────────────▼─────────────┐
│ End User     │────▶│ Smart Routing   │────▶│ Permission-Aware RAG     │
│ (Chat/Voice) │     │ (Haiku/Sonnet/  │     │ (fail-closed: missing    │
└──────────────┘     │  Opus)          │     │  metadata = excluded)    │
                     └─────────────────┘     └──────────────────────────┘

Known Limitations

v4.2 is production-oriented, but a few items remain follow-up work:

KB Auto-Sync currently updates inventory when StartIngestionJob is accepted rather than when the job reaches SUCCEEDED. Failed ingestion jobs may mask unprocessed changes until the pending/commit model is implemented.
Transfer Family ingestion is implemented and unit-tested; full partner-style E2E validation with SSH keys is still planned. The current auto-sync path focuses on detecting additions and updates — delete reconciliation is follow-up work.
AgentCore Runtime deployment automation is not yet CloudFormation-based; the Pipecat Voice Agent requires CLI/SDK deployment.
Voice sessions require production policies for authentication, rate limiting, transcript retention, and sanitized logging before production rollout.
Smart Routing emits routing metrics, but monthly cost dashboards, budget enforcement, and savings-vs-baseline reporting are follow-up work.
Fail-closed enforcement happens in the retrieval filtering layer: documents without valid, trusted permission metadata are excluded before the model receives context. Audit events for retrieval decisions (DocumentSuppressedByPermission) are candidates for the next release.

Who Should Care About v4.2?

AI platform teams get model routing that balances quality and cost without manual intervention.
Security teams get administrator-derived permission metadata and explicit IAM protection against metadata overwrite.
Data teams get automatic KB synchronization from FSx for ONTAP through S3 Access Points.
Partners and SIs get an SFTP-to-RAG ingestion path for customers who exchange documents with external organizations.
Operations teams get guardrails for FSx ONTAP automation actions with conditional write protection.
Application teams get a WebRTC voice strategy with REST fallback.

Conclusion

v4.2 moves the permission-aware RAG system from a secure document Q&A application toward an enterprise ingestion and interaction platform.

Resources

GitHub: FSx-for-ONTAP-Agentic-Access-Aware-RAG
Release: v4.2.0
Related series: FSx for ONTAP S3 Access Points Serverless Patterns
AWS Blog: Secure SFTP file sharing with AWS Transfer Family, Amazon FSx for NetApp ONTAP, and S3 Access Points
AWS Docs: Access your FSx for NetApp ONTAP file systems with Transfer Family

FPolicy Event-Driven Pipeline, Multi-Account StackSets, and Cost Optimization — FSx for ONTAP S3 Access Points, Phase 10

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Thu, 14 May 2026 04:44:54 +0000

TL;DR

This is Phase 10 of the FSx for ONTAP S3 Access Points serverless pattern library. Building on Phase 9, Phase 10 delivers:

FPolicy event-driven integration: ONTAP FPolicy → ECS Fargate TCP server → SQS → EventBridge custom bus. The shared event-ingestion pipeline is verified end-to-end; UC-specific dispatch follows in Phase 11.
Multi-account StackSets: All 17 UC templates validated for StackSets compatibility (0 errors) + admin/execution role templates
UC-specific alarm profiles: BATCH / REALTIME / HIGH_VOLUME — three profiles with workload-appropriate thresholds
Cost optimization: Dynamic MaxConcurrency controller + business-hours scheduling (rate(1h) vs rate(6h))
E2E verification: NFSv3 ✅, NFSv4.0 ✅, NFSv4.1 ✅, SMB ✅, NFSv4.2 ❌ (unsupported by ONTAP FPolicy)

In short: Phase 9 completed the operational baseline. Phase 10 builds and verifies the shared event-ingestion pipeline that the pattern library has needed since Phase 1 — without waiting for AWS to ship native S3AP notifications. UC-specific dispatch wiring follows in Phase 11.

📊 Repository stats: 17 industry use cases + event-driven FPolicy + 6 FlexCache/FlexClone patterns | 1,499+ tests | 126 test files | Python 3.12 + CloudFormation (SAM Transform)

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

1a. Trigger Mode Decision Guide

Before diving into the FPolicy implementation, here is the decision framework for choosing between the three trigger modes this library supports:

Mode	Choose when	Avoid when
POLLING	Hourly or batch processing is acceptable; simplest operating model	Sub-minute detection is required
EVENT_DRIVEN	Near-real-time ingestion is required and event loss during reconnect is acceptable	Compliance requires durable event capture without Persistent Store
HYBRID	You need faster detection plus periodic reconciliation to fill gaps	You want the simplest operating model

Dimension	POLLING	EVENT_DRIVEN	HYBRID
Detection latency	Minutes to hours	Seconds	Seconds + periodic catch-up
Monthly cost (infra only)	~$6-21	~$32-60	~$42-86
Operational complexity	Low	High	Highest
Event durability	High (full scan each time)	Medium (gap during restart)	High (reconciliation fills gaps)
ONTAP dependency	None (S3 AP only)	High (FPolicy config)	High

Decision flow:

Real-time detection not required → POLLING (start here for most workloads)
Real-time required + Persistent Store available (ONTAP 9.14.1+) → EVENT_DRIVEN
Real-time required + no Persistent Store → HYBRID (polling fills gaps)

Full guide: Trigger Mode Decision Guide
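
The decision flow above reduces to two questions. A small sketch, with a hypothetical function name and boolean inputs chosen for illustration:

```python
def choose_trigger_mode(realtime_required: bool,
                        persistent_store_available: bool) -> str:
    """Map the two decision-flow questions to a TriggerMode value."""
    if not realtime_required:
        return "POLLING"       # start here for most workloads
    if persistent_store_available:
        return "EVENT_DRIVEN"  # Persistent Store covers reconnect gaps
    return "HYBRID"            # polling fills the gaps instead
```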

1. FPolicy Event-Driven Architecture

Background: why FPolicy

Every UC in this pattern library runs on a polling model: EventBridge Scheduler → Discovery Lambda → ListObjectsV2. This works, but it means latency is bounded by the polling interval (typically 1 hour). AWS still does not support GetBucketNotificationConfiguration for S3 Access Points attached to FSx for ONTAP volumes (FR-2 remains open).

ONTAP FPolicy is a file-operation notification framework built into every ONTAP system. In external server mode, it sends TCP notifications for create/write/delete/rename events to a registered server. By connecting this to AWS services, we get near-real-time event-driven processing without waiting for FR-2.

This implementation builds on Shengyu Fang's reference implementation, adapted for the 17-UC pattern library architecture.

S3 API Does Not Remove File-System Semantics

S3 Access Points for FSx for ONTAP expose file data through S3 APIs, but authorization is a two-layer model:

AWS-side authorization: IAM identity-based policy, S3 Access Point resource policy, VPC endpoint policy, SCP — all relevant policies are evaluated and all must permit the request
File-system-side authorization: The file system identity (UNIX UID or Windows domain\user) associated with the access point determines what file operations are authorized based on that user's permissions on the underlying volume

This means that least-privilege design must cover both AWS IAM and ONTAP file permissions. A common mistake is securing only the IAM layer while using a root-equivalent file system identity (UID 0), which grants full access to all files regardless of IAM restrictions.

Key behaviors:

If the file system user has read-only access, write requests through the access point are blocked — even if IAM permits s3:PutObject
Attaching an S3 access point does not change the volume's behavior when accessed via NFS or SMB
Block Public Access is always enabled and cannot be changed for FSx for ONTAP access points

For the full authorization model documentation, see S3AP Authorization Model.

Architecture

FSx ONTAP SVM (file operations: create/write/delete/rename)
│
│ TCP (port 9898, async mode)
▼
FPolicy External Server (ECS Fargate, ARM64 Python 3.12)
│
├─ [Near-real-time] → SQS Ingestion Queue
│                        │
│                        │ Event Source Mapping
│                        ▼
│                     Bridge Lambda → EventBridge Custom Bus
│                                          │
│                                   UC1 reference rule (Phase 10)
│                                          │
│                                   UC1 Step Functions
│
│                                   ── Phase 11 ──
│                                   UC-specific dispatch rules
│                                   → Step Functions / Lambda (per-UC)
│
└─ [Batch] → JSON Lines log (FSxN S3AP) → Log Query Lambda

ONTAP initiates the TCP connection to the FPolicy server — not the other way around. This means the server simply listens on a port. Because ONTAP maintains a persistent TCP control channel with keep-alive, Lambda is not viable (15-minute timeout). ECS Fargate provides the long-running TCP listener without OS management overhead.

Why not NLB?

The initial design placed an NLB in front of Fargate for IP stability. In our AWS verification, the NLB path established a TCP connection, but the FPolicy handshake did not complete.

Additional verification (2026-05-14): We tested both preserve_client_ip.enabled=true and false on the NLB target group. In both configurations, ONTAP did not establish an FPolicy session through the NLB. The only connections observed from the NLB IP were health checks (TCP connect → immediate close at 10-second intervals). No FPolicy NEGO_REQ was received via the NLB path.

One plausible explanation is that ONTAP FPolicy's external-engine expects a direct TCP connection to the primary-servers IP. When the NLB forwards the connection to a Fargate task with a different IP, the FPolicy session establishment conditions are not met — possibly because ONTAP validates the connection endpoint or because the NLB's connection lifecycle (idle timeout, deregistration delay) interferes with the persistent control channel that FPolicy requires.

This remains documented as an observed deployment limitation in our environment (FSxN ONTAP 9.17.1P6, internal NLB with IP targets), not a universal NLB claim. If your environment differs, testing the NLB path is straightforward — set the NLB IP as the external-engine primary-servers and check vserver fpolicy show-engine for connection state.

Solution: Fargate task direct IP connection. IP stability is handled by an EventBridge-triggered Lambda that updates the ONTAP external-engine configuration when the Fargate task IP changes:

ECS Task State Change (RUNNING) → EventBridge Rule → IP Updater Lambda
→ ONTAP REST API: disable policy → update engine primary_servers → enable policy
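
The first step of the IP Updater Lambda is to find the new task IP in the incoming event. The sketch below assumes the usual ECS Task State Change shape, where the ENI attachment carries a privateIPv4Address detail; the function name is hypothetical, and the field names should be checked against a real event from your account before relying on them. The ONTAP REST calls that follow are not shown.

```python
from typing import Optional

def extract_task_private_ip(event: dict) -> Optional[str]:
    """Return the task's private IPv4 address, or None if not available.

    Assumes event["detail"]["attachments"][*]["details"] is a list of
    {"name": ..., "value": ...} pairs, as in ECS task state change events.
    """
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "RUNNING":
        return None  # only act once the task is actually running
    for attachment in detail.get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "privateIPv4Address":
                return item.get("value")
    return None
```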

The direct-IP model assumes a single active Fargate task (DesiredCount: 1) and requires network reachability from the FSxN SVM data LIFs to the task ENI on the FPolicy TCP port. This design prioritizes connection stability over horizontal scalability; multi-task active-active configurations are not supported due to FPolicy session constraints. Security groups must allow ONTAP-initiated inbound connections on port 9898. During Fargate task restarts, event handling depends on the FPolicy policy's is-mandatory setting: with is-mandatory=false (our configuration), file operations continue unblocked but notifications are dropped until the new task connects. See the Event durability note below for Persistent Store guidance.

TriggerMode parameter

Phase 10 introduces the TriggerMode parameter scaffolding and verifies the shared FPolicy → SQS → EventBridge pipeline end-to-end. A reference implementation is deployed in the legal-compliance (UC1) template. UC-specific Step Functions dispatch rules are intentionally deferred to Phase 11.

Value	Phase 10 behavior
`POLLING` (default)	Existing EventBridge Scheduler + Discovery Lambda
`EVENT_DRIVEN`	Shared FPolicy event pipeline enabled; UC-specific dispatch wiring is Phase 11
`HYBRID`	Polling remains active; event-driven deduplication path prepared for Phase 11

Default POLLING ensures zero impact on existing deployments.

NFSv3 write-complete delay

When FPolicy fires a notification, the file write may not be complete — particularly with NFSv3 which lacks close semantics. The server inserts a configurable delay (WRITE_COMPLETE_DELAY_SEC, default 5s) after receiving NOTI_REQ, and Step Functions include retry logic for incomplete files.

Event durability note

This Phase 10 implementation is designed for near-real-time processing, not end-to-end durable event capture during Fargate task restarts. With is-mandatory=false, ONTAP drops notifications when no FPolicy server is connected — file operations continue unblocked but events are lost. Environments that cannot tolerate event loss should evaluate ONTAP FPolicy Persistent Store (ONTAP 9.14.1+), available for asynchronous non-mandatory external FPolicy policies. Persistent Store queues events on the SVM during server disconnection and can replay them when the external server reconnects. Note that queue sizing, replay handling, and deduplication require application-level design. This is a Phase 11+ candidate (design-dependent).

Note (Phase 12 update): This Phase 10 article documents the initial event-durability boundary. Persistent Store replay validation is covered in Phase 12, where replay behavior was tested for 5-event and 20-event disconnect scenarios with zero event loss confirmed. Use the Deployment Profiles guide to choose the appropriate durability level for your workload.

Deployment Profiles — From PoC to Compliance

The event-driven FPolicy pattern supports three deployment profiles, each with clear boundaries for event loss tolerance and operational complexity:

Dimension	PoC/Demo	Production	Compliance-sensitive
FPolicy Server	Fargate (direct IP)	EC2 static IP or NLB	EC2 static IP + NLB
`is-mandatory`	`false`	`true` (ONTAP 9.15.1+)	`true` (ONTAP 9.15.1+)
Persistent Store	Not required	Recommended	Required (ONTAP 9.14.1+)
Retry / Dedup	Best-effort	DynamoDB idempotency	DynamoDB + S3 Object Lock lineage
Alarm Profile	Minimal (error only)	Full (latency + error + backlog)	Full + audit trail
Event Loss Tolerance	Acceptable (30-60s gap)	Near-zero (retry compensates)	Zero (Persistent Store + audit)

Key design decisions:

is-mandatory=true (ONTAP 9.15.1+): Blocks file operations when the FPolicy server is unavailable — prevents event loss but impacts availability. Use only with redundant server deployment.
Persistent Store (ONTAP 9.14.1+): Buffers events in a dedicated SVM volume during server disconnection. Events are replayed in order upon reconnection. Sizing: 1 GB ≈ 2M events at ~500 bytes each.
Replay recovery time: 100K buffered events at 100 events/sec = ~17 minutes to catch up.
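
The sizing and replay figures above are simple arithmetic. A sketch for plugging in your own numbers, assuming decimal gigabytes and the ~500 bytes per event used in the text (function names are hypothetical):

```python
def persistent_store_bytes(event_count: int, bytes_per_event: int = 500) -> int:
    """Approximate buffer size needed to hold event_count events."""
    return event_count * bytes_per_event

def replay_minutes(buffered_events: int, events_per_second: float) -> float:
    """Approximate time to drain the buffer at a steady replay rate."""
    return buffered_events / events_per_second / 60
```

With the defaults, 2,000,000 events need about 1,000,000,000 bytes (1 GB), and 100,000 buffered events at 100 events/sec take about 16.7 minutes to replay.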

The progression path is incremental: PoC → Production → Compliance-sensitive, adding capabilities at each stage without redesigning the core architecture.

Full profile documentation: Deployment Profiles

2. E2E Verification Results

Protocol support matrix

NFS Version	Mount Option	FPolicy NOTI_REQ	Result
NFSv3	`vers=3`	✅ Immediate	Works
NFSv4.0	`vers=4.0`	✅ Immediate	Works
NFSv4.1	`vers=4.1`	✅ Immediate	Works
NFSv4.2	`vers=4.2`	❌ Not sent	Unsupported
NFSv4 (auto)	`vers=4`	❌ Not sent	Negotiates to 4.2
SMB/CIFS	—	✅	Works

Key finding: mount -o vers=4 on modern Linux negotiates to NFSv4.2, which ONTAP FPolicy does not support. Always use vers=4.1 explicitly. This is documented in NetApp's FPolicy Auditing FAQ.

ONTAP version note: NFSv4.1 FPolicy monitoring support was introduced in ONTAP 9.15.1. Earlier versions support SMB, NFSv3, and NFSv4.0 only. Our test environment runs ONTAP 9.17.1P6, which includes NFSv4.1 support. See NetApp FPolicy event configuration documentation for the full protocol support matrix by ONTAP version.

Path extraction bug fix

ONTAP sends file paths in XML format within NOTI_REQ:

<PathNameType>WIN_NAME</PathNameType><PathName>\file.txt</PathName>

The initial regex extraction left residual XML tags in the file_path field. Fixed by adding an _extract_xml_value() helper with multi-tag fallback and residual tag stripping.

Before fix:

{"file_path": "<PathNameType>WIN_NAME</PathNameType><PathName>\\file.txt</PathName>"}

After fix:

{"file_path": "file.txt"}
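
A minimal sketch of this kind of extraction follows. It is not the repository's _extract_xml_value() helper: the function name, the tag list, and the choice to strip leading path separators are assumptions based on the before/after output shown here.

```python
import re
from typing import Optional, Sequence

_RESIDUAL_TAG = re.compile(r"<[^>]+>")

def extract_xml_value(payload: str,
                      tags: Sequence[str] = ("PathName",)) -> Optional[str]:
    """Return the text of the first tag found, trying each tag in order."""
    for tag in tags:
        name = re.escape(tag)
        match = re.search(rf"<{name}>(.*?)</{name}>", payload, re.DOTALL)
        if match:
            # Remove any nested or residual tags, then leading separators.
            value = _RESIDUAL_TAG.sub("", match.group(1)).strip()
            return value.lstrip("\\/")
    return None
```

Matching on the full tag name with its closing bracket keeps PathName from matching inside PathNameType.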

volume_name / svm_name resolution

ONTAP's NOTI_REQ body does not always include volume and SVM names in a parseable location. Resolution strategy:

Extract from NEGO_REQ session context (SVM name available at handshake)
Fall back to environment variables (SVM_NAME, VOLUME_NAME) set in the ECS task definition

Complete E2E flow (verified)

NFSv3 file create (tee /mnt/fsxn/file.txt)
→ ONTAP FPolicy NOTI_REQ
→ Fargate FPolicy Server receives event
→ SQS SendMessage
→ Bridge Lambda → EventBridge Custom Bus

Actual EventBridge event:

{
  "detail-type": "FPolicy File Operation",
  "source": "fsxn.fpolicy",
  "detail": {
    "event_id": "2175e878-1e0c-48ef-a8b3-53664d5d5b06",
    "operation_type": "create",
    "file_path": "test-eb-e2e-1778707951.txt",
    "volume_name": "vol1",
    "svm_name": "FSxN_OnPre",
    "timestamp": "2026-05-13T21:32:37.680626+00:00",
    "client_ip": "10.0.10.67"
  }
}
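
For orientation, the Bridge Lambda's job is to turn each SQS message into an EventBridge PutEvents entry that produces an event like the one above. The sketch below assumes the SQS message body is the JSON detail object itself and that three fields are mandatory; both are assumptions, and the function name is hypothetical.

```python
import json

REQUIRED_FIELDS = {"event_id", "operation_type", "file_path"}

def to_eventbridge_entry(sqs_body: str, bus_name: str) -> dict:
    """Convert one SQS message body into a PutEvents entry dict."""
    detail = json.loads(sqs_body)
    missing = REQUIRED_FIELDS - detail.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {
        "Source": "fsxn.fpolicy",
        "DetailType": "FPolicy File Operation",
        "Detail": json.dumps(detail),  # PutEvents expects a JSON string
        "EventBusName": bus_name,
    }
```

The resulting dicts can be passed as the Entries list of a boto3 events client put_events call, which accepts at most 10 entries per request.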

3. Unified UC Directory Structure

Phase 10 introduces event-driven-fpolicy/ as a first-class shared pattern directory, using the same structure as the UC directories. It is not counted as one of the 17 industry UCs — it is a shared event-ingestion reference implementation that any UC can consume via EventBridge rules.

event-driven-fpolicy/
├── docs/                    # 8 languages (ja, en, ko, zh-CN, zh-TW, fr, de, es)
│   ├── architecture.md      # + .en.md, .ko.md, etc.
│   └── demo-guide.md
├── functions/
│   ├── ip_updater/          # Fargate IP → ONTAP REST API
│   └── sqs_to_eventbridge/  # Bridge Lambda
├── schemas/
│   └── fpolicy-event-schema.json
├── server/
│   ├── Dockerfile           # ARM64 Python 3.12
│   ├── fpolicy_server.py    # TCP listener + SQS sender
│   └── requirements.txt
├── tests/
├── README.md                # + 7 language variants
├── template.yaml            # Fargate deployment (ComputeType=fargate)
└── template-ec2.yaml        # EC2 deployment (ComputeType=ec2)

A single template.yaml with a ComputeType parameter (fargate/ec2) uses CloudFormation Conditions to select the appropriate resource set. The EC2 variant uses a t4g.micro with a static private IP, so no IP update Lambda is needed, at roughly $4/month. The Fargate variant avoids EC2 management but requires task-IP tracking and has a higher baseline cost (about $10/month for Fargate compute alone, plus VPC Endpoints). Actual cost varies by region, runtime hours, and VPC Endpoint configuration.

4. Multi-Account StackSets

StackSets compatibility validator

The new validator, scripts/check_stacksets_compatibility.py, checks all 17 UC templates for:

Hardcoded Account IDs — 12-digit numeric strings that would break in other accounts
Resource name uniqueness — names must include !Sub with AccountId or StackName
Export name collisions — exports that would conflict across accounts
VPC/Subnet/SecurityGroup parameterization — must not be hardcoded

Result: 17/17 templates, 0 errors, 0 warnings.
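
The first of these checks can be approximated with a regular expression. This is a simplified sketch rather than the validator's actual logic, and the function name is hypothetical; a real check would also need to filter out other 12-digit values such as timestamps.

```python
import re
from typing import List, Tuple

# Exactly 12 digits, not part of a longer digit run.
_ACCOUNT_ID = re.compile(r"(?<!\d)\d{12}(?!\d)")

def find_hardcoded_account_ids(template_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, match) pairs for 12-digit numeric strings."""
    hits = []
    for line_no, line in enumerate(template_text.splitlines(), start=1):
        for match in _ACCOUNT_ID.finditer(line):
            hits.append((line_no, match.group()))
    return hits
```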

StackSets role templates

Template	Purpose
`shared/cfn/stacksets-admin.yaml`	Admin account role for StackSet management
`shared/cfn/stacksets-execution.yaml`	Target account execution role (least-privilege)

The execution role uses an Organization ID condition in its trust policy — accounts outside the Organization cannot assume it. Permissions are scoped to Lambda, Step Functions, DynamoDB, S3, CloudWatch, EventBridge, SNS, and Secrets Manager only.

Automatic deployment

With AutoDeployment: Enabled on the StackSet, new accounts joining the Organization automatically receive the UC templates. No manual intervention required.

Scope note: Phase 10 validates that templates can be distributed safely across accounts via StackSets (deployment compatibility). It does not yet validate cross-account FSxN S3AP data access, shared VPC ownership, or centralized operations across accounts. Those runtime cross-account patterns are Phase 11+ work.

5. Alarm Profiles and Cost Optimization

UC-specific alarm profiles

Not all UCs have the same latency and failure-tolerance requirements. A batch genomics pipeline (UC3) tolerates higher failure rates than a real-time compliance monitor (UC12). Phase 10 introduces three profiles:

Profile	Failure Rate Threshold	Error Threshold	Target Workloads
BATCH	10%	3/hour	Periodic batch processing (UC1-5, UC9)
REALTIME	5%	1/hour	Real-time processing (UC10-14)
HIGH_VOLUME	15%	5/hour	High-volume file processing (UC6-8, UC15-17)

Each UC template now has an AlarmProfile parameter (BATCH / REALTIME / HIGH_VOLUME / CUSTOM). The CUSTOM option exposes CustomFailureThreshold and CustomErrorThreshold for fine-grained control.

Dynamic MaxConcurrency controller

shared/max_concurrency_controller.py calculates optimal Map state parallelism based on actual file volume:

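def calculate_max_concurrency(
    detected_file_count: int,
    ontap_rate_limit: int = 100,
    api_calls_per_file: int = 3,
    upper_bound: int = 40
) -> int:
    """Return the Map state MaxConcurrency for the detected file count."""
    optimal = min(
        detected_file_count,                     # never more workers than files
        ontap_rate_limit // api_calls_per_file,  # stay within the ONTAP rate limit
        upper_bound                              # hard ceiling on parallelism
    )
    # Always run at least one iteration, even when no files are detected.
    return max(optimal, 1)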

This replaces the static MaxConcurrency: 10 from Phase 8. For 500 files with default settings, it calculates min(500, 100 // 3, 40) = min(500, 33, 40) = 33. That is 3.3x the parallelism of the static value, and it stays within the configured ONTAP rate limit (33 × 3 = 99 ≤ 100 API calls).

Business-hours cost scheduling

With EnableCostScheduling=true, two EventBridge Schedulers dynamically adjust the polling frequency:

Time Period	Schedule
Business hours (weekday 09:00-18:00 JST)	`rate(1 hour)`
Off-hours (weekday 18:00-09:00 + weekends)	`rate(6 hours)`

BusinessHoursStart and BusinessHoursEnd parameters allow customization. The Cost Scheduler emits an EstimatedMonthlySavings CloudWatch metric for visibility.
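
The schedule selection above can be sketched as a small function. This is an illustration under stated assumptions, not the repository's Cost Scheduler: the `polling_rate` name is hypothetical, the start hour is treated as inclusive and the end hour as exclusive, and JST is modeled as a fixed UTC+9 offset.

```python
from datetime import datetime, timedelta, timezone

# JST has no daylight saving time, so a fixed offset is sufficient here.
JST = timezone(timedelta(hours=9), name="JST")


def polling_rate(now: datetime, business_hours_start: int = 9, business_hours_end: int = 18) -> str:
    """Pick the schedule expression for a timezone-aware timestamp."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(JST)
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    in_business_hours = business_hours_start <= local.hour < business_hours_end
    return "rate(1 hour)" if is_weekday and in_business_hours else "rate(6 hours)"


# Friday 10:00 JST falls inside business hours; Saturday does not.
print(polling_rate(datetime(2026, 5, 22, 10, 0, tzinfo=JST)))
print(polling_rate(datetime(2026, 5, 23, 10, 0, tzinfo=JST)))
```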

S3 Access Points Performance Considerations

Key performance characteristics (from AWS documentation):

Latency: Tens of milliseconds (consistent with S3 bucket access)
Throughput: Depends on the FSx file system's provisioned throughput capacity — S3 AP, NFS, and SMB all share the same throughput pool
Object size limit: 5 GB for uploads (PutObject); downloads (GetObject) can be larger
Storage class: FSX_ONTAP only; SSE-FSX encryption only

Design implications for serverless pipelines:

Lambda memory → network bandwidth: Higher Lambda memory allocates more network bandwidth. For 10 MB file processing, 1,769 MB (1 vCPU) provides ~600 Mbps.
Step Functions Map concurrency: Limit MaxConcurrency based on FSx provisioned throughput. Formula: fsxn_throughput / per_lambda_throughput. Example: 512 MBps ÷ 50 MBps per Lambda ≈ 10 concurrent executions.
ListObjectsV2 pagination: MaxKeys=1000 per page. Listing 10,000 files therefore takes 10 pages × ~50 ms ≈ 500 ms at minimum. Use Prefix filtering to reduce scope.
Shared throughput: S3 AP, NFS, and SMB all share the same FSx throughput capacity. Account for existing NFS/SMB workloads when sizing Map concurrency.
Retry strategy: Use botocore.config.Config(retries={"mode": "adaptive"}) for automatic backoff on SlowDown (503) responses.
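
The sizing arithmetic in the list above can be written down as two helper functions. This is a back-of-the-envelope sketch: the function names are hypothetical, the `reserved_for_nfs_smb` parameter is an assumption added to model the shared-throughput point, and the 50 ms per page figure is the rough estimate quoted above rather than a measured value.

```python
import math


def map_concurrency_from_throughput(
    fsxn_throughput_mbytes_per_s: float,
    per_lambda_throughput_mbytes_per_s: float,
    reserved_for_nfs_smb_mbytes_per_s: float = 0.0,
) -> int:
    """Estimate MaxConcurrency as available FSx throughput divided by per-Lambda throughput."""
    available = max(fsxn_throughput_mbytes_per_s - reserved_for_nfs_smb_mbytes_per_s, 0.0)
    # Round down so the estimate never exceeds the available capacity; keep at least 1.
    return max(int(available // per_lambda_throughput_mbytes_per_s), 1)


def min_listing_latency_ms(object_count: int, page_size: int = 1000, per_page_latency_ms: float = 50.0) -> float:
    """Lower bound for a sequential ListObjectsV2 walk: pages times per-page latency."""
    pages = max(math.ceil(object_count / page_size), 1)  # an empty listing still costs one request
    return pages * per_page_latency_ms


print(map_concurrency_from_throughput(512, 50))       # the 512 MBps / 50 MBps example
print(map_concurrency_from_throughput(512, 50, 200))  # same, with 200 MBps set aside for NFS/SMB
print(min_listing_latency_ms(10_000))                 # the 10,000-file example
```

Setting aside part of the pool for NFS/SMB clients lowers the concurrency estimate, which is the practical effect of the shared-throughput caveat.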

Full analysis: S3AP Performance Considerations

6. Test Results

Category	Count	Result
Phase 10 new tests	62	All PASS ✅
Property-based tests (Hypothesis)	7 properties × 100-200 iterations	All PASS ✅
Existing tests (Phase 1-9)	982	No regressions ✅
Total	1044+	All PASS

Property-based tests

Property	What it verifies
FPolicy event round-trip	Serialize → deserialize produces equivalent object
MaxConcurrency bounds	Result always ≥ 1 and ≤ upper_bound
MaxConcurrency correctness	Result matches the min() formula
Zero files → 1	Empty input never produces 0
StackSets Account ID detection	Known violations are always caught
Cost savings non-negativity	Estimated savings ≥ 0 for all inputs
Same rate → ~0 savings	Equal business/off-hours rates produce near-zero savings
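
To show what the MaxConcurrency properties check, here is a dependency-free sketch that uses the standard library's `random` module instead of Hypothesis. It is not the repository's test code; the input ranges and the `check_max_concurrency_properties` name are assumptions for this example.

```python
import random


def calculate_max_concurrency(detected_file_count, ontap_rate_limit=100, api_calls_per_file=3, upper_bound=40):
    # Same formula as shared/max_concurrency_controller.py shown earlier.
    optimal = min(detected_file_count, ontap_rate_limit // api_calls_per_file, upper_bound)
    return max(optimal, 1)


def check_max_concurrency_properties(iterations: int = 200, seed: int = 0) -> int:
    """Run randomized checks for the bounds, correctness, and zero-file properties."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(iterations):
        files = rng.randint(0, 100_000)
        rate_limit = rng.randint(1, 1_000)
        calls = rng.randint(1, 10)
        upper = rng.randint(1, 100)
        result = calculate_max_concurrency(files, rate_limit, calls, upper)
        # Bounds: result is always >= 1 and <= upper_bound.
        assert 1 <= result <= upper
        # Correctness: result matches the min() formula, floored at 1.
        assert result == max(min(files, rate_limit // calls, upper), 1)
    # Zero files -> 1: empty input never produces 0.
    assert calculate_max_concurrency(0) == 1
    return iterations


print(check_max_concurrency_properties())
```

Hypothesis adds input shrinking and smarter edge-case generation on top of this idea, which is why the repository uses it rather than a hand-rolled loop.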

Validator results

Validator	Result
`check_s3ap_iam_patterns.py`	17/17 clean ✅
`check_handler_names.py`	87 handlers, 0 issues ✅
`check_conditional_refs.py`	17 templates, 0 issues ✅
`check_stacksets_compatibility.py`	17 templates, 0 errors ✅
`_check_sensitive_leaks.py`	160 images, 0 leaks ✅
cfn-guard IAM security	Advisory, 0 new violations ✅

7. Deployment Learnings

Several issues surfaced during AWS verification that are worth documenting:

Issue	Root Cause	Fix
NLB path: FPolicy handshake fails	ONTAP FPolicy expects direct TCP to primary-servers IP; NLB target routing does not satisfy session establishment (tested with preserve_client_ip true and false)	Direct Fargate IP + EventBridge IP auto-update
jsonschema 4.18+ fails on ARM64 Lambda	rpds-py native dependency	Pin to 4.17.x
SCHEMA_PATH differs between Lambda and local	Different working directories	Fallback path resolution
Guard Hook rejects Condition-based `Resource: "*"`	Overly strict rule	Updated rule to allow `Condition exists`
ECR pull fails in private subnet	Missing VPC Endpoints	Added ECR, STS, S3, Logs, SQS endpoints
KEEP_ALIVE timeout race	Server timeout = keep_alive_interval	Increased to 300s
NFSv4 events not firing	`vers=4` negotiates to unsupported 4.2	Explicit `vers=4.1`
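
The "fallback path resolution" fix for SCHEMA_PATH can be sketched as follows. This is one possible approach, not the repository's implementation: the `resolve_schema_path` function and the file name are hypothetical, and the search order (the `LAMBDA_TASK_ROOT` environment variable that Lambda sets, then the current working directory, then caller-supplied roots) is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def resolve_schema_path(relative_path: str, extra_roots: Iterable[Path] = ()) -> Path:
    """Return the first existing candidate for relative_path across several root directories."""
    roots = []
    task_root = os.environ.get("LAMBDA_TASK_ROOT")  # set inside the Lambda runtime
    if task_root:
        roots.append(Path(task_root))
    roots.append(Path.cwd())  # typical location when running tests locally
    roots.extend(extra_roots)
    for root in roots:
        candidate = root / relative_path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{relative_path} not found under: {[str(r) for r in roots]}")


# Demo with a throwaway directory standing in for a local checkout.
with tempfile.TemporaryDirectory() as tmp:
    schema = Path(tmp) / "schemas" / "example_fpolicy_event.json"
    schema.parent.mkdir()
    schema.write_text("{}")
    resolved = resolve_schema_path("schemas/example_fpolicy_event.json", [Path(tmp)])
    found_expected_file = resolved == schema
    print(resolved.name)
```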

7a. Beyond AI/ML — Enterprise Workload Examples

This pattern is not limited to AI/ML demos. The S3 Access Points architecture applies to any enterprise file data on FSx for ONTAP:

SAP peripheral files and exported business documents — IDoc exports, ABAP report outputs, BW data extracts. Process without changing SAP file interfaces.
EDI / HULFT landing zones — Automatic validation and format conversion of received files. No changes to existing EDI/HULFT infrastructure.
Audit evidence and compliance reports — Periodic integrity checks, retention management, with NTFS permissions preserved.
Batch output from EC2-based business applications — Add serverless post-processing pipelines without changing application output paths.
Scanned documents and regulated records — OCR, classification, PII detection on documents stored for long-term retention.

The design principle: File data stays on FSx for ONTAP. S3 Access Points provide the bridge to AWS-native automation, AI/ML, and analytics services — without data movement, without changing existing NFS/SMB access patterns. Existing backup (SnapMirror), DR, and access controls remain unchanged.

This positioning matters for partner and SI proposals: the value is not "replace your file server" but "connect your existing file data to AWS services without migration."

Full examples with architecture diagrams: Enterprise Workload Examples

8. Next Phase Outlook

Phase 10 established the shared event-ingestion pipeline (FPolicy → SQS → EventBridge). Phase 11 will wire those events into UC-specific processing. Candidates:

TriggerMode rollout to all 17 UCs: Expand the reference implementation from UC1 to all templates, with UC-specific EventBridge dispatch rules
FPolicy → UC-specific Step Functions dispatch: EventBridge rules matching file path prefixes/extensions to UC targets
protobuf format evaluation: ONTAP 9.15.1+ supports protobuf for higher-performance notifications
Cross-Account Observability live verification: Deploy the shared-services-observability template and validate metric aggregation
Persistent Store evaluation: Phase 11+ design-dependent work for compliance-sensitive environments that cannot tolerate event loss during Fargate task restarts
FR-2 migration path: When AWS ships native S3AP notifications, the TriggerMode parameter provides a clean migration — switch from EVENT_DRIVEN to native events without changing UC logic
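
As a sketch of what prefix and extension based dispatch could look like, the example below routes a file path to a UC target in plain Python. The rule table, volume paths, and file extensions are invented for illustration (only the UC3 genomics and UC12 compliance labels come from this article), and Phase 11 may express the same matching as EventBridge rule patterns instead.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DispatchRule:
    target: str
    path_prefix: str
    extensions: Tuple[str, ...]


# Hypothetical rules: paths and extensions are placeholders, not repository values.
RULES = (
    DispatchRule("UC3", "/vol1/genomics/", (".fastq", ".bam")),
    DispatchRule("UC12", "/vol1/compliance/", (".pdf",)),
)


def select_target(file_path: str, rules: Sequence[DispatchRule] = RULES) -> Optional[str]:
    """Return the first UC whose prefix and extension both match, or None."""
    suffix = PurePosixPath(file_path).suffix.lower()  # case-insensitive extension match
    for rule in rules:
        if file_path.startswith(rule.path_prefix) and suffix in rule.extensions:
            return rule.target
    return None


print(select_target("/vol1/genomics/run1/sample.bam"))
print(select_target("/vol1/compliance/2026/report.PDF"))
print(select_target("/vol1/other/readme.txt"))
```

Returning None for unmatched paths leaves room for a default target or a dead-letter queue, depending on how strict the routing needs to be.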

Why Native S3AP Notifications Still Matter

This FPolicy-based pipeline proves that customers need event-driven processing for FSx for ONTAP S3 Access Points. However, it also quantifies where a native AWS-managed notification feature would eliminate undifferentiated heavy lifting:

Operational burden	Current (FPolicy)	With native notifications
Long-running TCP listener	Fargate 24/7 (~$30-50/month)	Not needed
Fargate task IP tracking	IP Updater Lambda + ONTAP REST API	Not needed
ONTAP external-engine reconfiguration	On every deployment	Not needed
FPolicy protocol dependency	NFSv4.2 not supported	Protocol-independent
Event durability semantics	Requires Persistent Store (ONTAP 9.14.1+)	S3-equivalent at-least-once
Cross-account event routing	SQS → Bridge Lambda → EventBridge → cross-account	Standard EventBridge rules

Implementation complexity: 15-20 CloudFormation resources and 2 Lambda functions (IP Updater + Bridge) for FPolicy, vs an estimated 3-5 resources for native EventBridge integration.

The FPolicy implementation is not a replacement for native S3AP notifications — it is evidence of customer demand and an interim event-driven pattern. The operational complexity documented here directly maps to the value a native feature would deliver.

Full analysis: Native S3AP Notifications Evidence

Partner/SI Delivery Checklist

For partners and SIs proposing this pattern to enterprise customers, a structured delivery checklist is available covering:

Customer workload classification — SAP-adjacent / file server / regulated records / AI analytics
Trigger mode selection — POLLING / EVENT_DRIVEN / HYBRID based on latency and durability requirements
Deployment profile — PoC / Production / Compliance-sensitive with clear boundaries
Access model design — IAM + S3 AP policy + ONTAP file permissions (dual-layer)
Network model — Private VPC / VPC Origin AP / Cross-Account / Shared Services
Operating model — Customer-operated / partner-operated / managed service
Success criteria — Latency, throughput, cost, auditability, recovery behavior

The checklist also includes a 4-phase PoC implementation guide (environment prep → POLLING verification → EVENT_DRIVEN verification → evaluation) and FAQ for common partner questions.

Full checklist: Partner/SI Delivery Checklist

Who should care about Phase 10?

Platform teams get an event-driven alternative to polling — near-real-time latency instead of hourly polling intervals
Security teams get StackSets compatibility validation ensuring no hardcoded account IDs leak across environments
Operations teams get workload-appropriate alarm thresholds that reduce alert fatigue
Finance teams get fewer off-hours polling invocations through business-hours scheduling, with savings surfaced as a CloudWatch metric
Storage teams get a documented FPolicy integration pattern with protocol-level verification results
Multi-account teams get ready-to-deploy StackSets admin/execution roles with Organization-scoped trust
Partners and SIs get a PoC-ready event-driven alternative for customers who cannot wait for native S3AP notifications
Regulated workload owners get a clear event-durability boundary: near-real-time by default, Persistent Store required when event loss is unacceptable
SAP / ERP teams get a pattern for connecting peripheral files (IDoc, HULFT, batch output) to AWS AI/analytics without changing existing file interfaces

Conclusion

Phase 10 solves the problem that has been deferred since Phase 1: how do you get event-driven processing from FSx for ONTAP when S3AP native notifications don't exist?

The answer is ONTAP FPolicy — a mature notification framework that predates S3 Access Points by over a decade. By connecting it to ECS Fargate → SQS → EventBridge, Phase 10 established the shared event-ingestion pipeline and the TriggerMode parameter foundation needed to support polling, event-driven, and hybrid modes. UC-specific dispatch remains the main Phase 11 focus. The default remains POLLING, so existing deployments are unaffected.

The E2E verification confirmed that NFSv3, NFSv4.0, NFSv4.1 (ONTAP 9.15.1+), and SMB all work. NFSv4.2 does not — and the most common failure mode is mount -o vers=4 silently negotiating to 4.2. This is now documented and the setup guide recommends explicit version pinning.

Beyond FPolicy, Phase 10 matures the operational model: StackSets deployment compatibility for multi-account distribution, alarm profiles for workload-appropriate monitoring, and cost scheduling for environments that don't need 24/7 polling. Combined with the 6-validator CI pipeline and 1044+ passing tests, the pattern library is ready for production-style multi-account template distribution, while runtime cross-account data-path validation remains Phase 11+ work.

Design Guides

The following design guides have been added to the repository:

Document	Description
S3AP Authorization Model	Dual-layer authorization (IAM + file system)
Deployment Profiles	PoC / Production / Compliance-sensitive
Trigger Mode Decision Guide	POLLING / EVENT_DRIVEN / HYBRID
Enterprise Workload Examples	SAP, EDI, audit, batch output
S3AP Performance	Throughput, Lambda sizing, concurrency
Native Notifications Evidence	Feature request evidence
Partner/SI Delivery Checklist	Partner/SI proposal and delivery guide

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Update Note

This article describes the Phase 10 baseline — the first verified shared FPolicy ingestion pipeline. The event-driven pipeline is expanded across all 17 UCs in Phase 11 and operationally hardened in Phase 12 with Persistent Store replay validation, SLO observability, capacity guardrails, and secrets rotation. Phase 13 adds FlexClone/FlexCache serverless automation.

Use the FPolicy event-driven mode for PoC and near-real-time ingestion. For regulated or compliance-sensitive workloads, evaluate Persistent Store, replay handling, deduplication, and operational runbooks before treating the pipeline as durable. See Deployment Profiles for guidance.

Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9

Next phases: Phase 11 (UC-specific dispatch) · Phase 12 (Persistent Store replay + SLO hardening) · Phase 13 (FlexClone/FlexCache automation)
