<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</title>
    <description>The latest articles on DEV Community by Yoshiki Fujiwara(藤原 善基)@AWS Community Builder (@yoshikifujiwara).</description>
    <link>https://dev.to/yoshikifujiwara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1143688%2F2e0886ff-292c-4e8f-a588-bc7629c2321b.jpeg</url>
      <title>DEV Community: Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</title>
      <link>https://dev.to/yoshikifujiwara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yoshikifujiwara"/>
    <language>en</language>
    <item>
      <title>Query NAS Data In Place with Athena and FSx for ONTAP S3 Access Points</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Fri, 22 May 2026 08:28:29 +0000</pubDate>
      <link>https://dev.to/aws-builders/query-nas-data-in-place-with-athena-and-fsx-for-ontap-s3-access-points-3lhh</link>
      <guid>https://dev.to/aws-builders/query-nas-data-in-place-with-athena-and-fsx-for-ontap-s3-access-points-3lhh</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;You can query files stored on Amazon FSx for NetApp ONTAP directly from Amazon Athena through an FSx-attached S3 Access Point — without copying the source data to an S3 bucket. The source files remain on the FSx for ONTAP volume and are accessed through S3 object APIs.&lt;/p&gt;

&lt;p&gt;I verified this end-to-end: Parquet files written via NFS are immediately queryable from Athena using the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;official AWS tutorial pattern&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is Part 1 of a series exploring how FSx for ONTAP S3 Access Points integrate with various Lakehouse platforms. Part 2 covers Databricks — where platform security boundaries make things significantly more complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-lakehouse-integrations" rel="noopener noreferrer"&gt;fsxn-lakehouse-integrations&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to reproduce this validation, start from the repository's &lt;code&gt;integrations/athena/&lt;/code&gt; directory, which contains CloudFormation templates, sample data generators, and query scripts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Verified in This Article
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Verified:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NFS-written Parquet file is visible via FSx S3 AP (&lt;code&gt;ListObjectsV2&lt;/code&gt;, &lt;code&gt;StorageClass: FSX_ONTAP&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Athena can query the file through Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Writing query results to a standard S3 bucket works, matching the documented pattern&lt;/li&gt;
&lt;li&gt;Experimental FSx S3 AP result output worked in my environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not verified:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delta / Hudi / Iceberg writes&lt;/li&gt;
&lt;li&gt;CTAS production pattern to FSx S3 AP&lt;/li&gt;
&lt;li&gt;S3 bucket event notification semantics&lt;/li&gt;
&lt;li&gt;Large-scale performance limits&lt;/li&gt;
&lt;li&gt;CloudTrail data event coverage (audit evidence approach should be validated per environment)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Enterprise file servers hold massive amounts of data — design files, inspection images, research documents, log archives. Traditionally, to analyze this data with cloud-native tools like Athena, you had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy data from NFS/SMB to S3 (DataSync, scripts, etc.)&lt;/li&gt;
&lt;li&gt;Maintain sync pipelines&lt;/li&gt;
&lt;li&gt;Pay for duplicate storage&lt;/li&gt;
&lt;li&gt;Deal with stale data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;FSx for ONTAP S3 Access Points&lt;/strong&gt; (launched December 2025) change this. The same volume that serves NFS/SMB clients now exposes an S3-compatible API. Athena queries hit the same bytes that your NFS clients read — no copy required for the source dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Users (NFS/SMB)                    Athena (S3 API)
      │                                  │
      ▼                                  ▼
┌─────────────────────────────────────────────┐
│         FSx for ONTAP Volume                │
│         /analytics/sensor_data.parquet      │
│         /analytics/logs/*.json              │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Use Cases This Unlocks
&lt;/h2&gt;

&lt;p&gt;This pattern is useful when enterprise data already lives on NFS/SMB file shares and analytics teams want to query it without building a copy pipeline to S3.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing&lt;/strong&gt;: Sensor logs, inspection results, quality reports produced by factory systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAP / ERP&lt;/strong&gt;: Batch export files, operational reports, reconciliation extracts, and analytics copies (not a direct replacement for application-native persistence or HA design)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial services&lt;/strong&gt;: Reconciliation files, transaction logs, regulatory extracts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare research&lt;/strong&gt;: De-identified datasets, imaging metadata, study outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EDA / Semiconductor&lt;/strong&gt;: Design artifacts, simulation outputs, verification logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise file services&lt;/strong&gt;: Archives for compliance analysis, audit evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mission-critical workload note&lt;/strong&gt;&lt;br&gt;
This pattern provides an analytics read-access layer for existing file data. It does not replace workload-specific HA, backup, Snapshot, SnapMirror, or DR designs. For SAP, databases, VDI, and enterprise file services, treat Athena-on-FSx as an analytics and evidence layer, not as the primary resilience architecture.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Workload Isolation Guidance
&lt;/h2&gt;

&lt;p&gt;For mission-critical workloads, do not point exploratory analytics directly at the same directory used by latency-sensitive application writes unless the operational impact has been tested.&lt;/p&gt;

&lt;p&gt;Recommended pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application-owned path&lt;/strong&gt;: &lt;code&gt;/prod/app-output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics landing path&lt;/strong&gt;: &lt;code&gt;/analytics/curated/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Athena query result path&lt;/strong&gt;: Standard S3 bucket (conservative), or a separately validated output path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot / backup policy&lt;/strong&gt;: Owned by the workload team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue/Athena access&lt;/strong&gt;: Owned by the analytics platform team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For SAP, database exports, or ERP file drops, treat this pattern as a read-access analytics layer. Do not change application HA, backup, restore, or DR design just because the files are queryable through S3 APIs.&lt;/p&gt;

&lt;p&gt;In this context, an analytics copy means an application-produced or batch-exported file that is safe for downstream analytics, not the primary application persistence path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational Impact Validation
&lt;/h3&gt;

&lt;p&gt;Before production use, validate operational impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Baseline NFS/SMB workload latency and throughput before enabling analytics queries&lt;/li&gt;
&lt;li&gt;Athena query behavior during normal application write activity&lt;/li&gt;
&lt;li&gt;FSx provisioned throughput utilization during scans (analytics and application workloads share the same backend throughput)&lt;/li&gt;
&lt;li&gt;Query concurrency limits for the analytics team&lt;/li&gt;
&lt;li&gt;Rollback plan if analytics workload affects application workload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended metrics include FSx throughput utilization, client-side NFS/SMB latency, Athena query runtime, bytes scanned, and application-side error or timeout rates during query execution.&lt;/p&gt;

&lt;p&gt;Rollback plan examples include disabling the Athena workgroup, revoking the S3 Access Point policy for analytics roles, reducing analytics query concurrency, or moving analytics to an isolated curated path.&lt;/p&gt;
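
&lt;p&gt;One way to make the baseline comparison concrete is a small check like the one below. This is a minimal sketch, not part of the validated setup: the function name &lt;code&gt;latency_regression&lt;/code&gt;, the sample values, and the 20% threshold are all hypothetical, and the latency samples are assumed to come from your own client-side NFS/SMB measurements.&lt;/p&gt;

```python
from statistics import median


def latency_regression(baseline_ms, during_query_ms, max_increase=0.20):
    """Compare median client latency before and during Athena scans.

    Returns (relative_increase, exceeded). The 20% default threshold is an
    illustrative assumption, not an AWS or NetApp recommendation.
    """
    base = median(baseline_ms)
    if base == 0:
        raise ValueError("baseline median must be non-zero")
    increase = (median(during_query_ms) - base) / base
    return increase, increase > max_increase


# Hypothetical samples in milliseconds, collected on an NFS client.
baseline = [1.0, 1.0, 1.2]
during = [1.5, 1.4, 1.6]
increase, exceeded = latency_regression(baseline, during)
print(f"median latency change: {increase:+.0%}, rollback candidate: {exceeded}")
```

&lt;p&gt;A result flagged as a rollback candidate would then feed into whichever rollback option from the list above fits your environment.&lt;/p&gt;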

&lt;h3&gt;
  
  
  What This Means for Production
&lt;/h3&gt;

&lt;p&gt;For production, treat this as a shared-storage analytics access pattern. The value is eliminating the source data copy; the responsibility is validating workload isolation, throughput impact, governance, and rollback.&lt;/p&gt;

&lt;p&gt;This article is not a production certification. It is intended to start a production readiness discussion around workload isolation, governance, and rollback.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│  AWS Account                                                    │
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌────────────────┐   │
│  │ FSx for ONTAP│     │ S3 Access    │     │ Athena         │   │
│  │ Volume       │◄────│ Point        │◄────│ (Serverless)   │   │
│  │              │     │ (Internet    │     │                │   │
│  │ /analytics/  │     │  origin)     │     │ SELECT ...     │   │
│  └──────────────┘     └──────────────┘     │ FROM table     │   │
│        ▲                     ▲             └────────────────┘   │
│        │                     │                      │           │
│   NFS/SMB clients      Glue Crawler          Query results      │
│   (write data)         (schema discovery)    (→ S3 bucket)      │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The access point must use &lt;strong&gt;Internet network origin&lt;/strong&gt;. Athena accesses S3 from managed infrastructure outside your VPC, and the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;AWS tutorial&lt;/a&gt; requires Internet network origin for this path. VPC-origin access points deny requests from Athena.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Data Catalog&lt;/strong&gt; provides the schema layer between Athena and the S3 AP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query results&lt;/strong&gt; are written to an S3 bucket (the standard Athena pattern), not back to the FSx volume. See Observed Behavior for an experimental alternative.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;FSx for ONTAP file system (ONTAP 9.17.1+)&lt;/li&gt;
&lt;li&gt;A volume with data (Parquet, CSV, JSON, etc.)&lt;/li&gt;
&lt;li&gt;S3 Access Point created with &lt;strong&gt;Internet network origin&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;An Athena workgroup with a query results location (standard S3 bucket)&lt;/li&gt;
&lt;li&gt;IAM permissions for Athena, Glue, and S3 AP access&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Create the S3 Access Point
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fsx create-and-attach-s3-access-point &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-analytics-ap &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; ONTAP &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ontap-configuration&lt;/span&gt; &lt;span class="s1"&gt;'{
    "VolumeId": "&amp;lt;YOUR_VOLUME_ID&amp;gt;",
    "FileSystemIdentity": {
      "Type": "UNIX",
      "UnixUser": {"Name": "fsxn_athena_reader"}
    }
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;YOUR_REGION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the lifecycle to become &lt;code&gt;AVAILABLE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fsx describe-s3-access-point-attachments &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;volume-id,Values&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_VOLUME_ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;YOUR_REGION&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'S3AccessPointAttachments[].{Name:Name,Lifecycle:Lifecycle,Alias:S3AccessPoint.Alias}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-analytics-ap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Lifecycle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AVAILABLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Alias"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-analytics-ap-xxxxxxxxxxxxxxxxxxxxxxxxxxxx-ext-s3alias"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
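
&lt;p&gt;If you would rather script the wait than re-run the command by hand, a polling loop along these lines could work. This is a sketch under assumptions: &lt;code&gt;wait_until_available&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;describe&lt;/code&gt; is any callable you supply that returns the unfiltered &lt;code&gt;describe-s3-access-point-attachments&lt;/code&gt; response (that is, without the &lt;code&gt;--query&lt;/code&gt; projection used above). The example feeds it canned responses instead of calling AWS.&lt;/p&gt;

```python
import time


def wait_until_available(describe, name, timeout_s=600, interval_s=15, sleep=time.sleep):
    """Poll until the named attachment reports Lifecycle == "AVAILABLE".

    `describe` is a zero-argument callable returning a dict with an
    "S3AccessPointAttachments" list (an assumption of this sketch).
    """
    waited = 0
    while True:
        attachments = describe().get("S3AccessPointAttachments", [])
        match = next((a for a in attachments if a.get("Name") == name), None)
        if match and match.get("Lifecycle") == "AVAILABLE":
            return match
        if waited >= timeout_s:
            raise TimeoutError(f"{name} not AVAILABLE after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s


# Stand-in for a real call: reports CREATING once, then AVAILABLE.
responses = iter([
    {"S3AccessPointAttachments": [{"Name": "my-analytics-ap", "Lifecycle": "CREATING"}]},
    {"S3AccessPointAttachments": [{"Name": "my-analytics-ap", "Lifecycle": "AVAILABLE"}]},
])
result = wait_until_available(lambda: next(responses), "my-analytics-ap", sleep=lambda s: None)
print(result["Lifecycle"])
```

&lt;p&gt;Passing &lt;code&gt;describe&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; in as arguments keeps the loop testable without AWS credentials; in real use, &lt;code&gt;describe&lt;/code&gt; would wrap the CLI command or an SDK call.&lt;/p&gt;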



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: An alias ending in &lt;code&gt;-ext-s3alias&lt;/code&gt; identifies an FSx for ONTAP S3 Access Point. Aliases of regular S3 Access Points end in &lt;code&gt;-s3alias&lt;/code&gt; without the &lt;code&gt;-ext&lt;/code&gt; part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security note for file-system identity&lt;/strong&gt;&lt;br&gt;
This walkthrough uses a dedicated read-only identity (&lt;code&gt;fsxn_athena_reader&lt;/code&gt;). Make sure the corresponding UNIX/Windows permissions allow read access to the analytics path. Avoid using &lt;code&gt;root&lt;/code&gt; in production — scope the identity to the minimum permissions required.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 2: Set the Access Point Policy
&lt;/h2&gt;

&lt;p&gt;This walkthrough uses role-based principals for Athena and Glue. Replace the placeholder role ARNs with the IAM roles used by your Athena workgroup and Glue crawler. Avoid account-wide principals in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3control put-access-point-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-id&lt;/span&gt; &amp;lt;YOUR_ACCOUNT_ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-analytics-ap &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy&lt;/span&gt; &lt;span class="s1"&gt;'{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowAnalyticsRead",
      "Effect": "Allow",
      "Principal": {"AWS": [
        "arn:aws:iam::&amp;lt;YOUR_ACCOUNT_ID&amp;gt;:role/&amp;lt;ATHENA_QUERY_ROLE&amp;gt;",
        "arn:aws:iam::&amp;lt;YOUR_ACCOUNT_ID&amp;gt;:role/&amp;lt;GLUE_CRAWLER_ROLE&amp;gt;"
      ]},
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:&amp;lt;YOUR_REGION&amp;gt;:&amp;lt;YOUR_ACCOUNT_ID&amp;gt;:accesspoint/my-analytics-ap",
        "arn:aws:s3:&amp;lt;YOUR_REGION&amp;gt;:&amp;lt;YOUR_ACCOUNT_ID&amp;gt;:accesspoint/my-analytics-ap/object/*"
      ]
    }]
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;YOUR_REGION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The policy above is the conservative &lt;strong&gt;read-only analytics&lt;/strong&gt; policy. If you intentionally test query result output to the FSx S3 Access Point (see Observed Behavior), add &lt;code&gt;s3:PutObject&lt;/code&gt; scoped to the experimental output prefix only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowExperimentalResultWrite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::&amp;lt;YOUR_ACCOUNT_ID&amp;gt;:role/&amp;lt;ATHENA_QUERY_ROLE&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:&amp;lt;YOUR_REGION&amp;gt;:&amp;lt;YOUR_ACCOUNT_ID&amp;gt;:accesspoint/my-analytics-ap/object/athena-results/*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Security note&lt;/strong&gt;: FSx for ONTAP S3 Access Points enforce S3 Block Public Access by default — this cannot be disabled. All requests require valid IAM credentials. Additionally, the file system user associated with the access point must have read permission on the files being queried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy note&lt;/strong&gt;: The policy above is the minimum that worked in my validation. If your Glue crawler or Athena workgroup reports location-related access errors, compare the policy with the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;official tutorial&lt;/a&gt; and CloudTrail events, and add only the required actions.&lt;/p&gt;
&lt;/blockquote&gt;
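
&lt;p&gt;If you manage several access points, generating the policy document from parameters can help avoid copy-paste mistakes in the ARNs. The following is a minimal sketch that reproduces the two statements shown above; &lt;code&gt;build_access_point_policy&lt;/code&gt; is a hypothetical helper, and the region, account ID, and role names in the example are placeholders, not values from my environment.&lt;/p&gt;

```python
import json


def build_access_point_policy(region, account_id, ap_name, reader_role_arns,
                              writer_role_arn=None, result_prefix="athena-results/"):
    """Build the read-only policy, plus the optional experimental write statement."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    statements = [{
        "Sid": "AllowAnalyticsRead",
        "Effect": "Allow",
        "Principal": {"AWS": list(reader_role_arns)},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [ap_arn, f"{ap_arn}/object/*"],
    }]
    if writer_role_arn:
        # Only for the experimental result-output test; scoped to one prefix.
        statements.append({
            "Sid": "AllowExperimentalResultWrite",
            "Effect": "Allow",
            "Principal": {"AWS": writer_role_arn},
            "Action": "s3:PutObject",
            "Resource": f"{ap_arn}/object/{result_prefix}*",
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder values for illustration only.
policy = build_access_point_policy(
    region="us-east-1",
    account_id="111122223333",
    ap_name="my-analytics-ap",
    reader_role_arns=[
        "arn:aws:iam::111122223333:role/athena-query-role",
        "arn:aws:iam::111122223333:role/glue-crawler-role",
    ],
)
print(json.dumps(policy, indent=2))
```

&lt;p&gt;The printed JSON can be passed to the &lt;code&gt;--policy&lt;/code&gt; argument of the &lt;code&gt;put-access-point-policy&lt;/code&gt; command shown earlier. Leaving &lt;code&gt;writer_role_arn&lt;/code&gt; unset keeps the policy read-only.&lt;/p&gt;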




&lt;h2&gt;
  
  
  Step 3: Upload Test Data via NFS
&lt;/h2&gt;

&lt;p&gt;On a machine with NFS access to the FSx volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Generate 10,000 rows of sensor data
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;periods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensor_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensor_A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensor_B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensor_C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensor_D&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensor_E&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pressure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1013&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Write as Parquet to the NFS-mounted volume
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/fsxn/analytics/sensor-data/sensor_data.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wrote &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; KB in memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same file is now accessible via both NFS (&lt;code&gt;/mnt/fsxn/analytics/sensor-data/sensor_data.parquet&lt;/code&gt;) and the S3 API (&lt;code&gt;s3://&amp;lt;AP_ALIAS&amp;gt;/sensor-data/sensor_data.parquet&lt;/code&gt;).&lt;/p&gt;
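
&lt;p&gt;The relationship between the two paths can be sketched as a small helper. It assumes, as in this walkthrough, that the volume attached to the access point is mounted at &lt;code&gt;/mnt/fsxn/analytics&lt;/code&gt;, so the object key is the file path relative to that directory. The function name &lt;code&gt;nfs_path_to_s3_uri&lt;/code&gt; and the alias value are hypothetical.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def nfs_path_to_s3_uri(nfs_path, mount_root, ap_alias):
    """Translate a path under the NFS mount into the matching S3 URI.

    Raises ValueError if nfs_path is not located under mount_root.
    """
    key = PurePosixPath(nfs_path).relative_to(mount_root)
    return f"s3://{ap_alias}/{key.as_posix()}"


# Placeholder alias; substitute the alias returned in Step 1.
uri = nfs_path_to_s3_uri(
    "/mnt/fsxn/analytics/sensor-data/sensor_data.parquet",
    mount_root="/mnt/fsxn/analytics",
    ap_alias="my-analytics-ap-example-ext-s3alias",
)
print(uri)
```

&lt;p&gt;If your mount point sits at a different level of the volume, adjust &lt;code&gt;mount_root&lt;/code&gt; accordingly and confirm the resulting keys against a listing of the access point.&lt;/p&gt;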




&lt;h2&gt;
  
  
  Step 4: Verify S3 AP Access
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api list-objects-v2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$AP_ALIAS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prefix&lt;/span&gt; &lt;span class="s2"&gt;"sensor-data/"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;YOUR_REGION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Contents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sensor-data/sensor_data.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;252858&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FSX_ONTAP"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;StorageClass: FSX_ONTAP&lt;/code&gt; value: it indicates that the object is served from the FSx for ONTAP volume rather than stored in an S3 bucket.&lt;/p&gt;
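
&lt;p&gt;To run the same check in a script, you could inspect the listing response programmatically. The sketch below works on a plain dictionary shaped like the output above, so it runs without AWS access; &lt;code&gt;non_fsx_keys&lt;/code&gt; is a hypothetical helper, and in practice the dictionary might come from the CLI's JSON output or from an SDK &lt;code&gt;ListObjectsV2&lt;/code&gt; call with the access point alias as the bucket name.&lt;/p&gt;

```python
def non_fsx_keys(list_response, expected_class="FSX_ONTAP"):
    """Return the keys whose StorageClass differs from the expected value."""
    return [
        obj["Key"]
        for obj in list_response.get("Contents", [])
        if obj.get("StorageClass") != expected_class
    ]


# Response copied from the output shown above.
listing = {
    "Contents": [{
        "Key": "sensor-data/sensor_data.parquet",
        "Size": 252858,
        "StorageClass": "FSX_ONTAP",
    }]
}
unexpected = non_fsx_keys(listing)
print("all objects on FSx" if not unexpected else f"unexpected keys: {unexpected}")
```

&lt;p&gt;An empty result means every listed object reported the &lt;code&gt;FSX_ONTAP&lt;/code&gt; storage class.&lt;/p&gt;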




&lt;h2&gt;
  
  
  Step 5: Create Glue Database and Table
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws glue create-database &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--database-input&lt;/span&gt; &lt;span class="s1"&gt;'{"Name": "fsxn_analytics"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;YOUR_REGION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can either run a Glue Crawler for automatic schema discovery (recommended by the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;AWS tutorial&lt;/a&gt;), or create the table manually via Athena:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fsxn_analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;`timestamp`&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pressure&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://&amp;lt;AP_ALIAS&amp;gt;/sensor-data/'&lt;/span&gt;
&lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'parquet.compression'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'SNAPPY'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6: Query with Athena
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic aggregation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'critical'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;critical_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fsxn_analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensor_data&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;critical_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verified result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sensor_id | readings | avg_temp | avg_humidity | critical_count
----------|----------|----------|--------------|---------------
sensor_A  |    2027  |   24.89  |    59.84     |      68
sensor_B  |    1986  |   25.11  |    60.23     |      62
sensor_C  |    2013  |   24.95  |    59.91     |      59
sensor_E  |    2000  |   24.98  |    60.02     |      56
sensor_D  |    1974  |   25.03  |    60.15     |      55
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Query time: 1.46 seconds&lt;/strong&gt; | &lt;strong&gt;Data scanned: 67 KB&lt;/strong&gt; | &lt;strong&gt;Engine: Athena v3&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observed Behavior: Query Results Written to the FSx S3 Access Point
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;AWS tutorial&lt;/a&gt; states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Athena reads data from your FSx for ONTAP volume through the access point. Athena query results are written to the Amazon S3 results bucket, not back to the FSx for ONTAP volume."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my validation, however, setting &lt;code&gt;OutputLocation&lt;/code&gt; to the FSx for ONTAP S3 Access Point alias succeeded and wrote the &lt;code&gt;.csv&lt;/code&gt; and &lt;code&gt;.metadata&lt;/code&gt; files back to the FSx volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws athena start-query-execution &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query-string&lt;/span&gt; &lt;span class="s2"&gt;"SELECT 1 AS test"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--result-configuration&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"OutputLocation=s3://&amp;lt;AP_ALIAS&amp;gt;/athena-results/"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--work-group&lt;/span&gt; primary &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;YOUR_REGION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: SUCCEEDED in 584ms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The result files appeared on the FSx volume and were immediately accessible via NFS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat this as observed behavior from my environment, not a general production recommendation.&lt;/strong&gt; The conservative production pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source data&lt;/strong&gt;: FSx for ONTAP S3 Access Point&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Athena query results&lt;/strong&gt;: Standard S3 bucket (as documented)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experimental pattern validated in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source data&lt;/strong&gt;: FSx for ONTAP S3 Access Point&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Athena query results&lt;/strong&gt;: FSx for ONTAP S3 Access Point (observed to work, not documented)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Validate this in your own environment before relying on it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Governance warning&lt;/strong&gt;: Do not enable experimental query result output to FSx S3 AP for sensitive datasets unless query result retention, encryption, audit evidence, and file-system permissions are reviewed. Query results may contain derived sensitive information. For sensitive datasets, experimental result output should require approval from the data owner, security owner, and workload owner.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Performance Characteristics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Observed&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple SELECT query&lt;/td&gt;
&lt;td&gt;584 ms&lt;/td&gt;
&lt;td&gt;Includes result write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation (10K rows, 67 KB scanned)&lt;/td&gt;
&lt;td&gt;1.46 s&lt;/td&gt;
&lt;td&gt;GROUP BY with 4 aggregate expressions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data scan cost&lt;/td&gt;
&lt;td&gt;Standard Athena pricing&lt;/td&gt;
&lt;td&gt;$5 per TB scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage class&lt;/td&gt;
&lt;td&gt;FSX_ONTAP&lt;/td&gt;
&lt;td&gt;Confirmed in &lt;code&gt;list-objects-v2&lt;/code&gt; output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Performance note&lt;/strong&gt;&lt;br&gt;
These numbers validate functional compatibility, not performance limits. The dataset is intentionally small: a single Parquet file of about 250 KB with 10K rows, of which the aggregation query scanned 67 KB. For real analytics workloads, test with realistic file sizes, object counts, partition layouts, concurrent queries, and FSx provisioned throughput. The throughput available through the S3 API depends on the FSx file system's provisioned throughput capacity (&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-access-points.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
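
&lt;p&gt;For cost planning, the scan charge per query can be estimated from the scanned bytes. This is a rough sketch under stated assumptions, not an official pricing calculator: it uses the $5 per TB rate from the table above, and it assumes that Athena rounds scanned data up to the next megabyte with a 10 MB minimum per query and that units are binary. Check the current Athena pricing page for your Region before relying on the numbers.&lt;/p&gt;

```python
import math

PRICE_PER_TB_USD = 5.0      # rate quoted in the table above
BYTES_PER_MB = 1024 ** 2    # assumption: binary megabytes
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabytes
MIN_BILLED_MB = 10          # assumption: 10 MB minimum billed per query


def estimate_scan_cost_usd(bytes_scanned):
    """Estimate the scan charge in USD for a single query."""
    # Round up to whole megabytes, then apply the per-query minimum.
    billed_mb = max(MIN_BILLED_MB, math.ceil(bytes_scanned / BYTES_PER_MB))
    return billed_mb * BYTES_PER_MB / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    # The 67 KB scan from this post falls under the assumed 10 MB minimum.
    print(f"{estimate_scan_cost_usd(67 * 1024):.8f} USD")
```

&lt;p&gt;Under these assumptions the 67 KB query is billed as 10 MB, or roughly 0.00005 USD, so at this dataset size the per-query minimum dominates rather than the data volume.&lt;/p&gt;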




&lt;h2&gt;
  
  
  S3 API Compatibility Boundary
&lt;/h2&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points expose file data through S3 object APIs, but they should not be treated as standard S3 buckets.&lt;/p&gt;

&lt;p&gt;The safe mental model is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use S3 APIs for object read/write access to files on FSx&lt;/li&gt;
&lt;li&gt;Use Glue and Athena for read-oriented analytics&lt;/li&gt;
&lt;li&gt;Do not assume S3 bucket-level features exist (event notifications, versioning, lifecycle policies)&lt;/li&gt;
&lt;li&gt;Do not assume lakehouse commit semantics (rename, conditional writes)&lt;/li&gt;
&lt;li&gt;Validate every platform integration separately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, the verified pattern is &lt;strong&gt;read-oriented analytics&lt;/strong&gt; over Parquet/CSV/JSON files. Transactional table formats and commit protocols are outside the safe default boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compatibility Matrix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Legend for the "Validated by" column:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;This validation&lt;/strong&gt;: Actually executed commands or queries in this environment and confirmed the result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supported operations review&lt;/strong&gt;: Confirmed based on the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-supported-operations.html" rel="noopener noreferrer"&gt;supported operations documentation&lt;/a&gt; or official tutorial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supported operations review required&lt;/strong&gt;: Not yet confirmed; additional validation needed before use&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Validated by&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ListObjectsV2&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;This validation&lt;/td&gt;
&lt;td&gt;S3 AP alias worked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GetObject (Parquet scan)&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;This validation&lt;/td&gt;
&lt;td&gt;Athena v3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PutObject (small result file)&lt;/td&gt;
&lt;td&gt;⚠️ Observed&lt;/td&gt;
&lt;td&gt;This validation&lt;/td&gt;
&lt;td&gt;Not documented as Athena result pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue table over S3 AP&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;This validation&lt;/td&gt;
&lt;td&gt;Manual DDL and Crawler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTAS to S3 AP&lt;/td&gt;
&lt;td&gt;❌ Failed in validation&lt;/td&gt;
&lt;td&gt;This validation&lt;/td&gt;
&lt;td&gt;Not part of the documented tutorial pattern; use standard S3 output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta Lake writes&lt;/td&gt;
&lt;td&gt;❌ Not recommended&lt;/td&gt;
&lt;td&gt;Supported operations review&lt;/td&gt;
&lt;td&gt;Commit protocol depends on rename/atomic semantics that are not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hudi/Iceberg writes&lt;/td&gt;
&lt;td&gt;❌ Not recommended&lt;/td&gt;
&lt;td&gt;Supported operations review&lt;/td&gt;
&lt;td&gt;Requires commit semantics beyond simple object read/write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 bucket event notifications&lt;/td&gt;
&lt;td&gt;❌ Not part of verified pattern&lt;/td&gt;
&lt;td&gt;Supported operations review required&lt;/td&gt;
&lt;td&gt;Do not assume bucket-level eventing; validate against &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-supported-operations.html" rel="noopener noreferrer"&gt;supported operations&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CTAS is a write-path pattern, not just a read query: it writes new table data to a target S3 location and may leave partial or orphaned files on failure. Validate it separately from read-oriented SELECT queries and keep it out of the initial read-oriented validation scope.&lt;/p&gt;

&lt;p&gt;Transactional lakehouse formats may require semantics beyond simple object read/write, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomic commit behavior&lt;/li&gt;
&lt;li&gt;Rename or move-like commit operations&lt;/li&gt;
&lt;li&gt;Conditional writes (If-None-Match)&lt;/li&gt;
&lt;li&gt;Manifest consistency&lt;/li&gt;
&lt;li&gt;Concurrent writer coordination&lt;/li&gt;
&lt;li&gt;Cleanup of partial/orphaned files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article does not validate those semantics. It validates read-oriented analytics over existing files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Governance and Compliance Considerations
&lt;/h2&gt;

&lt;p&gt;This pattern keeps the source files on FSx for ONTAP, but it does not remove the need for data governance.&lt;/p&gt;

&lt;p&gt;Before using this pattern with regulated or sensitive datasets, review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data classification&lt;/strong&gt; of source files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM and S3 Access Point policy&lt;/strong&gt; scope (least privilege)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File system identity&lt;/strong&gt; mapped to the access point (UNIX/Windows user permissions apply)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Data Catalog permissions&lt;/strong&gt; (who can see the table metadata)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Athena workgroup controls&lt;/strong&gt; (query limits, result encryption)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query result location and retention&lt;/strong&gt; (results may contain derived sensitive data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail / audit evidence&lt;/strong&gt; requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot, backup, retention, and deletion&lt;/strong&gt; policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Query results can be more sensitive than the original dataset because they may aggregate, filter, or derive new information. Apply encryption, retention, and access controls to the Athena result location as carefully as the source dataset.&lt;/p&gt;

&lt;p&gt;This article is a technical validation, not a compliance attestation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Controls Checklist
&lt;/h2&gt;

&lt;p&gt;For regulated or sensitive datasets, define the following before production use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Athena workgroup result location (standard S3 bucket)&lt;/li&gt;
&lt;li&gt;[ ] Whether workgroup settings override client-side result settings&lt;/li&gt;
&lt;li&gt;[ ] Query result encryption mode and KMS key ownership&lt;/li&gt;
&lt;li&gt;[ ] Query result retention and deletion policy&lt;/li&gt;
&lt;li&gt;[ ] IAM principals allowed to query the Glue table&lt;/li&gt;
&lt;li&gt;[ ] File-system identity mapped to the S3 Access Point (dedicated, not root)&lt;/li&gt;
&lt;li&gt;[ ] Audit evidence approach defined and validated (e.g., CloudTrail coverage for the S3 Access Point where applicable, with sample events captured as PoC evidence)&lt;/li&gt;
&lt;li&gt;[ ] Approval process for enabling experimental result output to FSx S3 AP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For regulated workloads, consider enabling the Athena workgroup option that overrides client-side settings, so that individual clients cannot change where query results are written or how they are encrypted.&lt;/p&gt;

&lt;p&gt;For regulated workloads, experimental writeback should be disabled by default and enabled only after explicit approval from the data owner, security owner, and workload owner.&lt;/p&gt;

&lt;p&gt;Experimental writeback may be enabled only when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approval scope is documented&lt;/li&gt;
&lt;li&gt;Output path is isolated from source data&lt;/li&gt;
&lt;li&gt;Encryption and retention are defined for the output path&lt;/li&gt;
&lt;li&gt;Cleanup and rollback procedures are documented&lt;/li&gt;
&lt;li&gt;Review expiration date is set&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Minimum audit evidence artifacts for PoC completion:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scope statement: what the audit evidence demonstrates and what it does not (e.g., "validates access path and query result control for PoC scope; does not demonstrate full production compliance")&lt;/li&gt;
&lt;li&gt;Access path description (IAM → AP policy → file-system identity)&lt;/li&gt;
&lt;li&gt;Sample successful read event&lt;/li&gt;
&lt;li&gt;Sample denied access event (if applicable)&lt;/li&gt;
&lt;li&gt;Query result location configuration&lt;/li&gt;
&lt;li&gt;Encryption configuration&lt;/li&gt;
&lt;li&gt;Workgroup override setting (if used)&lt;/li&gt;
&lt;li&gt;Reviewer sign-off (name, role, date, decision)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  30-Minute Validation Flow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create or verify the FSx S3 Access Point (&lt;code&gt;AVAILABLE&lt;/code&gt; lifecycle)&lt;/li&gt;
&lt;li&gt;Write one Parquet file through NFS to the analytics path&lt;/li&gt;
&lt;li&gt;Confirm &lt;code&gt;StorageClass: FSX_ONTAP&lt;/code&gt; with &lt;code&gt;list-objects-v2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create the Glue table (manual DDL or crawler)&lt;/li&gt;
&lt;li&gt;Run one Athena query&lt;/li&gt;
&lt;li&gt;Capture the validation artifacts (see below)&lt;/li&gt;
&lt;li&gt;Decide Go / No-Go using the PoC Success Criteria&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  First Success Path
&lt;/h2&gt;

&lt;p&gt;If you are validating this for the first time, keep the scope small.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One Parquet file written through NFS is visible through the S3 Access Point&lt;/li&gt;
&lt;li&gt;Glue table creation or crawler schema discovery succeeds&lt;/li&gt;
&lt;li&gt;Athena can query the file in place&lt;/li&gt;
&lt;li&gt;Query result location behavior is validated and documented&lt;/li&gt;
&lt;li&gt;NFS/SMB clients can still access the original file&lt;/li&gt;
&lt;li&gt;IAM and file-system identity boundaries are understood&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not start with Delta Lake, Hudi, Iceberg writes, large scans, or concurrent workloads. Prove the read path first.&lt;/p&gt;




&lt;h2&gt;
  
  
  PoC Success Criteria
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Minimum success:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 Access Point attachment is &lt;code&gt;AVAILABLE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ListObjectsV2&lt;/code&gt; returns the expected test file&lt;/li&gt;
&lt;li&gt;Glue table points to the S3 AP alias&lt;/li&gt;
&lt;li&gt;Athena query succeeds and returns correct results&lt;/li&gt;
&lt;li&gt;Results are reproducible from a clean workgroup/session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational success:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM role and S3 AP policy are scoped to the analytics roles&lt;/li&gt;
&lt;li&gt;Athena workgroup controls are defined&lt;/li&gt;
&lt;li&gt;Query result location and retention are documented&lt;/li&gt;
&lt;li&gt;Dataset size and scan cost are measured&lt;/li&gt;
&lt;li&gt;FSx throughput impact is measured during query&lt;/li&gt;
&lt;li&gt;Existing NFS/SMB application workload impact is measured during Athena queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Go / No-Go criteria:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt;: Read-only analytics on Parquet/CSV/JSON works with acceptable latency and cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-Go&lt;/strong&gt;: Workload requires Delta/Hudi/Iceberg write commits through the S3 AP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-Go&lt;/strong&gt;: Platform governance requires Unity Catalog external locations and the platform cannot yet authorize the S3 AP (see Part 2)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Test Plan
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This section defines the performance test plan and metrics to collect. It does not present benchmark results. Actual benchmark outputs will be added under &lt;code&gt;verification-pack/&lt;/code&gt; after validation runs are completed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next validation should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 GB / 10 GB / 100 GB datasets&lt;/li&gt;
&lt;li&gt;Many small files vs fewer large Parquet files&lt;/li&gt;
&lt;li&gt;Partitioned layout (&lt;code&gt;date=YYYY-MM-DD/sensor_id=...&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Concurrent Athena queries&lt;/li&gt;
&lt;li&gt;Different FSx throughput capacity settings (128 / 256 / 512+ MBps)&lt;/li&gt;
&lt;li&gt;NFS writer activity during Athena scans&lt;/li&gt;
&lt;li&gt;Standard S3 result bucket vs observed FSx S3 AP result output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to separate Athena scan behavior, Glue metadata behavior, and FSx provisioned-throughput impact.&lt;/p&gt;

&lt;p&gt;Additional request pattern considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential vs parallel S3 API reads&lt;/li&gt;
&lt;li&gt;Prefix layout impact on listing performance&lt;/li&gt;
&lt;li&gt;Small object listing overhead&lt;/li&gt;
&lt;li&gt;Repeated query behavior with warm Glue/Athena metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Metrics collection sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FSx metrics: CloudWatch (FSx namespace)&lt;/li&gt;
&lt;li&gt;Athena query metrics: &lt;code&gt;get-query-execution&lt;/code&gt; API (EngineExecutionTimeInMillis, DataScannedInBytes)&lt;/li&gt;
&lt;li&gt;Client-side latency: CLI timing or SDK instrumentation&lt;/li&gt;
&lt;li&gt;Error/timeout sources: Athena query execution status and failure reason, client-side logs, application-side timeout logs, CloudTrail events where applicable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Record results separately for cold runs (at least one), warm-metadata runs (at least one), and repeated runs (at least three executions). For each category, report the average, minimum, maximum, and any notable outliers.&lt;/p&gt;
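
&lt;p&gt;The per-category summary can be produced with a few lines of Python. This is a minimal sketch, assuming you have already collected durations in milliseconds (for example from &lt;code&gt;EngineExecutionTimeInMillis&lt;/code&gt;). The function name &lt;code&gt;summarize_runs&lt;/code&gt; and the sample values are hypothetical placeholders, not benchmark results.&lt;/p&gt;

```python
from statistics import mean


def summarize_runs(durations_ms):
    """Return run count, average, min, and max for one run category."""
    if not durations_ms:
        raise ValueError("at least one measurement is required")
    return {
        "runs": len(durations_ms),
        "avg_ms": round(mean(durations_ms), 1),
        "min_ms": min(durations_ms),
        "max_ms": max(durations_ms),
    }


if __name__ == "__main__":
    # Placeholder timings only; replace with measured values per category.
    samples = {
        "cold": [1000],
        "warm_metadata": [800],
        "repeated": [700, 750, 900],
    }
    for category, durations in samples.items():
        print(category, summarize_runs(durations))
```

&lt;p&gt;Keeping the categories separate avoids averaging a cold run into warm results, which would hide the effect the test plan is trying to isolate.&lt;/p&gt;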




&lt;h2&gt;
  
  
  Validation Artifacts
&lt;/h2&gt;

&lt;p&gt;For reproducibility, capture the following artifacts in your PoC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 Access Point attachment lifecycle output (&lt;code&gt;describe-s3-access-point-attachments&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;list-objects-v2&lt;/code&gt; output showing &lt;code&gt;StorageClass: FSX_ONTAP&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Glue table DDL or crawler output&lt;/li&gt;
&lt;li&gt;Athena query execution ID&lt;/li&gt;
&lt;li&gt;Athena query runtime and scanned bytes&lt;/li&gt;
&lt;li&gt;Query result location and file listing&lt;/li&gt;
&lt;li&gt;NFS listing showing the original source file is unchanged&lt;/li&gt;
&lt;li&gt;IAM policy and access point policy used for the test&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, I'll cover what happens when you try to connect &lt;strong&gt;Databricks&lt;/strong&gt; to FSx for ONTAP S3 Access Points — where Unity Catalog's session policy, seccomp filters, and platform security boundaries create a significantly more complex picture.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-fsx-netapp-ontap-s3-access/" rel="noopener noreferrer"&gt;AWS What's New: Amazon FSx for NetApp ONTAP now supports Amazon S3 access (Dec 2, 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;AWS Tutorial: Query files with Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-access-points.html" rel="noopener noreferrer"&gt;FSx for ONTAP S3 Access Points documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-supported-operations.html" rel="noopener noreferrer"&gt;Supported S3 operations for FSx S3 AP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-lakehouse-integrations" rel="noopener noreferrer"&gt;GitHub: fsxn-lakehouse-integrations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article is part of the "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive" series. All tests were performed on a real AWS environment with FSx for ONTAP (ONTAP 9.17.1, ap-northeast-1) in May 2026.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scope reminder&lt;/strong&gt;: This article verifies a limited read-oriented scenario. It does not validate production readiness, write-path behavior, distributed executor-scale processing, or all third-party analytics engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Article update plan&lt;/strong&gt;: v1.0 (current) — Scope, observed behavior, validation plan. Future updates: v1.1 — Benchmark results with realistic datasets. v1.2 — Security Verified candidate review. v1.3 — Production workload isolation test results.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>amazonfsxfornetappontap</category>
      <category>athena</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Thu, 21 May 2026 09:12:09 +0000</pubDate>
      <link>https://dev.to/aws-builders/direct-to-grafana-shipping-fsx-for-ontap-logs-to-grafana-cloud-loki-via-otlp-gateway-33hk</link>
      <guid>https://dev.to/aws-builders/direct-to-grafana-shipping-fsx-for-ontap-logs-to-grafana-cloud-loki-via-otlp-gateway-33hk</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We built a direct Lambda-to-Grafana Cloud pipeline that ships FSx for ONTAP audit logs to Loki without an intermediate OTel Collector. Three Lambda functions cover all event sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP audit logs&lt;/strong&gt; → EventBridge Scheduler (every 5 min) → Lambda (polls &amp;amp; reads via S3 Access Point) → OTLP Gateway → Loki&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EMS webhooks&lt;/strong&gt; (ransomware alerts, quota warnings) → API Gateway → Lambda → OTLP Gateway → Loki&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy file operations&lt;/strong&gt; (real-time CIFS/SMB events) → ECS Fargate → SQS → Bridge Lambda → EventBridge → Lambda → OTLP Gateway → Loki&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is CloudFormation-templated, fully parameterized with no hardcoded values, and deployable with a single script. This is a Grafana-specific direct integration by design; use the Collector path from Part 5 when you need backend portability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you only want to validate the path quickly, jump to First Success Path and deploy the audit poller first.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;This is the single-backend counterpart to Part 5: simpler when Grafana Cloud is the chosen destination, less flexible when backend portability, enrichment, redaction, or multi-backend routing is required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Direct Send (Without OTel Collector)?
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap"&gt;Part 5&lt;/a&gt;, we showed how the OTel Collector decouples Lambda from backends. That's the right choice when you need multi-backend delivery or vendor migration flexibility.&lt;/p&gt;

&lt;p&gt;But if Grafana Cloud is your single observability platform and your goal is a simple serverless path, direct OTLP can be a good starting point. For production pipelines that need richer buffering, metadata enrichment, redaction, or routing, &lt;a href="https://grafana.com/docs/grafana-cloud/send-data/otlp/send-data-otlp/" rel="noopener noreferrer"&gt;Grafana recommends an Alloy / Collector-based architecture&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Components&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OTel Collector&lt;/td&gt;
&lt;td&gt;Lambda → Collector (ECS/EC2) → Grafana&lt;/td&gt;
&lt;td&gt;+50-100ms&lt;/td&gt;
&lt;td&gt;Collector compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct send&lt;/td&gt;
&lt;td&gt;Lambda → Grafana OTLP Gateway&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Lambda only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The direct path is simpler, cheaper, and has fewer failure points. You can always graduate to the Collector path later (Part 5 shows how). Direct send is a good fit when operational simplicity is more important than in-pipeline enrichment, redaction, buffering, and multi-backend routing. If those requirements become mandatory, move the same OTLP payload model behind Alloy or the OpenTelemetry Collector.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Direct send reduces moving parts, but it also removes the Collector / Alloy queueing layer. For production, decide whether Lambda retry and DLQ are sufficient, or whether you need SQS buffering, DLQ replay, or the Collector / Alloy path for stronger delivery guarantees during endpoint outages or throttling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delivery guarantee decision&lt;/strong&gt; (see &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/delivery-guarantees.md" rel="noopener noreferrer"&gt;full pattern guide&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Quickstart (this template)&lt;/em&gt;: Scheduler retry + Scheduler DLQ + Lambda reserved concurrency + checkpoint retry&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Medium volume&lt;/em&gt;: add Lambda failure destination and operational replay procedures&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Higher reliability&lt;/em&gt;: insert SQS before shipping, or place Alloy / OTel Collector downstream of Lambda for batching, retry with a persistent queue, transformation, redaction, and multi-backend routing&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Multi-backend or redaction/routing&lt;/em&gt;: use Part 5 Collector path&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│ Event Sources                                        │
├─────────────────────────────────────────────────────┤
│                                                      │
│  EventBridge Scheduler                               │
│  rate(5 minutes) ──→ Lambda                          │
│                       │ lists new files via           │
│                       │ S3 Access Point              │
│                       │ (checkpoint in SSM)          │
│                       ▼                              │
│                OTLP Gateway                          │
│                (Grafana Cloud)                        │
│                       │                              │
│  EMS Webhook          │                              │
│  ──→ API GW ──→ Lambda ────────────┤                │
│     (ems_handler)                   │                │
│                                     ▼                │
│  FPolicy                           Loki             │
│  ──→ ECS Fargate ──→ SQS          (Explore,        │
│  ──→ Bridge Lambda                  Dashboard)      │
│  ──→ EventBridge                                    │
│  ──→ Lambda (fpolicy_handler) ─────────────────────┤
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The audit log path uses a &lt;strong&gt;polling pattern&lt;/strong&gt;: EventBridge Scheduler invokes Lambda every 5 minutes. Lambda lists new objects via the S3 Access Point, reads and processes them, then updates an SSM Parameter Store checkpoint to track progress. This avoids reliance on S3 Event Notifications, which are not supported by FSx for ONTAP S3 Access Points.&lt;/p&gt;

&lt;p&gt;The same S3 Access Point boundary can be reused for other automation patterns (AI/ML, analytics, compliance archival) because the audit files remain on FSx for ONTAP while Lambda reads them through standard S3 object APIs — no data copy or NFS/SMB mount required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This pattern does not replace ONTAP audit, EMS, or FPolicy configuration; it provides an AWS-native delivery and visualization layer for those ONTAP-native signals.&lt;/p&gt;

&lt;p&gt;For business-critical workloads such as SAP, databases, VDI, or enterprise file services, treat this pipeline as an observability and evidence layer. It complements, but does not replace, workload-specific HA, backup, restore, and DR designs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Use cases this unlocks&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Investigate file access activity for FSx for ONTAP-hosted enterprise file shares&lt;/li&gt;
&lt;li&gt;Monitor available ONTAP EMS alerts, such as ransomware-related events, quota warnings, and storage/system events&lt;/li&gt;
&lt;li&gt;Correlate audit logs, EMS, and FPolicy file operations in a single Grafana dashboard&lt;/li&gt;
&lt;li&gt;Provide a lightweight observability path for SAP, database, VDI, and file service workloads using FSx for ONTAP&lt;/li&gt;
&lt;li&gt;Start with direct OTLP delivery and graduate to Alloy / Collector when governance or multi-backend routing is required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The FPolicy path has two Lambda roles: a &lt;strong&gt;bridge Lambda&lt;/strong&gt; that converts ECS/FPolicy server SQS output into EventBridge events, and &lt;strong&gt;&lt;code&gt;fpolicy_handler.py&lt;/code&gt;&lt;/strong&gt;, which ships those normalized EventBridge events to Grafana Cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Discovery: OTLP Gateway, Not Loki Push API
&lt;/h2&gt;

&lt;p&gt;During E2E verification, the Loki Push API returned HTTP 530 in my trial account. The OTLP Gateway worked reliably in this project and is the &lt;a href="https://grafana.com/docs/grafana-cloud/send-data/otlp/send-data-otlp/" rel="noopener noreferrer"&gt;recommended Grafana Cloud OTLP ingestion path&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For logs, Grafana Cloud routes OTLP log data to Loki, where it becomes queryable with LogQL.&lt;/p&gt;

&lt;p&gt;Our Lambda auto-detects the endpoint mode from the URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_otlp_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Detect Grafana OTLP Gateway or generic OTLP/HTTP logs endpoint.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;otlp-gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/otlp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/otlp/v1/logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1/logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;USE_OTLP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_is_otlp_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOKI_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the OTLP Gateway, configure &lt;code&gt;LOKI_ENDPOINT&lt;/code&gt; as the base OTLP endpoint ending in &lt;code&gt;/otlp&lt;/code&gt;. The Lambda appends &lt;code&gt;/v1/logs&lt;/code&gt; when sending logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configure as base endpoint (Lambda appends /v1/logs)&lt;/span&gt;
&lt;span class="nv"&gt;LOKI_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
&lt;span class="c"&gt;# Lambda POSTs to: https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp/v1/logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The handler also accepts the full path (&lt;code&gt;/otlp/v1/logs&lt;/code&gt;) without double-appending.&lt;/p&gt;
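
&lt;p&gt;As a minimal sketch of that behavior (the function name &lt;code&gt;build_logs_url&lt;/code&gt; is hypothetical and not taken from the repository), the URL handling could look roughly like this:&lt;/p&gt;

```python
def build_logs_url(endpoint):
    """Return the OTLP/HTTP logs URL for a base or full endpoint (illustrative)."""
    base = endpoint.rstrip("/")
    if base.endswith("/v1/logs"):
        return base  # Full path already configured; do not append again
    return base + "/v1/logs"


# Both configurations resolve to the same POST target.
base_form = build_logs_url("https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp")
full_form = build_logs_url("https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp/v1/logs")
print(base_form == full_form)
```

&lt;p&gt;Stripping the trailing slash first keeps a configured value such as &lt;code&gt;.../otlp/&lt;/code&gt; from producing a double slash in the final URL.&lt;/p&gt;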

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;URL Pattern&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OTLP Gateway (preferred)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://otlp-gateway-prod-&amp;lt;region&amp;gt;.grafana.net/otlp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Recommended by Grafana Cloud docs; verified in this project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loki Push API (fallback)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://logs-prod-&amp;lt;region&amp;gt;.grafana.net/loki/api/v1/push&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;⚠️ May behave differently by account state; returned 530 in my trial validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted Loki OTLP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://&amp;lt;loki-host&amp;gt;/otlp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requires Loki OTLP ingestion support and structured metadata configuration; Loki 3.0+ enables structured metadata by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Authentication: Basic Auth with base64 Encoding
&lt;/h2&gt;

&lt;p&gt;Grafana Cloud uses Basic Auth for both endpoints. The critical detail: the value is &lt;code&gt;base64(instanceId:apiToken)&lt;/code&gt;, not plain text concatenation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;b64encode&lt;/span&gt;

&lt;span class="n"&gt;instance_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;123456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# From Grafana Cloud console
&lt;/span&gt;&lt;span class="n"&gt;api_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glc_...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# logs:write scope
&lt;/span&gt;
&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;auth_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Credentials are stored in AWS Secrets Manager as JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"instance_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;token&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda reads this at cold start and caches the auth header for subsequent invocations. For production, use the shared &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/shared/python/auth_cache.py" rel="noopener noreferrer"&gt;&lt;code&gt;auth_cache.py&lt;/code&gt;&lt;/a&gt; module, which provides TTL-based caching with automatic reload-on-401/403, so credential rotation does not require waiting for a new Lambda execution environment.&lt;/p&gt;
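
&lt;p&gt;The idea behind TTL caching with reload-on-401/403 can be sketched as follows. This is not the repository's &lt;code&gt;auth_cache.py&lt;/code&gt;; the class &lt;code&gt;AuthHeaderCache&lt;/code&gt;, the helper &lt;code&gt;send_with_reload&lt;/code&gt;, and the &lt;code&gt;load_secret&lt;/code&gt; and &lt;code&gt;post&lt;/code&gt; callables are hypothetical names used only for illustration. In a Lambda, &lt;code&gt;load_secret&lt;/code&gt; would typically wrap a Secrets Manager &lt;code&gt;get_secret_value&lt;/code&gt; call and return the secret JSON string shown above.&lt;/p&gt;

```python
import base64
import json
import time


class AuthHeaderCache:
    """Caches a Basic Auth header; reloads after a TTL or on demand (illustrative)."""

    def __init__(self, load_secret, ttl_seconds=300.0, clock=time.monotonic):
        self._load_secret = load_secret  # Callable returning the secret JSON string
        self._ttl = ttl_seconds
        self._clock = clock
        self._header = None
        self._loaded_at = 0.0

    def get(self, force_reload=False):
        expired = self._header is None or self._clock() - self._loaded_at >= self._ttl
        if force_reload or expired:
            secret = json.loads(self._load_secret())
            raw = f"{secret['instance_id']}:{secret['api_key']}".encode()
            self._header = "Basic " + base64.b64encode(raw).decode()
            self._loaded_at = self._clock()
        return self._header


def send_with_reload(cache, post):
    """Call post(auth_header); on 401/403 reload credentials and retry once."""
    status = post(cache.get())
    if status in (401, 403):
        status = post(cache.get(force_reload=True))
    return status
```

&lt;p&gt;Retrying only once after a forced reload keeps a genuinely revoked token from causing a retry loop, while still picking up a rotated secret inside a warm execution environment.&lt;/p&gt;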

&lt;p&gt;Internally, normalized records are now converted directly to OTLP as the primary path. Loki Push formatting is kept only as a fallback mode. This aligns with Part 5's "OTLP as producer contract" principle. For the full OTLP resource/log-record/body mapping and &lt;code&gt;fsxn.*&lt;/code&gt; attribute naming policy, see the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/operations.md#otlp-mapping" rel="noopener noreferrer"&gt;Grafana Operations Guide&lt;/a&gt;.&lt;/p&gt;
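
&lt;p&gt;For orientation, here is a rough sketch of what wrapping normalized records in an OTLP/HTTP JSON logs payload can look like. The envelope (&lt;code&gt;resourceLogs&lt;/code&gt;, &lt;code&gt;scopeLogs&lt;/code&gt;, &lt;code&gt;logRecords&lt;/code&gt;) follows the OTLP JSON encoding, but the function name &lt;code&gt;to_otlp_logs&lt;/code&gt;, the input record shape, and the example attribute key are assumptions for illustration; the actual mapping is defined in the Operations Guide linked above.&lt;/p&gt;

```python
import json
import time


def to_otlp_logs(records, service_name):
    """Wrap normalized records in an OTLP/HTTP JSON logs payload (illustrative)."""
    log_records = []
    for record in records:
        log_records.append({
            # 64-bit integers are carried as strings in OTLP JSON
            "timeUnixNano": str(record.get("time_unix_nano", time.time_ns())),
            "severityText": record.get("severity", "INFO"),
            "body": {"stringValue": json.dumps(record["event"], sort_keys=True)},
            "attributes": [
                {"key": key, "value": {"stringValue": str(value)}}
                for key, value in sorted(record.get("attributes", {}).items())
            ],
        })
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{"logRecords": log_records}],
        }]
    }


# Hypothetical record; "fsxn.svm" is a placeholder attribute name.
example = to_otlp_logs(
    [{"event": {"Operation": "create"}, "attributes": {"fsxn.svm": "svm1"}}],
    "fsxn-audit",
)
print(json.dumps(example, indent=2))
```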

&lt;h2&gt;
  
  
  The Three Lambda Handlers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. FSx Audit Log Handler via S3 Access Point (&lt;code&gt;handler.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Polls for new FSx ONTAP audit log files via S3 Access Point, parses JSON/EVTX, and ships to Grafana Cloud. Uses SSM Parameter Store to checkpoint progress between invocations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;auth_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_auth_header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Cached from Secrets Manager
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Polling mode: list new files, process, update checkpoint
&lt;/span&gt;        &lt;span class="n"&gt;last_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_checkpoint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# SSM Parameter Store
&lt;/span&gt;        &lt;span class="n"&gt;new_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list_new_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S3_ACCESS_POINT_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                 &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_KEYS_PER_RUN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_remaining_time_in_millis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;SAFETY_THRESHOLD_MS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# Stop early, resume on next scheduled run
&lt;/span&gt;            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S3_ACCESS_POINT_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;ship_to_grafana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Raises on failure
&lt;/span&gt;            &lt;span class="nf"&gt;set_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Only after confirmed delivery
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Manual test mode using an S3-event-shaped payload
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;extract_s3_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S3_ACCESS_POINT_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="nf"&gt;ship_to_grafana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;auth_header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query in Grafana Explore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{service_name="fsxn-audit"} | json | Operation="create"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. EMS Webhook Handler (&lt;code&gt;ems_handler.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Receives ONTAP EMS events via API Gateway, parses with the shared EMS parser layer, and forwards to Grafana.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
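    &lt;span class="n"&gt;auth_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_auth_header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Cached from Secrets Manager&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;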
    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_ems_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Shared Lambda Layer
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;USE_OTLP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_for_otlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_for_loki&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;ship_to_grafana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Labels: &lt;code&gt;{service_name="fsxn-ems", source="ontap", severity="alert"}&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Security note&lt;/strong&gt;: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Use API Gateway authorization controls such as an API key, IAM authorization, Lambda authorizer, resource policy, WAF, or source IP restrictions based on your network design. The quickstart template uses &lt;code&gt;AuthorizationType: NONE&lt;/code&gt; for simplicity — add appropriate controls before production use. See the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/webhook-security.md" rel="noopener noreferrer"&gt;webhook security guide&lt;/a&gt; for a full comparison of auth modes and a recommended shared-secret Lambda authorizer pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. FPolicy Handler (&lt;code&gt;fpolicy_handler.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Subscribes to EventBridge events from the FPolicy ECS Fargate server and forwards file operation events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
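    &lt;span class="n"&gt;auth_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_auth_header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Cached from Secrets Manager&lt;/span&gt;
    &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# EventBridge event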
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;USE_OTLP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_for_otlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_for_loki&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;ship_to_grafana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Labels: &lt;code&gt;{service_name="fsxn-fpolicy", source="ontap", operation="create"}&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudFormation: Three Templates, Zero Hardcoded Values
&lt;/h2&gt;

&lt;p&gt;Each template is fully parameterized:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Template&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Parameters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;template.yaml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FSx audit log poller Lambda&lt;/td&gt;
&lt;td&gt;S3AccessPointArn, GrafanaCredentialsSecretArn, LokiEndpoint, ScheduleExpression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;template-ems.yaml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;EMS webhook Lambda&lt;/td&gt;
&lt;td&gt;GrafanaCredentialsSecretArn, LokiEndpoint, EmsParserLayerArn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;template-fpolicy.yaml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FPolicy EventBridge Lambda&lt;/td&gt;
&lt;td&gt;GrafanaCredentialsSecretArn, LokiEndpoint, EventBusName&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;LokiEndpoint&lt;/code&gt; parameter accepts both OTLP Gateway and Loki Push API URLs; the Lambda auto-detects the mode from the URL. The quickstart template also sets Lambda reserved concurrency to 1 to avoid overlapping poller runs, and provisions a Scheduler retry policy and DLQ to preserve failed scheduled invocations. Processing bounds (&lt;code&gt;MAX_KEYS_PER_RUN&lt;/code&gt;, &lt;code&gt;SAFETY_THRESHOLD_MS&lt;/code&gt;) are configured via Lambda environment variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger Model: EventBridge Scheduler Polling
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points do &lt;strong&gt;not&lt;/strong&gt; support S3 Event Notifications or EventBridge &lt;code&gt;ObjectCreated&lt;/code&gt; events. Instead, this integration uses an &lt;strong&gt;EventBridge Scheduler polling pattern&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EventBridge Scheduler&lt;/strong&gt; invokes the Lambda every 5 minutes (configurable via &lt;code&gt;ScheduleExpression&lt;/code&gt; parameter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda lists new files&lt;/strong&gt; via &lt;code&gt;ListObjectsV2&lt;/code&gt; on the S3 Access Point, using &lt;code&gt;StartAfter&lt;/code&gt; to skip already-processed keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda reads and processes&lt;/strong&gt; each new file, shipping logs to Grafana Cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint&lt;/strong&gt; (SSM Parameter Store) tracks the last successfully processed S3 key — on the next invocation, only newer files are processed&lt;/li&gt;
&lt;/ol&gt;
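
&lt;p&gt;Step 2 can be sketched as below. This is an illustrative version, not the repository's implementation: it takes the S3 client as an explicit first argument (an assumption made so the function can be exercised with a stub), and it relies on &lt;code&gt;ListObjectsV2&lt;/code&gt; returning keys in lexicographic order, which you should confirm for your access point.&lt;/p&gt;

```python
def list_new_keys(s3_client, access_point_arn, prefix, last_key, limit):
    """Return up to `limit` object keys that sort after `last_key` (illustrative)."""
    keys = []
    kwargs = {"Bucket": access_point_arn, "Prefix": prefix}
    if last_key:
        kwargs["StartAfter"] = last_key  # Skip keys at or before the checkpoint
    while limit > len(keys):
        kwargs["MaxKeys"] = min(1000, limit - len(keys))
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        # Continue from where the previous page ended
        kwargs.pop("StartAfter", None)
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return keys[:limit]
```

&lt;p&gt;In a Lambda, &lt;code&gt;s3_client&lt;/code&gt; would be a boto3 S3 client and &lt;code&gt;access_point_arn&lt;/code&gt; the S3 Access Point ARN passed as the &lt;code&gt;Bucket&lt;/code&gt; argument, as in the handler excerpt earlier. Capping the result at &lt;code&gt;limit&lt;/code&gt; keeps each run bounded so a backlog is drained over several scheduled invocations.&lt;/p&gt;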

&lt;p&gt;This pattern is simple, cost-effective, and works with any read path that exposes the S3 object APIs, such as FSx for ONTAP S3 Access Points. The trade-off is polling latency (up to 5 minutes by default) compared with the near-real-time delivery of event-driven triggers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CloudTrail alternative&lt;/strong&gt;: CloudTrail data events &lt;em&gt;do&lt;/em&gt; work with FSx ONTAP S3 Access Points (confirmed by NetApp Workload Factory's Journal table feature). However, they add delivery latency (5–15 minutes end to end in my validation) and cost ($0.10 per 100,000 events), making the polling pattern the better default for this use case. See the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/cloudtrail-trigger-alternative.md" rel="noopener noreferrer"&gt;CloudTrail trigger alternative&lt;/a&gt; for a full analysis and CloudFormation example.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudFormation: EventBridge Scheduler with retry and DLQ&lt;/span&gt;
&lt;span class="na"&gt;AuditLogSchedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Scheduler::Schedule&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ScheduleExpression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ScheduleExpression&lt;/span&gt;  &lt;span class="c1"&gt;# default: rate(5 minutes)&lt;/span&gt;
    &lt;span class="na"&gt;FlexibleTimeWindow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OFF'&lt;/span&gt;
    &lt;span class="na"&gt;Target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;LogShipperFunction.Arn&lt;/span&gt;
      &lt;span class="na"&gt;RoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;SchedulerRole.Arn&lt;/span&gt;
      &lt;span class="na"&gt;Input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"source":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"scheduler",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"s3_access_point_arn":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"${S3AccessPointArn}",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"prefix":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"${S3KeyPrefix}"}'&lt;/span&gt;
      &lt;span class="na"&gt;RetryPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;MaximumRetryAttempts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;MaximumEventAgeInSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
      &lt;span class="na"&gt;DeadLetterConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;SchedulerDLQ.Arn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The handler also accepts S3 event format for manual testing via &lt;code&gt;aws lambda invoke&lt;/code&gt;, so you can still test individual files without waiting for the scheduler.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checkpoint Semantics
&lt;/h3&gt;

&lt;p&gt;The quickstart uses a simple high-watermark checkpoint: the last successfully processed object key is stored in SSM Parameter Store, and the next run lists keys after that value.&lt;/p&gt;

&lt;p&gt;This works when audit log object keys are monotonically increasing and immutable. For production, validate your audit log naming and rotation behavior. If files can arrive late, be overwritten, or appear out of lexical order, use a stronger checkpoint model such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeping a short lookback window&lt;/li&gt;
&lt;li&gt;Deduplicating by object key + ETag or LastModified&lt;/li&gt;
&lt;li&gt;Storing per-object processing state in DynamoDB&lt;/li&gt;
&lt;li&gt;Updating the checkpoint only after confirmed Grafana delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The checkpoint is advanced only after Grafana returns a successful response for that object. If delivery fails after retries, the Lambda raises an error and the next scheduled run will retry from the last checkpoint.&lt;/p&gt;

&lt;p&gt;Failure-path tests verify this behavior: if OTLP delivery returns failure after retries, the Lambda raises and the checkpoint does not advance past the failed object.&lt;/p&gt;

&lt;p&gt;Files that parse successfully but contain no shippable records are treated as successfully processed and checkpointed; only delivery failures or parse errors prevent checkpoint advancement.&lt;/p&gt;

&lt;p&gt;For production, add a poison-pill policy for files that repeatedly fail parsing or delivery; otherwise one bad file can block later audit logs when using a high-watermark checkpoint. See the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/operations.md" rel="noopener noreferrer"&gt;Grafana operations guide&lt;/a&gt; for poison-pill handling, pipeline health alarms, and custom metrics.&lt;/p&gt;

&lt;p&gt;Use SSM Parameter Store for the quickstart high-watermark checkpoint. Move to DynamoDB when you need per-object state, deduplication, replay tracking, or concurrent workers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Delivery semantics&lt;/strong&gt;: This pipeline provides at-least-once delivery, not exactly-once. If a Lambda invocation succeeds in sending logs to Grafana but fails before updating the checkpoint (e.g., timeout or transient SSM error), the next run will re-process and re-send those objects. For most observability use cases, duplicate log entries are acceptable. If deduplication is required, implement it explicitly using object key + ETag, event ID, or payload hash in DynamoDB. Do not rely on backend-side deduplication as the primary correctness mechanism.&lt;/p&gt;
&lt;/blockquote&gt;
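
&lt;p&gt;If you do move deduplication state to DynamoDB, a table keyed on object key plus ETag is one possible starting point. The fragment below is a minimal sketch, not the repository's template: the resource name &lt;code&gt;ProcessedObjectsTable&lt;/code&gt; and the attribute names &lt;code&gt;object_key&lt;/code&gt;, &lt;code&gt;etag&lt;/code&gt;, and &lt;code&gt;expires_at&lt;/code&gt; are assumptions chosen for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CloudFormation sketch (hypothetical names): per-object processing state
ProcessedObjectsTable:
  Type: AWS::DynamoDB::Table
  Properties:
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: object_key   # S3 key of the audit log file
        AttributeType: S
      - AttributeName: etag         # distinguishes overwritten versions of a key
        AttributeType: S
    KeySchema:
      - AttributeName: object_key
        KeyType: HASH
      - AttributeName: etag
        KeyType: RANGE
    TimeToLiveSpecification:
      AttributeName: expires_at     # epoch seconds; lets old entries expire
      Enabled: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this layout, the Lambda could look up the key and ETag before sending and write the item only after a successful delivery, so a re-run skips objects it has already shipped. The TTL attribute keeps the table from growing without bound.&lt;/p&gt;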

&lt;h3&gt;
  
  
  Avoid Overlapping Poller Runs
&lt;/h3&gt;

&lt;p&gt;Because the audit-log poller is schedule-driven, overlapping Lambda executions can race on the same key range and checkpoint. The quickstart template sets &lt;code&gt;ReservedConcurrentExecutions: 1&lt;/code&gt; to prevent this.&lt;/p&gt;

&lt;p&gt;For higher-volume production pipelines, use a distributed lock (e.g., DynamoDB conditional write) and per-object processing state instead of relying on single-concurrency.&lt;/p&gt;
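
&lt;p&gt;As a rough illustration of the single-concurrency guard, the fragment below shows only the relevant property. It assumes the function's logical ID is &lt;code&gt;LogShipperFunction&lt;/code&gt; (the name referenced by the schedule target); the other required function properties are omitted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CloudFormation fragment: Role, Code, Handler, and Runtime are omitted
LogShipperFunction:
  Type: AWS::Lambda::Function
  Properties:
    ReservedConcurrentExecutions: 1  # at most one poller instance runs at a time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this setting, a scheduled invocation that arrives while a previous run is still active is throttled instead of running in parallel, which is why the Scheduler retry policy and DLQ matter.&lt;/p&gt;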

&lt;p&gt;The quickstart also configures EventBridge Scheduler with a retry policy (2 retries, 1-hour event age) and a dedicated DLQ. If a scheduled invocation is throttled or fails, the event is preserved in the Scheduler DLQ for visibility and replay.&lt;/p&gt;

&lt;p&gt;These values are intentionally conservative: two retries and a one-hour maximum event age surface persistent failures quickly while avoiding unbounded retry storms. Increase them only after you have defined your Grafana endpoint outage tolerance and duplicate-handling strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Bounds
&lt;/h3&gt;

&lt;p&gt;The poller bounds work per invocation to avoid timeout-related checkpoint corruption:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max keys per run&lt;/strong&gt; (&lt;code&gt;MAX_KEYS_PER_RUN&lt;/code&gt;, default: 100): caps the number of files processed in a single invocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety threshold&lt;/strong&gt; (&lt;code&gt;SAFETY_THRESHOLD_MS&lt;/code&gt;, default: 30000): stops processing when remaining Lambda time falls below 30 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAX_KEYS_PER_RUN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum audit log files processed per invocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SAFETY_THRESHOLD_MS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;30000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stop processing before Lambda timeout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tune these values after observing Lambda duration, checkpoint age, Scheduler DLQ depth, FSx S3 Access Point read throughput, and Grafana send latency.&lt;/p&gt;

&lt;p&gt;Because the checkpoint advances after each successfully delivered object, the next scheduled run resumes safely from where the previous run stopped.&lt;/p&gt;
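
&lt;p&gt;For reference, the two bounds could be wired into the function roughly as follows. This is a sketch under assumptions: the &lt;code&gt;Timeout&lt;/code&gt; value is illustrative and not taken from the repository template, and the other required function properties are omitted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CloudFormation fragment: processing bounds as Lambda environment variables
LogShipperFunction:
  Type: AWS::Lambda::Function
  Properties:
    Timeout: 300  # assumed value; keep it well above SAFETY_THRESHOLD_MS
    Environment:
      Variables:
        MAX_KEYS_PER_RUN: '100'       # environment values must be strings
        SAFETY_THRESHOLD_MS: '30000'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The safety threshold only helps if the Lambda timeout leaves enough headroom to finish the object in flight and write the checkpoint.&lt;/p&gt;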

&lt;h3&gt;
  
  
  S3 API Compatibility Boundary
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points provide S3 object API access (GetObject, ListObjectsV2, etc.) to file data that remains on the FSx for ONTAP file system. They should not be assumed to have the same bucket-level features or eventing behavior as standard S3 buckets. In this integration, the important difference is eventing: the audit log path uses Scheduler polling instead of S3 Event Notifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimum Read-Path Permissions
&lt;/h3&gt;

&lt;p&gt;For the audit-log Lambda, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3:ListBucket&lt;/code&gt; on the S3 Access Point ARN&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3:GetObject&lt;/code&gt; on the S3 Access Point object ARN (&lt;code&gt;{arn}/object/*&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;S3 Access Point policy allows the Lambda execution role&lt;/li&gt;
&lt;li&gt;The file-system user associated with the access point has read permission on the audit log path&lt;/li&gt;
&lt;li&gt;If the access point is VPC-restricted, the Lambda network path can reach the S3 endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;IAM resource ARN examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# List access (s3:ListBucket)&lt;/span&gt;
&lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:s3:&amp;lt;region&amp;gt;:&amp;lt;account&amp;gt;:accesspoint/&amp;lt;access-point-name&amp;gt;&lt;/span&gt;

&lt;span class="c1"&gt;# Object read (s3:GetObject)&lt;/span&gt;
&lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:s3:&amp;lt;region&amp;gt;:&amp;lt;account&amp;gt;:accesspoint/&amp;lt;access-point-name&amp;gt;/object/*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
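
&lt;p&gt;Put together, the two resource ARNs might appear in a policy like the one below. Treat it as a sketch: &lt;code&gt;AuditLogReadPolicy&lt;/code&gt; and &lt;code&gt;LogShipperRole&lt;/code&gt; are hypothetical logical IDs, and &lt;code&gt;S3AccessPointArn&lt;/code&gt; is assumed to be a template parameter holding the access point ARN.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CloudFormation sketch (hypothetical names): minimum read-path policy
AuditLogReadPolicy:
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: fsxn-audit-log-read
    Roles:
      - !Ref LogShipperRole
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Sid: ListAuditLogKeys
          Effect: Allow
          Action: s3:ListBucket
          Resource: !Ref S3AccessPointArn
        - Sid: ReadAuditLogObjects
          Effect: Allow
          Action: s3:GetObject
          Resource: !Sub '${S3AccessPointArn}/object/*'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The IAM policy covers only the identity side. The access point policy and the file-system user permissions listed above still have to allow the same reads.&lt;/p&gt;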



&lt;h2&gt;
  
  
  First Success Path
&lt;/h2&gt;

&lt;p&gt;If this is your first deployment, start small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy only the audit log poller&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MAX_KEYS_PER_RUN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SAFETY_THRESHOLD_MS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30000
bash integrations/grafana/scripts/deploy.sh &lt;span class="nt"&gt;--audit-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then validate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm that &lt;code&gt;{service_name="fsxn-audit"}&lt;/code&gt; returns audit logs in Grafana Explore&lt;/li&gt;
&lt;li&gt;Check the Scheduler DLQ is empty&lt;/li&gt;
&lt;li&gt;Verify the SSM checkpoint advanced&lt;/li&gt;
&lt;li&gt;Create the dashboard&lt;/li&gt;
&lt;li&gt;Add EMS and FPolicy only after the audit path works (&lt;code&gt;deploy.sh --all&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;deploy.sh&lt;/code&gt; passes &lt;code&gt;MAX_KEYS_PER_RUN&lt;/code&gt; and &lt;code&gt;SAFETY_THRESHOLD_MS&lt;/code&gt; as Lambda environment variables. If unset, the template defaults (100 / 30000) are used.&lt;/p&gt;

&lt;p&gt;The first validation should prove three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One audit file is visible in Grafana (&lt;code&gt;{service_name="fsxn-audit"}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The SSM checkpoint advanced to the processed key&lt;/li&gt;
&lt;li&gt;The Scheduler DLQ remains empty&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One-Command Deploy and Cleanup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy all 3 stacks + update Lambda code (default is --all)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GRAFANA_SECRET_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:ap-northeast-1:&amp;lt;account&amp;gt;:secret:grafana/fsxn-loki-credentials-XXXXXX"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;S3_ACCESS_POINT_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;account&amp;gt;:accesspoint/fsxn-audit-ap"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LOKI_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"&lt;/span&gt;

bash integrations/grafana/scripts/deploy.sh &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cleanup script removes CloudFormation stacks and optionally deletes synthetic test objects. It does &lt;strong&gt;not&lt;/strong&gt; delete production FSx audit files through the FSx-attached S3 Access Point — those remain on the FSx file system. Pass &lt;code&gt;--s3-bucket&lt;/code&gt; and &lt;code&gt;--s3-prefix&lt;/code&gt; only if you uploaded test data to a regular S3 bucket during validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tear down everything (dependency-safe order)&lt;/span&gt;
bash integrations/grafana/scripts/cleanup.sh &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--s3-bucket&lt;/span&gt; your-bucket &lt;span class="nt"&gt;--s3-prefix&lt;/span&gt; audit/svm-prod-01/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cleanup script deletes stacks in dependency-safe order (API Gateway before Lambda) and handles DELETE_FAILED states gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  LogQL Query Examples
&lt;/h2&gt;

&lt;p&gt;High-cardinality fields such as &lt;code&gt;UserName&lt;/code&gt; and &lt;code&gt;ObjectName&lt;/code&gt; remain in the log body and are extracted at query time with &lt;code&gt;| json&lt;/code&gt;; they are intentionally not promoted to Loki labels to avoid index bloat and cost.&lt;/p&gt;

&lt;p&gt;Once logs arrive, Grafana Explore becomes your investigation tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All audit logs
{service_name="fsxn-audit"}

# Filter by operation
{service_name="fsxn-audit"} | json | Operation="delete"

# Failed access attempts (security investigation)
{service_name="fsxn-audit"} | json | Result="Failure"

# EMS ransomware alerts
{service_name="fsxn-ems"} | json | event_name="arw.volume.state"

# FPolicy file operations
{service_name="fsxn-fpolicy"} | json | operation="create"

# Human-readable format
{service_name="fsxn-audit"} | json | line_format "{{.UserName}} {{.Operation}} {{.ObjectName}}"

# Log volume over time (for dashboards)
count_over_time({service_name="fsxn-audit"}[5m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dashboard: 4 Panels for Storage Observability
&lt;/h2&gt;

&lt;p&gt;The repository includes a dashboard creation script, &lt;code&gt;scripts/create-dashboard.sh&lt;/code&gt;, that provisions a Grafana dashboard with four panels via the Grafana API. The panel queries below are the exact queries the script generates, verified against this project's OTLP-ingested log shape:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log Volume&lt;/strong&gt; (Time series): &lt;code&gt;count_over_time({service_name="fsxn-audit"}[5m])&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations Breakdown&lt;/strong&gt; (Pie chart): &lt;code&gt;sum by (Operation) (count_over_time({service_name="fsxn-audit"} | json [1h]))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Activity Top 10&lt;/strong&gt; (Bar gauge): &lt;code&gt;topk(10, sum by (UserName) (count_over_time({service_name="fsxn-audit"} | json [1h])))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed Events&lt;/strong&gt; (Time series): &lt;code&gt;count_over_time({service_name="fsxn-audit"} | json | Result="Failure" [5m])&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Alerting: Ransomware Detection and Security Monitoring
&lt;/h2&gt;

&lt;p&gt;Beyond dashboards, the integration includes three Grafana alerting rules provisioned via &lt;code&gt;scripts/create-alerts.sh&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The table below shows the alert conditions. The provisioning script wraps these into Grafana alert expressions using count/reduce/threshold steps.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alert&lt;/th&gt;
&lt;th&gt;Detection Query (alert condition)&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ransomware Detection (ARP)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;count_over_time({service_name="fsxn-ems"} | json | …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quota Soft Limit Exceeded&lt;/td&gt;
&lt;td&gt;&lt;code&gt;count_over_time({service_name="fsxn-ems"} | json | …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed Access Spike&lt;/td&gt;
&lt;td&gt;&lt;code&gt;count_over_time({service_name="fsxn-audit"} | json | …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rules use Grafana's unified alerting format and are deployed to a "FSxN Alerts" folder. Configure contact points (Slack, PagerDuty, email) and notification policies in the Grafana UI to route alerts by severity or team label. The rule definitions are available as &lt;code&gt;alerting/rules.yaml&lt;/code&gt;; see the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/alerting/README.md" rel="noopener noreferrer"&gt;alerting README&lt;/a&gt; for provisioning details, no-data behavior, contact point caveats, and threshold tuning guidance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;API compatibility&lt;/strong&gt;: This script uses Grafana's Alerting Provisioning HTTP API (&lt;code&gt;/api/v1/provisioning&lt;/code&gt;). Grafana 13+ introduces newer &lt;code&gt;/apis&lt;/code&gt; routes while legacy &lt;code&gt;/api&lt;/code&gt; routes remain available; check your Grafana Cloud version if provisioning fails. Provisioning alert rules does not automatically configure notification delivery — create or map contact points and notification policies before relying on these alerts for production response.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The sample rules treat "No data" as OK, because absence of matching ransomware, quota, or failed-access events is expected in normal operation. Query execution errors are routed as Error state for operator attention. These thresholds are starter defaults — tune them per SVM, workload, and normal user behavior before enabling production paging.&lt;/p&gt;

&lt;p&gt;For production, monitor the pipeline itself: Scheduler DLQ depth, Lambda errors/throttles/duration, checkpoint age, and Grafana send failures.&lt;/p&gt;
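
&lt;p&gt;One way to start on the Lambda side is a CloudWatch alarm on the function's &lt;code&gt;Errors&lt;/code&gt; metric. The fragment below is a sketch with assumed names: &lt;code&gt;LogShipperErrorsAlarm&lt;/code&gt; and the SNS topic &lt;code&gt;OpsAlertTopic&lt;/code&gt; are hypothetical, and the period and threshold are starter values rather than recommendations from the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CloudFormation sketch (hypothetical names): alarm on poller invocation errors
LogShipperErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Audit log poller reported invocation errors
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        Value: !Ref LogShipperFunction
    Statistic: Sum
    Period: 300                      # matches the default 5-minute schedule
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching   # no invocations means no errors
    AlarmActions:
      - !Ref OpsAlertTopic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same pattern applies to &lt;code&gt;Throttles&lt;/code&gt; and &lt;code&gt;Duration&lt;/code&gt;. Checkpoint age and Grafana send failures are not built-in metrics, so they would need custom metrics emitted by the Lambda.&lt;/p&gt;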

&lt;h2&gt;
  
  
  Scheduler DLQ Replay
&lt;/h2&gt;

&lt;p&gt;The Scheduler DLQ message is primarily an operational signal and replay payload. Because the poller uses a checkpoint, the next scheduled run may already retry the failed key range automatically.&lt;/p&gt;

&lt;p&gt;When a scheduled invocation fails and lands in the Scheduler DLQ:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inspect the DLQ message (contains the scheduler input payload)&lt;/li&gt;
&lt;li&gt;Check the current checkpoint in SSM Parameter Store&lt;/li&gt;
&lt;li&gt;Check whether a later scheduled run has already advanced the checkpoint and delivered the missed objects&lt;/li&gt;
&lt;li&gt;If the checkpoint has advanced and Grafana shows the data, the failure was auto-recovered — delete the DLQ message&lt;/li&gt;
&lt;li&gt;If the checkpoint has NOT advanced, the next scheduled run will retry automatically from the last checkpoint&lt;/li&gt;
&lt;li&gt;For manual replay (if auto-retry is insufficient): invoke the Lambda directly with the scheduler payload, then delete the DLQ message&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before manually replaying a DLQ message, compare the DLQ payload with the current SSM checkpoint and Grafana ingestion state to avoid duplicate delivery.&lt;/p&gt;

&lt;p&gt;For production, set a CloudWatch alarm on &lt;code&gt;ApproximateNumberOfMessagesVisible &amp;gt; 0&lt;/code&gt; for the Scheduler DLQ.&lt;/p&gt;
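
&lt;p&gt;A minimal sketch of that alarm in CloudFormation, assuming the queue's logical ID is &lt;code&gt;SchedulerDLQ&lt;/code&gt; (as referenced by the schedule's &lt;code&gt;DeadLetterConfig&lt;/code&gt;) and using a hypothetical SNS topic &lt;code&gt;OpsAlertTopic&lt;/code&gt; for notifications:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CloudFormation sketch (hypothetical names): alarm when the Scheduler DLQ is non-empty
SchedulerDLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Scheduler DLQ contains failed scheduled invocations
    Namespace: AWS/SQS
    MetricName: ApproximateNumberOfMessagesVisible
    Dimensions:
      - Name: QueueName
        Value: !GetAtt SchedulerDLQ.QueueName
    Statistic: Maximum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref OpsAlertTopic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The alarm stays in ALARM until the message is deleted or replayed, which fits the inspect-then-delete workflow described above.&lt;/p&gt;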

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Lesson&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Grafana Cloud OTLP endpoint is the recommended ingestion path; in my trial validation, OTLP Gateway succeeded while Loki Push API returned 530&lt;/td&gt;
&lt;td&gt;Use OTLP Gateway as default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Basic Auth = &lt;code&gt;base64(instanceId:apiToken)&lt;/code&gt;, not plain text&lt;/td&gt;
&lt;td&gt;Auth failures if wrong encoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Loki / Grafana Cloud can reject old timestamps depending on tenant limits; in my validation, logs older than 7 days were rejected&lt;/td&gt;
&lt;td&gt;Use current timestamps in test data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Grafana HTTP API needs a Grafana Service Account token, not the Grafana Cloud ingestion token used for OTLP writes&lt;/td&gt;
&lt;td&gt;Dashboard creation fails with wrong token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;OTLP-ingested logs use &lt;code&gt;service_name&lt;/code&gt; label, not &lt;code&gt;job&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Different query syntax than Loki Push API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;CloudFormation stack deletion order matters (API GW before Lambda)&lt;/td&gt;
&lt;td&gt;DELETE_FAILED if wrong order&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Verified Query Matrix
&lt;/h2&gt;

&lt;p&gt;In this Grafana Cloud environment, &lt;code&gt;service.name&lt;/code&gt; was exposed as the &lt;code&gt;service_name&lt;/code&gt; index label via Loki's default OTLP attribute-to-label mapping. This mapping is &lt;a href="https://grafana.com/docs/loki/latest/get-started/labels/modify-default-labels/" rel="noopener noreferrer"&gt;configurable per tenant&lt;/a&gt;, so validate labels in your own environment if queries return unexpected results.&lt;/p&gt;

&lt;p&gt;All queries tested with OTLP-ingested fields in this project's Grafana Cloud instance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Verified&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service_name="fsxn-audit"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Audit logs visible&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service_name="fsxn-audit"} | json | Operation="delete"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service_name="fsxn-audit"} | json | Result="Failure"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service_name="fsxn-ems"} | json | event_name="arw.volume.state"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{service_name="fsxn-fpolicy"} | json | operation="create"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;count_over_time({service_name="fsxn-audit"}[5m])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time series data&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Production and PoC Resources
&lt;/h2&gt;

&lt;p&gt;For deeper validation and production planning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/delivery-guarantees.md" rel="noopener noreferrer"&gt;Delivery Guarantee Patterns&lt;/a&gt; — Quickstart → Medium → Higher reliability → Multi-backend&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/webhook-security.md" rel="noopener noreferrer"&gt;Webhook Security Guide&lt;/a&gt; — Auth modes, Lambda authorizer, production baseline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/operations.md" rel="noopener noreferrer"&gt;Grafana Operations Guide&lt;/a&gt; — Alarms, tuning, poison-pill, ownership, compliance&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/cloudtrail-trigger-alternative.md" rel="noopener noreferrer"&gt;CloudTrail Trigger Alternative&lt;/a&gt; — Event-driven alternative analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/poc-checklist.md" rel="noopener noreferrer"&gt;PoC Checklist&lt;/a&gt; — Go/No-Go criteria for stakeholder sign-off&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/cost-model.md" rel="noopener noreferrer"&gt;Cost Model&lt;/a&gt; — Direct send vs Collector vs Firehose cost comparison&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/alerting/README.md" rel="noopener noreferrer"&gt;Alerting README&lt;/a&gt; — Provisioning details, thresholds, contact point caveats&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/operations.md#graduating-to-alloy" rel="noopener noreferrer"&gt;Graduating to Alloy&lt;/a&gt; — Move from direct Lambda OTLP send to an Alloy-backed telemetry pipeline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/grafana/docs/en/partner-solution-brief.md" rel="noopener noreferrer"&gt;Partner Solution Brief&lt;/a&gt; — Target customers, PoC scope, deliverables, and responsibility boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 7&lt;/strong&gt;: Splunk HEC — serverless log delivery with built-in Firehose support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elastic integration&lt;/strong&gt;: Bulk API with date-based indices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost model refinement&lt;/strong&gt;: validate the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/cost-model.md" rel="noopener noreferrer"&gt;Cost Model&lt;/a&gt; with measured volume tiers from real-world FSx for ONTAP workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Why Your FSx for ONTAP Logs Deserve Better&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Shipping FSx for ONTAP Logs to Datadog — The Serverless Way&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing"&gt;FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap-2inb"&gt;Escape Vendor Lock-in with OTel Collector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 6&lt;/strong&gt;: Direct-to-Grafana: Shipping Logs via OTLP Gateway (this post)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions about the Grafana Cloud integration or OTLP Gateway? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap"&gt;Part 5 — Escape Vendor Lock-in with OTel Collector&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>grafana</category>
      <category>observability</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP.</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Tue, 19 May 2026 09:10:53 +0000</pubDate>
      <link>https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap-2inb</link>
      <guid>https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap-2inb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We shipped the same FSx for ONTAP audit logs to &lt;strong&gt;three backends simultaneously&lt;/strong&gt; — Datadog, Grafana Cloud, and Honeycomb — without changing a single line of Lambda code. The OpenTelemetry Collector sits between our Lambda and the backends as a routing layer. Adding or removing a backend is a YAML config change, not a code deployment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same audit logs → 3 backends simultaneously&lt;/li&gt;
&lt;li&gt;Zero Lambda code changes between backends (SHA-256 verified)&lt;/li&gt;
&lt;li&gt;OTel Collector as the vendor-neutral routing layer&lt;/li&gt;
&lt;li&gt;All 3 event sources work: FSx audit logs via S3 Access Point, EMS webhooks, FPolicy file operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt;, we built a Lambda that speaks Datadog's API directly. It works great — but what happens when your security team wants Splunk, your SRE team wants Grafana, and your platform team is evaluating Honeycomb?&lt;/p&gt;

&lt;p&gt;You'd need three separate Lambdas, each with vendor-specific formatting, auth, and retry logic. That's vendor lock-in expressed as infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Vendor-Specific APIs = Lock-in
&lt;/h3&gt;

&lt;p&gt;Every observability vendor has its own wire format:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vendor&lt;/th&gt;
&lt;th&gt;Auth Header&lt;/th&gt;
&lt;th&gt;Payload Format&lt;/th&gt;
&lt;th&gt;Endpoint Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DD-API-KEY: &amp;lt;key&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Custom JSON schema&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://http-intake.logs.{site}/api/v2/logs&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Splunk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Splunk &amp;lt;token&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HEC &lt;code&gt;event&lt;/code&gt; wrapper&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://&amp;lt;host&amp;gt;:8088/services/collector/event&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Cloud&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Basic &amp;lt;b64&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTLP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://otlp-gateway-prod-&amp;lt;region&amp;gt;.grafana.net/otlp&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honeycomb&lt;/td&gt;
&lt;td&gt;&lt;code&gt;x-honeycomb-team: &amp;lt;key&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTLP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://api.honeycomb.io&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your Lambda speaks Datadog's API, switching to Grafana Cloud means rewriting your Lambda. That's the lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: OTLP as the Producer-to-Collector Contract
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry Protocol (OTLP) is the vendor-neutral producer-to-Collector contract. Our Lambda speaks OTLP — period. The OTel Collector handles routing, processing, and backend-specific export.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│ AWS Account                                                         │
│                                                                     │
│  ┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Audit Logs   │────▶│ Lambda           │     │ OTel Collector  │  │
│  │ (via S3 AP)  │────▶│ (OTLP Shipper)   │────▶│ (Docker/Fargate)│  │
│  │ EMS/FPolicy  │────▶│                  │     │                 │  │
│  └──────────────┘     └──────────────────┘     └─┬──────┬──────┬─┘  │
│                                                  │      │      │    │
└──────────────────────────────────────────────────┼──────┼──────┼────┘
                                                   │      │      │
                                                   ▼      ▼      ▼
                                              Datadog  Grafana Honeycomb
                                               (AP1)    Cloud    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda sends OTLP/HTTP to the Collector. The Collector fans out to any combination of backends. Adding Honeycomb? Add a five-line exporter block and list it in the pipeline. Dropping Datadog? Remove its four-line exporter block and its pipeline entry. No Lambda redeployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP with audit logging&lt;/strong&gt; configured (see &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt; for setup)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; installed locally (Colima works — see troubleshooting for compose compatibility)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At least one backend account&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Datadog: API key + site (e.g., &lt;code&gt;ap1.datadoghq.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Grafana Cloud: Instance ID + API token (Cloud Portal → OTLP)&lt;/li&gt;
&lt;li&gt;Honeycomb: Ingest API key (starts with &lt;code&gt;hcaik_&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS account&lt;/strong&gt; with Lambda deployment capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parts 1–4 context&lt;/strong&gt; (recommended but not required — this integration works standalone)&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;FSx for ONTAP S3 Access Point note&lt;/strong&gt;: The Lambda reads audit logs through an S3 Access Point attached to the FSx for ONTAP volume. Data remains on the FSx file system — it is not copied to a separate S3 bucket. S3 API throughput via FSx depends on the file system's &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/performance.html" rel="noopener noreferrer"&gt;provisioned throughput capacity&lt;/a&gt;, not standard S3 scaling. Validate FSx read throughput separately from Collector and backend ingest throughput.&lt;/p&gt;
&lt;/blockquote&gt;
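&lt;p&gt;As a rough illustration of that read path, the sketch below fetches one object through the access point. It assumes the caller passes in a client that behaves like &lt;code&gt;boto3.client("s3")&lt;/code&gt;, and that the access point ARN or alias is supplied where a bucket name would normally go. The function name and the placeholder values are hypothetical, not taken from the article's handler.&lt;/p&gt;

```python
from typing import Any


def read_audit_object(s3_client: Any, access_point: str, key: str) -> bytes:
    """Fetch one audit log object through the FSx-attached S3 Access Point.

    access_point is the access point ARN or alias, passed in the Bucket
    parameter. s3_client is expected to behave like boto3.client("s3").
    """
    response = s3_client.get_object(Bucket=access_point, Key=key)
    # Body is a streaming object; read() pulls the bytes from the FSx volume.
    return response["Body"].read()


# Hypothetical wiring inside the Lambda (placeholder values):
#   import boto3
#   data = read_audit_object(boto3.client("s3"), "EXAMPLE-ACCESS-POINT-ALIAS", "audit/example.log")
```

&lt;p&gt;Because every read is served by the file system, a burst of these calls draws on FSx throughput capacity rather than on S3 scaling.&lt;/p&gt;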

&lt;h2&gt;
  
  
  The OTel Collector Configuration
&lt;/h2&gt;

&lt;p&gt;The Collector config is the heart of this pattern. Here's the full verified configuration for multi-backend delivery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ VERIFIED WORKING (2026-05-18)&lt;/span&gt;
&lt;span class="c1"&gt;# Image: otel/opentelemetry-collector-contrib:0.152.0&lt;/span&gt;
&lt;span class="c1"&gt;# Backends: Grafana Cloud (ap-northeast-0) + Honeycomb&lt;/span&gt;

&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# memory_limiter:        # Recommended for production&lt;/span&gt;
  &lt;span class="c1"&gt;#   check_interval: 1s&lt;/span&gt;
  &lt;span class="c1"&gt;#   limit_mib: 512&lt;/span&gt;
  &lt;span class="c1"&gt;#   spike_limit_mib: 128&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:GRAFANA_OTLP_ENDPOINT}&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${env:GRAFANA_BASIC_AUTH}"&lt;/span&gt;

  &lt;span class="na"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://api.honeycomb.io&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_DATASET}&lt;/span&gt;

&lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:13133&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Depending on your Honeycomb environment and dataset model, &lt;code&gt;x-honeycomb-dataset&lt;/code&gt; may be optional or handled differently. Refer to your &lt;a href="https://docs.honeycomb.io/send-data/opentelemetry/" rel="noopener noreferrer"&gt;Honeycomb OTLP setup page&lt;/a&gt; for the recommended configuration.&lt;/p&gt;

&lt;p&gt;This article uses &lt;code&gt;otlp_http&lt;/code&gt; (the forward-compatible component name). If your Collector version does not recognize it, use the older &lt;code&gt;otlphttp&lt;/code&gt; alias or upgrade the Collector.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Section Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Settings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;receivers.otlp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepts OTLP/HTTP from Lambda&lt;/td&gt;
&lt;td&gt;Port 4318 (OTLP standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;processors.batch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Buffers logs before export&lt;/td&gt;
&lt;td&gt;Flushes after 5s or 1000 records, whichever comes first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exporters.otlp_http/*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sends to each backend&lt;/td&gt;
&lt;td&gt;Per-backend auth headers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extensions.health_check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Liveness probe&lt;/td&gt;
&lt;td&gt;Port 13133 for &lt;code&gt;curl -f&lt;/code&gt; checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.pipelines&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wires components together&lt;/td&gt;
&lt;td&gt;logs: receiver → processor → exporters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
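&lt;p&gt;The liveness probe in the table can also be scripted. The following is a minimal sketch of a Python equivalent of the &lt;code&gt;curl -f&lt;/code&gt; check, assuming the &lt;code&gt;health_check&lt;/code&gt; extension is reachable on port 13133 at its default path and answers HTTP 200 when healthy. The function name and the &lt;code&gt;localhost&lt;/code&gt; URL are illustrative.&lt;/p&gt;

```python
import urllib.error
import urllib.request


def collector_is_healthy(base_url: str = "http://localhost:13133", timeout: float = 2.0) -> bool:
    """Return True when the health_check endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, a timeout, or an HTTP error status all count as unhealthy.
        return False
```

&lt;p&gt;A deploy script could call this in a loop and only start the Lambda-side test once it returns &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;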

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production note&lt;/strong&gt;: This configuration is suitable for development and validation. For production, add &lt;code&gt;retry_on_failure&lt;/code&gt; and &lt;code&gt;sending_queue&lt;/code&gt; settings to the exporters, enable the &lt;code&gt;memory_limiter&lt;/code&gt; processor (commented out in the configuration above), and consider a persistent storage extension. Without persistent buffering, telemetry in the Collector's in-memory batch can be lost during Collector restarts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Adding Datadog as a Third Backend
&lt;/h3&gt;

&lt;p&gt;To send to all three simultaneously, add the Datadog exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... existing grafana + honeycomb exporters ...&lt;/span&gt;

  &lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:DD_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;site&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:DD_SITE}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Restart the Collector. Same Lambda, same OTLP payload, now three destinations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For Datadog, this example uses the Collector's dedicated &lt;code&gt;datadog&lt;/code&gt; exporter rather than generic &lt;code&gt;otlp_http&lt;/code&gt;, because it handles Datadog-specific intake behavior, metadata mapping, and host tagging.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Lambda Handler (OTLP Shipper)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Design Decisions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why OTLP?&lt;/strong&gt; — It gives the Lambda a single producer-to-Collector contract. The Collector then handles each backend's supported exporter or intake path. One format to maintain, not three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why no vendor SDK?&lt;/strong&gt; — SDKs add cold start latency, dependency management, and vendor coupling. Pure &lt;code&gt;urllib3&lt;/code&gt; + JSON keeps the Lambda lean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why AUTH_MODE?&lt;/strong&gt; — Different Collectors may need different auth. The Lambda supports &lt;code&gt;none&lt;/code&gt;, &lt;code&gt;basic&lt;/code&gt;, and &lt;code&gt;bearer&lt;/code&gt; modes without code changes.&lt;/li&gt;
&lt;/ol&gt;
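&lt;p&gt;To make the third decision concrete, here is one way the mode switch could be sketched. Only &lt;code&gt;AUTH_MODE&lt;/code&gt; and its three values come from this article; the function name and the &lt;code&gt;AUTH_USERNAME&lt;/code&gt;, &lt;code&gt;AUTH_PASSWORD&lt;/code&gt;, and &lt;code&gt;AUTH_TOKEN&lt;/code&gt; variables are hypothetical names chosen for the example.&lt;/p&gt;

```python
import base64
import os


def build_auth_headers(env=None) -> dict:
    """Translate AUTH_MODE into HTTP headers for the Collector request.

    AUTH_USERNAME, AUTH_PASSWORD, and AUTH_TOKEN are hypothetical variable
    names used only for this sketch.
    """
    env = os.environ if env is None else env
    mode = env.get("AUTH_MODE", "none").lower()
    if mode == "none":
        return {}
    if mode == "basic":
        raw = f"{env['AUTH_USERNAME']}:{env['AUTH_PASSWORD']}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if mode == "bearer":
        return {"Authorization": f"Bearer {env['AUTH_TOKEN']}"}
    # Fail loudly on a typo instead of silently sending unauthenticated requests.
    raise ValueError(f"Unsupported AUTH_MODE: {mode!r}")
```

&lt;p&gt;Keeping the switch in one small function means the sending code only ever merges a dictionary of headers, whatever the Collector in front of it requires.&lt;/p&gt;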

&lt;h3&gt;
  
  
  Field Mapping: FSx ONTAP → OTLP Attributes
&lt;/h3&gt;

&lt;p&gt;The Lambda maps FSx ONTAP audit fields to semantic OTLP attribute keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;FSx ONTAP Field&lt;/th&gt;
&lt;th&gt;OTLP Attribute Key&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EventID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;event.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4663&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UserName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;user.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;admin@corp.local&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ClientIP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;client.address&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.50&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReadData&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ObjectName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/vol/data/reports/q4.xlsx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Success&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SVMName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.svm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;The examples above focus on audit logs read through the S3 Access Point because they are the highest-volume path. The same OTLP shipper pattern is reused for EMS webhook events and FPolicy file operations using source-specific field mappers (&lt;code&gt;ems_handler.py&lt;/code&gt;, &lt;code&gt;fpolicy_handler.py&lt;/code&gt;), while preserving the same Collector-facing OTLP contract. For EMS and FPolicy, source-specific service names are used (&lt;code&gt;fsxn-ems&lt;/code&gt;, &lt;code&gt;fsxn-fpolicy&lt;/code&gt;) to distinguish event sources in the backend.&lt;/p&gt;
&lt;/blockquote&gt;
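&lt;p&gt;The mapping table above translates fairly directly into code. The sketch below shows one plausible shape for a &lt;code&gt;map_log_record&lt;/code&gt; helper that emits an OTLP/JSON &lt;code&gt;logRecord&lt;/code&gt;. It is not the handler's actual implementation: the &lt;code&gt;FIELD_MAP&lt;/code&gt; name, the use of the observation time, and the JSON-encoded body are assumptions made for illustration.&lt;/p&gt;

```python
import json
import time
from typing import Any

# FSx ONTAP audit field -> OTLP attribute key, taken from the table above.
FIELD_MAP = {
    "EventID": "event.type",
    "UserName": "user.name",
    "ClientIP": "client.address",
    "Operation": "fsxn.operation",
    "ObjectName": "fsxn.path",
    "Result": "fsxn.result",
    "SVMName": "fsxn.svm",
}


def map_log_record(log: dict[str, Any]) -> dict[str, Any]:
    """Convert one parsed audit record into an OTLP/JSON logRecord."""
    attributes = [
        {"key": otlp_key, "value": {"stringValue": str(log[field])}}
        for field, otlp_key in FIELD_MAP.items()
        if log.get(field) is not None
    ]
    result = str(log.get("Result") or "").lower()
    is_warn = any(word in result for word in ("fail", "denied", "error"))
    return {
        # Assumption: no source timestamp field is shown in the article, so only
        # the observation time is set. OTLP/JSON encodes 64-bit integers as strings.
        "observedTimeUnixNano": str(time.time_ns()),
        "severityNumber": 13 if is_warn else 9,
        "severityText": "WARN" if is_warn else "INFO",
        "body": {"stringValue": json.dumps(log, sort_keys=True, default=str)},
        "attributes": attributes,
    }
```

&lt;p&gt;Fields missing from a record are skipped rather than sent as empty strings, which keeps backend facets free of blank values.&lt;/p&gt;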

&lt;p&gt;Resource-level attributes (set once per payload, not per log record):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn-audit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Service identification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cloud.provider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cloud context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cloud.platform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws_fsx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Platform context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cloud.platform=aws_fsx&lt;/code&gt; is a project-specific value used to identify FSx for ONTAP as the data source. It is not part of the &lt;a href="https://opentelemetry.io/docs/specs/semconv/resource/cloud/" rel="noopener noreferrer"&gt;OpenTelemetry semantic conventions&lt;/a&gt; standard &lt;code&gt;cloud.platform&lt;/code&gt; values (which include &lt;code&gt;aws_ec2&lt;/code&gt;, &lt;code&gt;aws_ecs&lt;/code&gt;, &lt;code&gt;aws_eks&lt;/code&gt;, &lt;code&gt;aws_lambda&lt;/code&gt;, etc.).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Severity Determination Logic
&lt;/h3&gt;

&lt;p&gt;The Lambda determines OTLP severity from the &lt;code&gt;Result&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="n"&gt;WARN_KEYWORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;determine_severity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determine OTLP severity from FSx ONTAP Result field.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;WARN_KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means failed access attempts (&lt;code&gt;Result: "Failure"&lt;/code&gt;) automatically get &lt;code&gt;severityNumber: 13&lt;/code&gt; (WARN), making them easy to filter in any backend.&lt;/p&gt;

&lt;p&gt;The Lambda sets both &lt;code&gt;severityNumber&lt;/code&gt; and &lt;code&gt;severityText&lt;/code&gt; according to the &lt;a href="https://opentelemetry.io/docs/specs/otel/logs/data-model/#severity-fields" rel="noopener noreferrer"&gt;OpenTelemetry Logs Data Model&lt;/a&gt; severity level definitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  OTLP Payload Construction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
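&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_otlp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;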
    &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build OTLP Log Data Model payload.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;log_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;map_log_record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resourceLogs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attributes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud.provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud.platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_fsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scopeLogs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fsxn-otel-shipper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logRecords&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;log_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No vendor SDK. No vendor-specific formatting. Just the OTLP Log Data Model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry with Exponential Backoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
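&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib3&lt;/span&gt;

&lt;span class="c1"&gt;# Module-level pool so warm invocations reuse connections&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PoolManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;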
&lt;span class="n"&gt;BASE_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# seconds
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_send_otlp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send OTLP payload via HTTP POST with retry logic.

    Retries on HTTP 429 and 5xx. Does not retry on 4xx (except 429).
    Exponential backoff: 2s, 4s, 8s with jitter.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;json_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# One initial attempt plus up to MAX_RETRIES retries&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json_body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# retries exhausted; skip the final sleep&lt;/span&gt;
            &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BASE_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="c1"&gt;# Client error (4xx except 429) — don't retry
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AUTH_MODE Support
&lt;/h3&gt;

&lt;p&gt;The Lambda supports three authentication modes via the &lt;code&gt;AUTH_MODE&lt;/code&gt; environment variable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AUTH_MODE&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No auth headers sent&lt;/td&gt;
&lt;td&gt;Local Collector (no auth needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;basic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Basic &amp;lt;base64(token)&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Grafana Cloud direct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bearer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generic OTLP endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When using the Collector pattern, set &lt;code&gt;AUTH_MODE=none&lt;/code&gt; on the Lambda — the Collector handles backend auth via its own config.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the direct auth modes (&lt;code&gt;basic&lt;/code&gt;, &lt;code&gt;bearer&lt;/code&gt;) only for testing, or when the Lambda sends straight to a backend and bypasses the Collector. In the multi-backend pattern, keep &lt;code&gt;AUTH_MODE=none&lt;/code&gt; so that backend credentials live only in the Collector config.&lt;/p&gt;
&lt;/blockquote&gt;
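
&lt;p&gt;The three modes in the table can be sketched as a small header builder. This is an illustration only, not the repository's handler code: the function name &lt;code&gt;build_auth_headers&lt;/code&gt; and its arguments are hypothetical, and it assumes the token arrives as a plain string that the &lt;code&gt;basic&lt;/code&gt; mode base64-encodes as shown in the table.&lt;/p&gt;

```python
import base64


def build_auth_headers(auth_mode, token=""):
    """Return the HTTP auth headers for a given AUTH_MODE value.

    Hypothetical helper for illustration; the real handler may differ.
    """
    mode = auth_mode.strip().lower()
    if mode == "none":
        # Collector pattern: the Collector holds the backend credentials.
        return {}
    if mode == "basic":
        encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
        return {"Authorization": "Basic " + encoded}
    if mode == "bearer":
        return {"Authorization": "Bearer " + token}
    # Fail loudly rather than fall back to unauthenticated requests.
    raise ValueError("unsupported AUTH_MODE: " + repr(auth_mode))
```

&lt;p&gt;Rejecting unknown values keeps a typo in &lt;code&gt;AUTH_MODE&lt;/code&gt; from silently sending unauthenticated requests.&lt;/p&gt;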

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Local Development: Docker Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Configure credentials&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;integrations/otel-collector
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env with your backend credentials:&lt;/span&gt;
&lt;span class="c"&gt;#   GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp&lt;/span&gt;
&lt;span class="c"&gt;#   GRAFANA_BASIC_AUTH=&amp;lt;base64(instanceId:apiToken)&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;#   HONEYCOMB_API_KEY=hcaik_&amp;lt;your-ingest-key&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;#   HONEYCOMB_DATASET=fsxn-audit&lt;/span&gt;

&lt;span class="c"&gt;# 2. Start OTel Collector&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; otel-collector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="nt"&gt;-p&lt;/span&gt; 13133:13133 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env &lt;span class="se"&gt;\&lt;/span&gt;
  otel/opentelemetry-collector-contrib:0.152.0

&lt;span class="c"&gt;# 3. Verify health&lt;/span&gt;
curl &lt;span class="nt"&gt;-f&lt;/span&gt; http://localhost:13133/
&lt;span class="c"&gt;# Expected: HTTP 200 — {"status":"Server available", ...}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;health_check&lt;/code&gt; extension confirms the Collector process is available; it does not guarantee that each backend exporter is successfully delivering logs. To catch delivery failures, monitor exporter errors separately through the Collector's internal telemetry metrics, if they are enabled and exposed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 4. Send a test payload&lt;/span&gt;
bash scripts/generate-otlp-payload.sh &lt;span class="nt"&gt;--output&lt;/span&gt; /tmp/payload.json
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:4318/v1/logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @/tmp/payload.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Colima users&lt;/strong&gt;: Colima does not bundle the &lt;code&gt;docker compose&lt;/code&gt; v2 plugin, so it may be missing from your environment. All scripts in this repo detect this and fall back to &lt;code&gt;docker run&lt;/code&gt;. If you see "docker compose: command not found", that is expected behavior.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  First Success Path
&lt;/h3&gt;

&lt;p&gt;If you're trying this for the first time, start small:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the Collector locally with &lt;strong&gt;one&lt;/strong&gt; backend.&lt;/li&gt;
&lt;li&gt;Send one fresh OTLP payload.&lt;/li&gt;
&lt;li&gt;Confirm the event appears in that backend.&lt;/li&gt;
&lt;li&gt;Add the second exporter.&lt;/li&gt;
&lt;li&gt;Only then add further backends or move to the AWS deployment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps the first validation focused on the producer-to-Collector contract before introducing backend parity and production networking.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Deployment: CloudFormation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/otel-collector/template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-otel-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;S3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;OtlpEndpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://&amp;lt;your-collector-endpoint&amp;gt;:4318 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-otel-key-XXXXXX &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;AuthMode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This template deploys the Lambda-side OTLP shipper. The Collector endpoint must already be reachable from the Lambda — for example, a local Collector for development, an EC2-hosted Collector, or an ECS/Fargate-based Collector in the same VPC. If the Lambda is in a VPC, ensure security groups allow outbound TCP 4318 to the Collector. See the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/vpc-deployment.md" rel="noopener noreferrer"&gt;VPC Deployment Guide&lt;/a&gt; and &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/security-hardening.md" rel="noopener noreferrer"&gt;Security Hardening Guide&lt;/a&gt; for production Collector deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the Collector handles auth, set &lt;code&gt;AuthMode=none&lt;/code&gt; on the Lambda. The Collector config contains the per-backend credentials via environment variables (sourced from &lt;code&gt;.env&lt;/code&gt; or Secrets Manager in production).&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Variables
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Lambda&lt;/th&gt;
&lt;th&gt;Collector&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OTLP_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Collector URL (e.g., &lt;code&gt;http://collector:4318&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AUTH_MODE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;none&lt;/code&gt; / &lt;code&gt;basic&lt;/code&gt; / &lt;code&gt;bearer&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SERVICE_NAME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;OTLP &lt;code&gt;service.name&lt;/code&gt; attribute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GRAFANA_OTLP_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Grafana Cloud OTLP gateway URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GRAFANA_BASIC_AUTH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;base64(instanceId:apiToken)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HONEYCOMB_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Ingest key (hcaik_...)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HONEYCOMB_DATASET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Dataset name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Datadog API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_SITE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Datadog site (&lt;code&gt;datadoghq.com&lt;/code&gt;, &lt;code&gt;datadoghq.eu&lt;/code&gt;, &lt;code&gt;ap1.datadoghq.com&lt;/code&gt;, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
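
&lt;p&gt;As a rough sketch of how the Lambda-side variables in the table might be read and validated at startup: the function name &lt;code&gt;load_lambda_config&lt;/code&gt;, the default values, and the choice to append &lt;code&gt;/v1/logs&lt;/code&gt; to &lt;code&gt;OTLP_ENDPOINT&lt;/code&gt; are assumptions made for illustration, not the repository's actual behavior.&lt;/p&gt;

```python
import os

VALID_AUTH_MODES = ("none", "basic", "bearer")


def load_lambda_config(env=None):
    """Read the Lambda-side settings from environment variables.

    Hypothetical helper; defaults below are illustrative assumptions.
    """
    env = os.environ if env is None else env

    endpoint = env.get("OTLP_ENDPOINT", "").strip().rstrip("/")
    if not endpoint:
        raise ValueError("OTLP_ENDPOINT is required")

    auth_mode = env.get("AUTH_MODE", "none").strip().lower()
    if auth_mode not in VALID_AUTH_MODES:
        raise ValueError("AUTH_MODE must be one of " + ", ".join(VALID_AUTH_MODES))

    return {
        # OTLP/HTTP log records are posted to the /v1/logs path.
        "logs_url": endpoint + "/v1/logs",
        "auth_mode": auth_mode,
        # Assumed default; the real handler may use a different name.
        "service_name": env.get("SERVICE_NAME", "fsxn-audit"),
    }
```

&lt;p&gt;Validating once at cold start surfaces a misconfigured stack immediately instead of on the first failed export.&lt;/p&gt;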

&lt;h2&gt;
  
  
  Verified Results
&lt;/h2&gt;

&lt;p&gt;All backends were tested on 2026-05-18 using &lt;code&gt;otel/opentelemetry-collector-contrib:0.152.0&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Region/Site&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Event Sources&lt;/th&gt;
&lt;th&gt;Auth Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;ap1.datadoghq.com&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;S3 audit + EMS + FPolicy&lt;/td&gt;
&lt;td&gt;Datadog exporter (&lt;code&gt;DD-API-KEY&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Cloud&lt;/td&gt;
&lt;td&gt;ap-northeast-0&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;S3 audit + EMS + FPolicy&lt;/td&gt;
&lt;td&gt;Basic Auth via &lt;code&gt;otlp_http&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honeycomb&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;S3 audit + EMS + FPolicy&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;x-honeycomb-team&lt;/code&gt; via &lt;code&gt;otlp_http&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Backend&lt;/td&gt;
&lt;td&gt;Grafana + Honeycomb&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;Simultaneous delivery&lt;/td&gt;
&lt;td&gt;Both auth methods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Backend&lt;/td&gt;
&lt;td&gt;Datadog + Grafana + Honeycomb&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;Simultaneous 3-way delivery&lt;/td&gt;
&lt;td&gt;All three exporters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three backends received the same structured attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;event.type&lt;/code&gt;, &lt;code&gt;user.name&lt;/code&gt;, &lt;code&gt;client.address&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsxn.operation&lt;/code&gt;, &lt;code&gt;fsxn.path&lt;/code&gt;, &lt;code&gt;fsxn.result&lt;/code&gt;, &lt;code&gt;fsxn.svm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cloud.provider=aws&lt;/code&gt;, &lt;code&gt;cloud.platform=aws_fsx&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;OTLP standardizes the producer-to-Collector contract, but backend-specific indexing, query semantics, and retention behavior still need to be validated per destination. OpenTelemetry is not a backend — it defines APIs, protocols, and Collector components for telemetry generation, collection, processing, and export. Storage, visualization, and alerting are handled by the backends themselves. See the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/backend-parity-matrix.md" rel="noopener noreferrer"&gt;Backend Parity Matrix&lt;/a&gt; and &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/poc-checklist.md" rel="noopener noreferrer"&gt;PoC Checklist&lt;/a&gt; for backend-specific validation details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Proof: Zero Code Changes
&lt;/h2&gt;

&lt;p&gt;Here's the key evidence. The Lambda handler's SHA-256 hash is identical regardless of which backend receives the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;shasum &lt;span class="nt"&gt;-a&lt;/span&gt; 256 integrations/otel-collector/lambda/handler.py
&lt;span class="c"&gt;# Same hash whether targeting Datadog, Grafana Cloud, or Honeycomb&lt;/span&gt;
&lt;span class="c"&gt;# The file never changes — only the Collector config does&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
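
&lt;p&gt;The same check can be scripted, for example in CI. The sketch below uses only the standard library; the helper names &lt;code&gt;sha256_of_file&lt;/code&gt; and &lt;code&gt;handler_unchanged&lt;/code&gt; are hypothetical, and the expected hash is whatever you recorded from an earlier run.&lt;/p&gt;

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def handler_unchanged(path, expected_hex):
    """True when the file's digest matches a previously recorded one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

&lt;p&gt;Record the digest once, switch the Collector config to another backend, and call &lt;code&gt;handler_unchanged&lt;/code&gt; again with the handler path to confirm the shipper was not touched.&lt;/p&gt;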



&lt;p&gt;What changes between backends? &lt;strong&gt;Only the OTel Collector config file.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Demonstration: Adding a Backend
&lt;/h3&gt;

&lt;p&gt;Starting state: Grafana Cloud only.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: single backend&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding Honeycomb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After: add 5 lines to exporters section + update pipeline&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://api.honeycomb.io&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_DATASET}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the Collector. Done. No Lambda redeployment, no code review, no CI/CD pipeline for the shipper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demonstration: Removing a Backend
&lt;/h3&gt;

&lt;p&gt;Dropping Datadog during a migration to Grafana Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remove from exporters list — that's it&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# removed: datadog&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Timestamp Rejection / Static Payload Gotcha
&lt;/h3&gt;

&lt;p&gt;Datadog documents that logs older than 18 hours are dropped at intake (&lt;a href="https://docs.datadoghq.com/api/latest/logs/" rel="noopener noreferrer"&gt;Datadog Logs API docs&lt;/a&gt;). Other backends may also reject or hide events with timestamps outside their accepted windows. In my testing, future timestamps also caused ingestion issues on some backends. When testing with static payloads, always generate fresh timestamps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Use the payload generator to create fresh timestamps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/generate-otlp-payload.sh &lt;span class="nt"&gt;--output&lt;/span&gt; /tmp/fresh-payload.json
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:4318/v1/logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @/tmp/fresh-payload.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
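
&lt;p&gt;If you would rather build the test payload in code, the sketch below shows the idea: stamp each record with the current time in the OTLP/JSON &lt;code&gt;timeUnixNano&lt;/code&gt; field. It is not the repository's generator script; the function name, the sample body, and the attribute values are hypothetical.&lt;/p&gt;

```python
import json
import time


def build_otlp_logs_payload(body, attributes, service_name="fsxn-audit"):
    """Build a single-record OTLP/JSON logs payload with a fresh timestamp."""
    # OTLP/JSON encodes 64-bit integers as decimal strings.
    now_ns = str(time.time_ns())

    def kv(key, value):
        return {"key": key, "value": {"stringValue": str(value)}}

    record = {
        "timeUnixNano": now_ns,
        "observedTimeUnixNano": now_ns,
        "severityText": "INFO",
        "body": {"stringValue": body},
        "attributes": [kv(k, v) for k, v in sorted(attributes.items())],
    }
    return {
        "resourceLogs": [{
            "resource": {"attributes": [kv("service.name", service_name)]},
            "scopeLogs": [{"logRecords": [record]}],
        }]
    }


if __name__ == "__main__":
    # Sample values are placeholders, not real audit data.
    sample = build_otlp_logs_payload(
        "file deleted",
        {"event.type": "audit", "fsxn.operation": "delete"},
    )
    print(json.dumps(sample, indent=2))
```

&lt;p&gt;Because the timestamp is taken when the payload is built, regenerate it right before each send rather than reusing a saved file.&lt;/p&gt;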



&lt;h3&gt;
  
  
  Grafana Cloud Auth Format
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;loki&lt;/code&gt; exporter is &lt;strong&gt;NOT&lt;/strong&gt; the correct approach for OTLP → Grafana Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;code&gt;loki&lt;/code&gt; exporter with Loki push API&lt;/li&gt;
&lt;li&gt;✅ &lt;code&gt;otlp_http/grafana&lt;/code&gt; with OTLP gateway endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Basic Auth value must be &lt;code&gt;base64(instanceId:apiToken)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the auth value&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;your-instance-id&amp;gt;:&amp;lt;your-grafana-cloud-api-token&amp;gt;"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where the instance ID is your numeric Grafana Cloud instance ID (found in Cloud Portal → OTLP configuration).&lt;/p&gt;

&lt;h3&gt;
  
  
  Honeycomb Key Types
&lt;/h3&gt;

&lt;p&gt;Honeycomb has two key types. Only ingest keys work for data ingestion:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key Prefix&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Works for OTLP?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hcaik_&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ingest API key&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hcxik_&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Environment key&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you see &lt;code&gt;401 Unauthorized&lt;/code&gt; from Honeycomb, check your key prefix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Colima Docker Compose Compatibility
&lt;/h3&gt;

&lt;p&gt;Colima does not bundle the &lt;code&gt;docker compose&lt;/code&gt; v2 plugin, so it may be missing in Colima environments. All scripts in this repository detect this automatically and fall back to &lt;code&gt;docker run&lt;/code&gt;. This is expected, not an error.&lt;/p&gt;

&lt;p&gt;If you need compose-like orchestration on Colima, use the explicit &lt;code&gt;docker run&lt;/code&gt; commands shown in the Deployment section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Mistake: loki Exporter vs otlp_http
&lt;/h3&gt;

&lt;p&gt;A frequent misconfiguration when targeting Grafana Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ WRONG — loki exporter uses Loki-specific push API&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;loki&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://logs-prod-&amp;lt;region&amp;gt;.grafana.net/loki/api/v1/push&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ CORRECT — otlp_http uses the OTLP gateway&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://otlp-gateway-prod-&amp;lt;region&amp;gt;.grafana.net/otlp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OTLP gateway is Grafana Cloud's native OTLP ingestion endpoint. It handles logs, metrics, and traces through a single URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Model: How to Think About It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda Cost (OTLP Path vs Direct Send)
&lt;/h3&gt;

&lt;p&gt;In my validation, the OTLP Lambda was simpler and shorter-lived than the vendor-specific direct-send path. Your duration will vary depending on batching, payload size, network path, and backend response time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Direct Send (Part 2)&lt;/th&gt;
&lt;th&gt;OTLP + Collector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda complexity&lt;/td&gt;
&lt;td&gt;Vendor formatting + HTTP + retry&lt;/td&gt;
&lt;td&gt;OTLP POST to nearby Collector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda memory&lt;/td&gt;
&lt;td&gt;256MB&lt;/td&gt;
&lt;td&gt;256MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor SDK deps&lt;/td&gt;
&lt;td&gt;Yes (adds cold start)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry complexity&lt;/td&gt;
&lt;td&gt;Per-vendor&lt;/td&gt;
&lt;td&gt;Delegated to Collector&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  OTel Collector Cost
&lt;/h3&gt;

&lt;p&gt;The Collector adds a fixed infrastructure cost that does not depend on event volume. Which deployment option fits depends on the environment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Docker on local machine&lt;/td&gt;
&lt;td&gt;Development, testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker on EC2 Spot (t3.small)&lt;/td&gt;
&lt;td&gt;Low-volume production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate (0.5 vCPU, 1GB)&lt;/td&gt;
&lt;td&gt;Production (no OS management)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate + NAT Gateway&lt;/td&gt;
&lt;td&gt;VPC-internal production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When to Use Each Pattern
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single vendor, low volume&lt;/td&gt;
&lt;td&gt;Direct Send (Part 2 pattern) — no Collector overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single vendor, high volume&lt;/td&gt;
&lt;td&gt;Collector (buffering + backpressure benefits)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-vendor evaluation&lt;/td&gt;
&lt;td&gt;Collector (add/remove exporters freely)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor migration in progress&lt;/td&gt;
&lt;td&gt;Collector (parallel delivery during cutover)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance: logs in multiple systems&lt;/td&gt;
&lt;td&gt;Collector (fan-out is a config change)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because the Collector's infrastructure cost is fixed regardless of volume, the Collector path becomes more cost-effective as volume grows or vendors multiply: it processes each event once and fans out outside the Lambda. Direct-send can also fan out within one Lambda, but that pushes vendor-specific formatting, retry behavior, and failure isolation back into application code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Backend ingest and retention costs are not included in this AWS-side comparison. Datadog, Grafana Cloud, and Honeycomb each have their own pricing models, and those can become the dominant cost at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When to Use This Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Vendor Evaluation
&lt;/h3&gt;

&lt;p&gt;Want to try Honeycomb for a month alongside your existing Datadog setup? Add one exporter to the Collector config. No Lambda redeployment. No risk to your existing pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance: Logs in Multiple Systems
&lt;/h3&gt;

&lt;p&gt;Some organizations require audit logs in multiple systems: the security team uses Splunk, the dev team uses Datadog, and the compliance team needs a cold archive. The Collector fans out to all of them simultaneously from a single OTLP stream.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Between Vendors
&lt;/h3&gt;

&lt;p&gt;Moving from Datadog to Grafana Cloud? Run both exporters in parallel during migration. Verify data parity in the new system. Remove the old exporter when satisfied. Zero-downtime vendor migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimization: Route by Volume
&lt;/h3&gt;

&lt;p&gt;Use the Collector's routing connector or filter processors to send high-volume, noisy logs (read operations) to a cheaper backend while keeping security-critical events (deletes, permission changes) on a premium platform with alerting.&lt;/p&gt;
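
&lt;p&gt;The decision rule behind that split can be sketched in a few lines. This only illustrates the logic; in practice the rule would live in the Collector config, not in Python. The operation names in &lt;code&gt;NOISY_OPERATIONS&lt;/code&gt; and the backend labels are hypothetical, since the actual values of &lt;code&gt;fsxn.operation&lt;/code&gt; depend on your audit source.&lt;/p&gt;

```python
# Hypothetical operation names; check what your audit source actually emits.
NOISY_OPERATIONS = frozenset({"read", "list"})


def choose_backend(attributes):
    """Pick a destination label from a log record's attributes.

    Only operations known to be noisy go to the cheaper backend;
    everything else, including unknown operations, stays on the
    premium platform so alerting never misses an unexpected event.
    """
    operation = str(attributes.get("fsxn.operation", "")).strip().lower()
    if operation in NOISY_OPERATIONS:
        return "low_cost"
    return "premium"
```

&lt;p&gt;Defaulting to the premium backend is the conservative choice: a new or misspelled operation costs a little more to store but is never hidden from alerting.&lt;/p&gt;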

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;For production hardening, the repository includes guides covering VPC deployment, health monitoring, persistent buffering, security hardening, and benchmarking. Auto-scaling and Multi-AZ deployment are natural next steps for production Collector operations.&lt;/p&gt;

&lt;p&gt;The full set of guides for production and partner-led deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/architecture-decision.md" rel="noopener noreferrer"&gt;Architecture Decision Record&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/vpc-deployment.md" rel="noopener noreferrer"&gt;VPC Deployment Guide&lt;/a&gt; — private networking, security groups, and Collector reachability from Lambda&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/config-governance.md" rel="noopener noreferrer"&gt;Config Governance Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/security-hardening.md" rel="noopener noreferrer"&gt;Security Hardening Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/operations-guide.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/cost-model.md" rel="noopener noreferrer"&gt;Cost Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/poc-checklist.md" rel="noopener noreferrer"&gt;PoC Checklist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/routing-filtering-examples.md" rel="noopener noreferrer"&gt;Routing and Filtering Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/compliance-note.md" rel="noopener noreferrer"&gt;Compliance Evidence Note&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/migration-guide.md" rel="noopener noreferrer"&gt;Migration Guide&lt;/a&gt; — zero-downtime migration from direct-send to the Collector path&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/otel-semantic-mapping.md" rel="noopener noreferrer"&gt;OTel Semantic Mapping Guide&lt;/a&gt; — standard vs project-specific attributes, schema evolution, and what OTLP does not solve&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/backend-parity-matrix.md" rel="noopener noreferrer"&gt;Backend Parity Matrix&lt;/a&gt; — visibility and query behavior across Datadog, Grafana Cloud, and Honeycomb&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/glossary.md" rel="noopener noreferrer"&gt;Glossary / 用語集&lt;/a&gt; — English/Japanese OTel terminology used in this project&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/enterprise-workload-addendum.md" rel="noopener noreferrer"&gt;Enterprise Workload Addendum&lt;/a&gt; — SAP, VMware, and mission-critical workload considerations&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/storage-service-selection.md" rel="noopener noreferrer"&gt;Storage Service Selection Note&lt;/a&gt; — when to use FSx for ONTAP, Amazon S3, Amazon EFS, and Amazon EBS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OTLP is the stable producer contract&lt;/strong&gt;. Your Lambda speaks one protocol; the Collector handles backend-specific exporters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTel Collector is the routing and processing layer&lt;/strong&gt; that decouples log producers from observability backends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Lambda code changes&lt;/strong&gt; when switching or adding backends — verified with SHA-256 hash comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-backend delivery is a config change&lt;/strong&gt;, not a code change. Add 5 lines of YAML, restart the Collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three FSx for ONTAP event sources work&lt;/strong&gt;: FSx audit logs via S3 Access Point (Part 2), EMS webhooks (Part 3), and FPolicy file operations (Part 4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collector economics improve&lt;/strong&gt; as volume increases or vendors multiply — fixed Collector cost is amortized across all destinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with direct send&lt;/strong&gt; (Part 2) for simplicity. &lt;strong&gt;Graduate to the Collector&lt;/strong&gt; when you need multi-backend, vendor migration, or volume-based routing.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Why Your FSx for ONTAP Logs Deserve Better&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Shipping FSx for ONTAP Logs to Datadog — The Serverless Way&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing"&gt;FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: Escape Vendor Lock-in with OTel Collector (this post)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions about the OTel Collector pattern or multi-backend delivery? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing"&gt;Part 4 — FPolicy File Activity Pipeline&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Mon, 18 May 2026 02:31:34 +0000</pubDate>
      <link>https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing</link>
      <guid>https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;ONTAP FPolicy pushes file operation notifications over a persistent TCP connection. We run a lightweight Python server on ECS Fargate that receives these events, normalizes them, and forwards them to SQS → Lambda → Datadog. In my validation environment, create events reached Datadog in about 6 seconds. Rename/delete behavior depends on FPolicy mode, protocol, and FSx for ONTAP behavior, so this post documents both the working path and the limitations observed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update — production hardening path&lt;/strong&gt;&lt;br&gt;
This article remains the Datadog-specific introduction to the FPolicy file activity pipeline. Since publishing it, the repository has been expanded with production-readiness guidance, governance and security review checklists, sample payloads, CI policy, cfn-guard rules, and shared Python helpers for observability and idempotent object processing.&lt;/p&gt;

&lt;p&gt;For production planning, start from the repository README:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose Your Path&lt;/li&gt;
&lt;li&gt;Recommended first 30 minutes&lt;/li&gt;
&lt;li&gt;Production Readiness Levels&lt;/li&gt;
&lt;li&gt;PoC Success Criteria&lt;/li&gt;
&lt;li&gt;Security Review Checklist&lt;/li&gt;
&lt;li&gt;Governance and Compliance Guide&lt;/li&gt;
&lt;li&gt;CI Policy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The FPolicy pattern has also been expanded with Persistent Store guidance, idempotent object processing, EventBridge dispatch, and a hybrid polling/event-driven migration path. This Part 4 article focuses on the Datadog delivery path; the repository now documents the broader production baseline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why FPolicy Needs Fargate
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3&lt;/a&gt;, we showed how EMS webhooks deliver ARP alerts via API Gateway → Lambda. That works because EMS uses standard HTTPS.&lt;/p&gt;

&lt;p&gt;FPolicy is different. ONTAP's FPolicy subsystem uses a &lt;strong&gt;proprietary binary protocol over persistent TCP connections&lt;/strong&gt;. ONTAP initiates the connection to the FPolicy server and maintains it with periodic KeepAlive messages. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Lambda&lt;/strong&gt; — No persistent TCP connections, max 15-minute timeout&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;API Gateway&lt;/strong&gt; — HTTP/HTTPS only, no raw TCP&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;ECS Fargate&lt;/strong&gt; — Persistent TCP listener, private IP, auto-restart&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why I Did Not Use an NLB in This Validation
&lt;/h3&gt;

&lt;p&gt;I tested an NLB-based approach, but it did not work reliably in my validation. The issue was not that NLB cannot forward binary TCP traffic; it can. The challenge was FPolicy's stateful session negotiation and ONTAP's expectation of configured FPolicy server IPs. Health checks and connection behavior introduced additional complexity. For this validation, the simplest reliable path was to let ONTAP connect directly to the Fargate task's private IP and automate external-engine IP updates on task restart.&lt;/p&gt;

&lt;p&gt;The Fargate task runs a Python server that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Listens on TCP:9898&lt;/li&gt;
&lt;li&gt;Handles FPolicy protocol negotiation (version handshake)&lt;/li&gt;
&lt;li&gt;Receives KeepAlive messages (connection health)&lt;/li&gt;
&lt;li&gt;Parses file operation notifications&lt;/li&gt;
&lt;li&gt;Forwards structured events to SQS&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SMB/NFS Client
    │ file create/write/rename/delete
    ▼
FSx for ONTAP (FPolicy enabled)
    │ proprietary TCP protocol
    ▼
ECS Fargate (TCP:9898)
    │ parse → normalize → forward
    ▼
SQS Queue
    │ event source mapping
    ▼
Lambda (fpolicy_handler)
    │ format → ship
    ▼
Datadog Logs API v2 (source:fsxn-fpolicy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP connects TO Fargate&lt;/strong&gt; — the Fargate task must be reachable on a private IP. Because that IP can change on task restart, the ONTAP external engine must be updated automatically or operationally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQS decouples&lt;/strong&gt; the TCP server from the shipping logic — if Datadog is slow, events buffer in SQS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda handles Datadog shipping&lt;/strong&gt; — retry logic, batch formatting, API key management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No NLB&lt;/strong&gt; — ONTAP connects directly to the Fargate task's private IP&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Production Boundary: Why FPolicy Needs More Than Lambda
&lt;/h2&gt;

&lt;p&gt;The audit-log and EMS paths are natural fits for Lambda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit logs are file/object reads through the FSx for ONTAP S3 Access Point read path&lt;/li&gt;
&lt;li&gt;EMS events are HTTPS webhook payloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FPolicy is different. ONTAP FPolicy uses a persistent TCP connection to an external FPolicy server. That makes it a poor fit for API Gateway + Lambda as the first receiver.&lt;/p&gt;

&lt;p&gt;This is why the production-oriented path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ONTAP FPolicy
  → ECS Fargate TCP listener
  → SQS
  → Lambda shipper
  → Datadog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;FSx for ONTAP file system with a CIFS-enabled SVM&lt;/li&gt;
&lt;li&gt;VPC with private subnets (same as FSx for ONTAP)&lt;/li&gt;
&lt;li&gt;ECR repository with the FPolicy server image&lt;/li&gt;
&lt;li&gt;Private subnet egress for Fargate: either a NAT Gateway or VPC endpoints for ECR image pull, CloudWatch Logs, and SQS access&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy the Fargate Stack
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Fill in the empty parameter values for your environment before running.
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId= \
    SubnetIds= \
    FsxnSvmSecurityGroupId= \
    ContainerImage=.dkr.ecr..amazonaws.com/fsxn-fpolicy-server:latest \
  --capabilities CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS Cluster + Fargate Service (1 task)&lt;/li&gt;
&lt;li&gt;SQS Queue for FPolicy events&lt;/li&gt;
&lt;li&gt;Security Group (inbound TCP:9898 from FSx SG)&lt;/li&gt;
&lt;li&gt;CloudWatch Log Group&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Deploy the Datadog Shipping Lambda
&lt;/h3&gt;

&lt;p&gt;The template accepts the SQS queue ARN as a parameter and automatically creates the event source mapping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Get the SQS queue ARN from Step 1 outputs
SQS_ARN=$(aws cloudformation describe-stacks \
  --stack-name fsxn-fpolicy-server \
  --query "Stacks[0].Outputs[?OutputKey=='FPolicyQueueArn'].OutputValue" \
  --output text)

# Fill in the empty parameter values for your environment before running.
aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn= \
    DatadogSite=ap1.datadoghq.com \
    FPolicySqsQueueArn=${SQS_ARN} \
  --capabilities CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates the Lambda function with an SQS event source mapping, so no manual &lt;code&gt;create-event-source-mapping&lt;/code&gt; call is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Get the Fargate Task IP
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;TASK_ARN=$(aws ecs list-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --service-name fsxn-fpolicy-server-service \
  --query "taskArns[0]" --output text)

aws ecs describe-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --tasks "$TASK_ARN" \
  --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  ONTAP FPolicy Configuration
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CLI note&lt;/strong&gt;: Some ONTAP versions show these commands under &lt;code&gt;vserver fpolicy ...&lt;/code&gt;, while newer CLI contexts may allow shortened forms. Use the command form supported by your ONTAP version. The examples below use the form validated in my environment (FSx for ONTAP 9.17.1). See &lt;a href="https://docs.netapp.com/us-en/ontap-cli-9151/vserver-fpolicy-policy-external-engine-create.html" rel="noopener noreferrer"&gt;NetApp CLI reference&lt;/a&gt; for the full command syntax.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FPolicy requires three components: an External Engine (where to send events), an Event (what to monitor), and a Policy (linking them together).&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the External Engine
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vserver fpolicy policy external-engine create -vserver  \
  -engine-name fpolicy_aws_engine \
  -primary-servers  \
  -port 9898 \
  -extern-engine-type asynchronous \
  -ssl-option no-auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production note&lt;/strong&gt;: For production deployments, evaluate &lt;code&gt;server-auth&lt;/code&gt; or &lt;code&gt;mutual-auth&lt;/code&gt; instead of &lt;code&gt;no-auth&lt;/code&gt;, and validate certificate handling between ONTAP and the FPolicy server. See &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/create-fpolicy-external-engine-task.html" rel="noopener noreferrer"&gt;NetApp FPolicy external engine documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Create the FPolicy Event
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vserver fpolicy policy event create -vserver  \
  -event-name cifs_file_events \
  -protocol cifs \
  -file-operations create,write,rename,delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: For write-heavy workloads, review the protocol-specific FPolicy filters supported by your ONTAP version and protocol. Where supported, use close/modify-oriented filters to reduce duplicate or noisy write events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Create and Enable the Policy
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vserver fpolicy policy create -vserver  \
  -policy-name fpolicy_aws \
  -events cifs_file_events \
  -engine fpolicy_aws_engine \
  -is-mandatory false

vserver fpolicy enable -vserver  \
  -policy-name fpolicy_aws \
  -sequence-number 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This example uses an asynchronous, non-mandatory policy so client file operations are not blocked by FPolicy server processing or Datadog delivery. If the FPolicy server is unavailable, file operations continue unimpeded, but notifications may be buffered or lost depending on your ONTAP version and configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify Connection
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vserver fpolicy show-engine -vserver  -engine-name fpolicy_aws_engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You should see &lt;code&gt;connected&lt;/code&gt; status. In the ECS logs, KeepAlive messages confirm the connection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;[INFO] fpolicy-server: [+] Connection from ('10.0.x.x', 44107)
[INFO] fpolicy-server: [Handshake] Policy=fpolicy_aws | Session=... | VsUUID=...
[INFO] fpolicy-server: [Send] NEGO_RESP | Version=1.2 | Policy=fpolicy_aws
[INFO] fpolicy-server: [KeepAlive] Received — connection healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## E2E Validation Results

File operations on the SMB share produce events that flow through the entire pipeline:

| Operation | ECS Log | SQS | Lambda | Datadog | Latency |
|-----------|---------|-----|--------|---------|---------|
| create `blog_demo_create.txt` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |
| create `blog_demo_write.txt` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |
| create `confidential_report_2026.xlsx` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |

### ECS Fargate Logs — Connection Lifecycle

The FPolicy server logs show the complete lifecycle: server start → ONTAP connection → protocol handshake → KeepAlive → file events → SQS delivery.

![ECS Fargate CloudWatch Logs](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/aws-ecs-fpolicy-logs.png)

### Lambda CloudWatch Logs — Event Processing

Each SQS message triggers a Lambda invocation. Processing time is typically 30-50ms per event.

![Lambda CloudWatch Logs](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/aws-lambda-fpolicy-logs.png)
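
The format step can be sketched as follows. This is a minimal illustration, not the repository's `fpolicy_handler`: it assumes each SQS record body is one normalized JSON event and that the attributes are nested under an `attributes` key, which is what the `@attributes.*` queries in this post suggest. The function name `sqs_records_to_datadog_logs` is hypothetical.

```python
import json


def sqs_records_to_datadog_logs(event, service="fsxn-fpolicy"):
    """Convert an SQS-triggered Lambda event into a list of Datadog log entries.

    Hypothetical helper: assumes every record body is a JSON object carrying
    the FPolicy attributes (operation_type, file_path, client_ip, ...).
    """
    logs = []
    for record in event.get("Records", []):
        fpolicy_event = json.loads(record["body"])
        logs.append(
            {
                "ddsource": "fsxn-fpolicy",  # matches the query source:fsxn-fpolicy
                "service": service,
                # Short human-readable summary; missing fields fall back to "unknown".
                "message": "{} {}".format(
                    fpolicy_event.get("operation_type", "unknown"),
                    fpolicy_event.get("file_path", "unknown"),
                ),
                # Assumption: structured fields live under "attributes".
                "attributes": fpolicy_event,
            }
        )
    return logs
```

The returned list is the kind of JSON array the Datadog Logs intake accepts in a single request; batching and the API key header are left out here.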

### Datadog Log Explorer

Query: `source:fsxn-fpolicy`

Each event contains structured attributes:
- `operation_type`: The file operation (create, write, rename, delete)
- `file_path`: The file that was operated on
- `client_ip`: The client that performed the operation
- `volume_name`: The ONTAP volume
- `svm`: The ONTAP SVM name (may show "unknown" if not resolved from handshake context)
- `timestamp`: When the operation occurred

![FPolicy events in Datadog Log Explorer](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/datadog-fpolicy-full-path.png)

![FPolicy event detail — structured attributes visible in the side panel](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/datadog-fpolicy-detail.png)

## Correlating FPolicy with ARP

The real power emerges when you combine FPolicy file activity with ARP ransomware detection from Part 3:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source:(fsxn-fpolicy OR fsxn-ems) @attributes.svm:svm-prod-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This correlation query shows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ARP alert&lt;/strong&gt; (from EMS): "Ransomware detected on volume X"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File operations&lt;/strong&gt; (from FPolicy): Which user, from which IP, created/renamed which files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Together they answer the critical incident response questions: &lt;em&gt;What happened, who did it, and from where?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Use Case: Detecting Suspicious File Creation Bursts
&lt;/h3&gt;

&lt;p&gt;With FPolicy create events in Datadog, you can create a Monitor that fires when a single client creates more than 50 files in 5 minutes, a potential indicator of ransomware encryption or unauthorized bulk operations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog Monitor query:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logs("source:fsxn-fpolicy @attributes.operation_type:create").rollup("count").by("@attributes.client_ip").last("5m") &amp;gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Alert message:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚨 Suspicious file creation burst detected on FSx for ONTAP

Client IP: {{@attributes.client_ip}}
Volume: {{@attributes.volume_name}}
Count: {{value}} file creations in 5 minutes

Investigate immediately: check if this is authorized batch processing or potential ransomware activity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;gt; **Note on delete monitoring**: If your FPolicy configuration and ONTAP version reliably deliver delete events (e.g., synchronous mode or a future ONTAP release), you can extend this pattern to bulk deletion detection. In my async-mode validation, delete notifications were not reliably delivered — I recommend using audit logs from [Part 2](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c) for delete-event completeness.

Detection at this speed is difficult to achieve with traditional audit log polling, which depends on rotation and scheduler intervals. FPolicy's event-driven delivery makes sub-minute detection possible for the operations it reliably captures.

## Operational Considerations

### Fargate Task IP Changes

When a Fargate task restarts (deployment, crash, scaling), it gets a new private IP. ONTAP's External Engine must be updated with the new IP. Options:

1. **Manual update**: `vserver fpolicy policy external-engine modify -primary-servers &amp;lt;new-ip&amp;gt;`
2. **Automated**: Lambda triggered by ECS task state change → ONTAP REST API update

The repository includes a helper script (`shared/scripts/fpolicy-update-engine-ip.sh --auto`) that detects the current task IP and updates the ONTAP engine. For full automation, wire an EventBridge rule on ECS task state changes to an update Lambda — this is not included in the base stack but is straightforward to add. Automated updates require network reachability to the ONTAP management endpoint and credentials (stored in Secrets Manager) with permission to modify the FPolicy external engine.
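
A rough sketch of the automated option, assuming the update Lambda is triggered by an ECS Task State Change event. The helper names are hypothetical, and the REST path and `primary_servers` field are assumptions taken from the ONTAP REST API reference that should be verified against your ONTAP version. The sketch only builds the request; sending it (with credentials from Secrets Manager) is left out.

```python
import json


def private_ip_from_task_event(event):
    """Return the private IPv4 address of a RUNNING task, or None.

    Expects the EventBridge "ECS Task State Change" shape, where the ENI
    attachment carries name/value pairs such as privateIPv4Address.
    """
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "RUNNING":
        return None
    for attachment in detail.get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "privateIPv4Address":
                return item.get("value")
    return None


def build_engine_update(svm_uuid, engine_name, new_ip):
    """Describe the REST call that would point the external engine at new_ip.

    Assumption: the engine is exposed at
    /api/protocols/fpolicy/{svm.uuid}/engines/{name} and accepts a
    primary_servers list. Confirm both before relying on this.
    """
    return {
        "method": "PATCH",
        "path": f"/api/protocols/fpolicy/{svm_uuid}/engines/{engine_name}",
        "body": json.dumps({"primary_servers": [new_ip]}),
    }
```

Keeping the parsing and request-building steps free of network calls makes them easy to unit test before wiring in the real HTTPS client.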

### Restart Resilience — Validated

I tested the full restart cycle to confirm the pipeline recovers gracefully:

| Step | Result | Time |
|------|--------|------|
| Stop Fargate (scale to 0) | Task stopped | ~30s |
| Restart Fargate (scale to 1) | New task, new IP | ~45s |
| Update ONTAP Engine IP | Reconnection | ~20s |
| File operation after restart | Event delivered to Datadog | ~6s |
| **Total recovery time** | | **~2 minutes** |

The Lambda's retry logic also proved itself: on the first request after reconnection, a transient `RemoteDisconnected` error occurred. The exponential backoff retry succeeded on the second attempt — exactly the behavior we designed for.
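
For reference, retry with exponential backoff can be sketched like this. It is an illustration under stated assumptions, not the Lambda's actual code: `ship_with_retry` and the `send` callable are hypothetical, and the three-attempt limit simply mirrors the `attempt 1/3` in the log excerpt that follows.

```python
import time


def ship_with_retry(send, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(payload), retrying on ConnectionError with exponential backoff.

    `send` is a hypothetical callable that raises ConnectionError on transient
    failures (http.client.RemoteDisconnected is a ConnectionError subclass).
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(payload)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Re-raising after the last attempt lets the SQS event source mapping redeliver the message instead of silently dropping it.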

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;[WARNING] HTTP error shipping to Datadog (attempt 1/3): RemoteDisconnected
[INFO]    Processing complete: {"statusCode": 200, "body": {"shipped": 1}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### Cost Profile

| Component | Monthly Cost (estimate) |
|-----------|------------------------|
| Fargate (0.25 vCPU, 0.5 GB) | ~$10 |
| SQS (low volume) | &amp;lt; $1 |
| Lambda (event-driven) | &amp;lt; $1 |
| CloudWatch Logs | ~$2 |
| **Total** | **~$14/month** |

Compare this to an always-on EC2-based collector, plus OS patching, agent management, and HA considerations. Exact EC2 costs vary by region and instance type.

&amp;gt; This is an AWS-side estimate and excludes Datadog ingest/retention costs, NAT Gateway or VPC endpoint charges, ECR storage, and high-volume CloudWatch Logs.

### Scaling

A single Fargate task is sufficient for the low-volume validation scenarios in this post. The architecture can scale by tuning Fargate CPU/memory, SQS buffering, and Lambda concurrency, but you should benchmark your own workload before assuming a specific events/sec capacity.

### Monitoring

Key CloudWatch metrics to watch:
- `AWS/ECS` `CPUUtilization`: Fargate task health
- `AWS/SQS` `ApproximateNumberOfMessagesVisible`: queue depth (should stay near 0)
- `AWS/Lambda` `Errors`: shipping failures
- `AWS/Lambda` `Duration`: processing time per batch

## The FPolicy Server

The FPolicy server (`shared/fpolicy-server/fpolicy_server.py`) implements:

- **Protocol negotiation**: Responds to ONTAP's version handshake
- **KeepAlive handling**: Acknowledges connection health checks
- **Event parsing**: Extracts file path, operation, user, client IP from binary frames
- **SQS forwarding**: Sends normalized JSON events to the queue
- **Write coalescing**: Configurable delay to batch rapid write events (default: 5 seconds)

The server runs in `realtime` mode — events are forwarded as they arrive, with optional write-complete delay to avoid duplicate notifications for multi-write operations.
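
One way to picture write coalescing is a per-path quiet period: every new write to a path restarts its timer, and the event is released only after the path has been idle for the configured delay. The sketch below assumes that model; `WriteCoalescer` is a hypothetical class and does not mirror `fpolicy_server.py`.

```python
import time


class WriteCoalescer:
    """Hold write events per file path until the path has been quiet for delay_sec.

    Illustrative only. `clock` is injectable so the timing can be tested
    without sleeping.
    """

    def __init__(self, delay_sec=5.0, clock=time.monotonic):
        self.delay_sec = delay_sec
        self.clock = clock
        self._pending = {}  # file_path -> (last_seen, latest_event)

    def add(self, event):
        """Record a write; a newer write to the same path replaces the older one."""
        self._pending[event["file_path"]] = (self.clock(), event)

    def flush_ready(self):
        """Return (and forget) events whose path has been quiet for delay_sec."""
        now = self.clock()
        ready = [
            path
            for path, (last_seen, _) in self._pending.items()
            if now - last_seen >= self.delay_sec
        ]
        return [self._pending.pop(path)[1] for path in ready]
```

A server loop would call `add()` for each write notification and periodically forward whatever `flush_ready()` returns to SQS, so a burst of writes to one file produces a single downstream event.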

## Limitations and Future Work

### Rename/Delete Events Not Delivered in Async Mode

In my E2E testing, ONTAP did not deliver rename or delete notifications to the FPolicy server in asynchronous mode — even though these operations are configured in the FPolicy event definition. Only create events were reliably delivered. This appears to be a limitation of FSx for ONTAP's FPolicy implementation in async mode for certain operation types.

**Workaround options:**
- Use synchronous mode (adds latency to file operations — not recommended for production)
- Combine FPolicy (event-driven create) with audit log polling (catches rename/delete in EVTX)
- Accept create-only monitoring for event-driven alerting, use audit logs for forensic completeness

### NFS Protocol Support

| Protocol | FPolicy Support | Notes |
|----------|----------------|-------|
| SMB/CIFS | ✅ Verified | Primary validation protocol |
| NFSv3 | ✅ Supported | Requires explicit `vers=3` mount option |
| NFSv4.0 | ✅ Supported | Requires explicit `vers=4.0` |
| NFSv4.1 | ✅ Supported | Requires ONTAP 9.15.1+, explicit `vers=4.1` |
| NFSv4.2 | ❌ Not supported | ONTAP FPolicy does not monitor NFSv4.2 operations |

For protocol support details, verify your ONTAP version. NetApp [documents](https://kb.netapp.com/onprem/ontap/da/NAS/Does_ONTAP_support_FPolicy_for_NFS_4.2) that FPolicy does not currently support NFSv4.2; supported NFS protocols include NFSv3, NFSv4.0, and NFSv4.1 (ONTAP 9.15.1+).

**Critical gotcha:** `mount -o vers=4` on modern Linux negotiates to NFSv4.2, which ONTAP FPolicy does **not** support. Always use explicit version: `mount -o vers=4.1` or `vers=3`.
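
A small check along these lines can catch the problem before you start debugging missing events. It assumes the negotiated version shows up as `vers=` (or `nfsvers=`) in the mount options, for example in the fourth field of `/proc/mounts`; both function names are hypothetical.

```python
def nfs_version_from_options(options):
    """Return the value of vers= (or nfsvers=) from a comma-separated option string."""
    for option in options.split(","):
        key, _, value = option.strip().partition("=")
        if key in ("vers", "nfsvers") and value:
            return value
    return None


def fpolicy_can_monitor(options):
    """True when the mount uses an NFS version this post lists as supported by FPolicy."""
    return nfs_version_from_options(options) in ("3", "4.0", "4.1")
```

For example, `fpolicy_can_monitor("rw,relatime,vers=4.2")` is false, so that mount would need to be remounted with an explicit `vers=4.1` or `vers=3`.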

**NFS + FPolicy latency:** NFSv3 lacks close semantics, so the FPolicy server cannot know when a write is complete. The server uses a configurable `WRITE_COMPLETE_DELAY_SEC` (default: 5s) to wait before forwarding the event. This adds latency but prevents premature processing of incomplete files.

**NFS write hang (observed):** In some configurations, NFS write operations may hang when FPolicy is enabled — even with `is-mandatory=false`. This is a [known ONTAP behavior](https://kb.netapp.com/onprem/ontap/da/NAS/NFS_hung_slowness_issue_when_dealing_with_long_path_names_with_FPolicy_enabled) related to FPolicy notification processing. If you experience this, verify your ONTAP version and consider limiting FPolicy scope to specific volumes.

### User Identity

In the current implementation, the `user` field may be empty for some operations depending on ONTAP's FPolicy notification content. The FPolicy binary frame includes user identity in extended attributes that require additional parsing logic. Future versions will extract this from the NOTI_REQ body.

### Event Durability During Restarts

In my validation, events generated while the Fargate server was disconnected were not observed downstream in Datadog after reconnection. Treat FPolicy delivery during server outages as something you must validate in your own environment.

ONTAP [documentation](https://docs.netapp.com/us-en/ontap/nas-audit/synchronous-asynchronous-notifications-concept.html) describes buffering behavior for asynchronous notifications — notifications generated during a network outage are stored on the storage node and can be fetched when the server comes back online. Beginning with ONTAP 9.14.1, [FPolicy persistent store](https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html) support is available for asynchronous non-mandatory policies. If you cannot tolerate event loss during FPolicy server restarts, evaluate persistent store and validate the behavior on your FSx for ONTAP version.

## Try It Yourself

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Clone the repository
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
cd fsxn-observability-integrations

# Deploy prerequisites (if not already done)
# Fill in the empty parameter values for your environment before running.
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId= \
    SubnetIds= \
    FsxnSvmSecurityGroupId= \
    ContainerImage= \
  --capabilities CAPABILITY_NAMED_IAM

# Configure ONTAP FPolicy (see ONTAP section above)

# Create a file on the SMB share

# Check Datadog: source:fsxn-fpolicy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


## Where FPolicy Fits in ONTAP Telemetry

This series covers three ONTAP telemetry sources. Each serves a different purpose:

| Use Case | Best Source | Latency | Coverage |
|----------|-------------|---------|----------|
| Compliance audit trail | Audit logs ([Part 2](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c)) | Minutes (scheduler interval) | Complete historical record |
| Ransomware detection | ARP via EMS ([Part 3](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)) | ~30 seconds (webhook) | ML-based pattern detection |
| Event-driven file activity signal | FPolicy (this post) | ~6 seconds (TCP) | Create events validated; other operations depend on mode/version |
| Forensic investigation | Audit logs + FPolicy correlation | Combined | Timeline reconstruction |

**FPolicy is not a replacement for audit logs.** It provides an event-driven signal for detection and alerting. Audit logs provide the authoritative, complete historical record for compliance and forensics. Use them together.

## Key Takeaways

1. **Use Fargate for FPolicy TCP listener** — Lambda cannot maintain persistent TCP connections. Fargate provides the long-running listener without OS management.
2. **Use SQS to decouple ingestion from shipping** — If Datadog is slow or Lambda is throttled, events buffer safely in SQS.
3. **Validate operation coverage in your environment** — Async mode reliably delivered create events in my testing. Rename/delete behavior varies by ONTAP version and mode.
4. **Use audit logs for forensic completeness** — FPolicy provides event-driven signal for detection; audit logs (Part 2) provide the complete historical record.
5. **Treat FPolicy as event-driven alerting, not full audit replacement** — The two are complementary, not interchangeable.

## Production Considerations Beyond This Validation

This post validates the end-to-end path. For production deployments, the following topics warrant additional design work:

| Topic | Key Questions |
|-------|--------------|
| **HA / Multi-AZ** | ONTAP external engine supports `primary-servers` and `secondary-servers`. How to run multiple Fargate tasks across AZs? |
| **Scope Design** | Which volumes, operations, and protocols to monitor? How to avoid noisy workloads? |
| **Security Hardening** | TLS/mTLS for FPolicy, ECR image scanning, VPC Flow Logs, task role least-privilege |
| **Cost Model** | FPolicy generates events per file operation — Datadog ingest can become the dominant cost at scale |
| **Operations Runbook** | Task restart, engine disconnected, SQS backlog, Datadog missing events, NFS hang |
| **Stable Endpoint** | Auto-update Lambda for engine IP, or primary/secondary server design for zero-downtime restarts |

These topics are documented in the repository:

- **[Production Architecture Patterns](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-production-architecture-patterns.md)** — Single task, primary/secondary, auto-update, multi-AZ patterns with failure mode matrix
- **[Operational Guide](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-operational-guide.md)** — 4-layer health model, runbooks, IP reconciliation, synthetic health check
- **[PoC Checklist](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-poc-checklist.md)** — Preconditions, scope, validation steps, success criteria, go/no-go

Contributions and questions are welcome.

## Series Navigation

- **Part 1**: [Why Your FSx for ONTAP Logs Deserve Better](https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod)
- **Part 2**: [Shipping FSx for ONTAP Logs to Datadog, The Serverless Way](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c)
- **Part 3**: [Event-Driven Ransomware Detection with ONTAP ARP + Datadog](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)
- **Part 4**: FPolicy File Activity Pipeline (this post)

Coming next:
- **Splunk**: Replacing EC2 + Universal Forwarder with Lambda + HEC
- **OpenTelemetry**: The vendor-neutral escape hatch

---

*Questions about FPolicy or the Fargate architecture? Drop a comment below.*

*Previous: [Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)*

**GitHub**: [github.com/Yoshiki0705/fsxn-observability-integrations](https://github.com/Yoshiki0705/fsxn-observability-integrations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>datadog</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>Operational Hardening — Guardrails, Secrets Rotation &amp; SLO — FSx ONTAP S3AP Phase 12</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 18:21:39 +0000</pubDate>
      <link>https://dev.to/aws-builders/operational-hardening-guardrails-secrets-rotation-slo-fsx-ontap-s3ap-phase-12-1k4o</link>
      <guid>https://dev.to/aws-builders/operational-hardening-guardrails-secrets-rotation-slo-fsx-ontap-s3ap-phase-12-1k4o</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.&lt;/p&gt;

&lt;p&gt;Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Phase 12&lt;/strong&gt; of the FSx for ONTAP S3AP serverless pattern library. Building on &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt; and &lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Phase 11&lt;/a&gt;, Phase 12 delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Guardrails&lt;/strong&gt;: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Rotation&lt;/strong&gt;: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on 90-day interval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic Monitoring&lt;/strong&gt;: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Forecasting&lt;/strong&gt;: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Lineage Tracking&lt;/strong&gt;: DynamoDB table with GSI for processing history and opt-in integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf TCP Framing&lt;/strong&gt;: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO Definition&lt;/strong&gt;: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Pipeline E2E&lt;/strong&gt;: NFS file creation → FPolicy → SQS delivery confirmed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store Replay&lt;/strong&gt;: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Property-Based Testing&lt;/strong&gt;: 16 Hypothesis properties, 53 tests, 3 bugs discovered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point Deep Dive&lt;/strong&gt;: Multi-layer authorization, IAM ARN format, VPC network constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key metrics&lt;/strong&gt;: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
    A[Auto-Expand Request] --&amp;gt; B{GuardrailMode?}
    B --&amp;gt;|DRY_RUN| C[Log + Allow&amp;lt;br/&amp;gt;fail-open on DDB error]
    B --&amp;gt;|ENFORCE| D[Check + Block&amp;lt;br/&amp;gt;fail-closed on DDB error]
    B --&amp;gt;|BREAK_GLASS| E[Bypass All Checks&amp;lt;br/&amp;gt;SNS Alert + Audit Log]
    C --&amp;gt; F[DynamoDB Tracking]
    D --&amp;gt; F
    E --&amp;gt; F
    F --&amp;gt; G[CloudWatch EMF Metrics]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior on Check Failure&lt;/th&gt;
&lt;th&gt;Behavior on DynamoDB Error&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DRY_RUN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Log warning, allow action&lt;/td&gt;
&lt;td&gt;Fail-open (allow)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ENFORCE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Block action, emit metric&lt;/td&gt;
&lt;td&gt;Fail-closed (deny)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BREAK_GLASS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip all checks&lt;/td&gt;
&lt;td&gt;SNS alert + audit log&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
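
&lt;p&gt;The mode table can be read as a small decision function. The following is a minimal sketch and not the repository's implementation: &lt;code&gt;Mode&lt;/code&gt;, &lt;code&gt;Decision&lt;/code&gt;, and &lt;code&gt;decide&lt;/code&gt; are hypothetical names, and the BREAK_GLASS branch assumes that the bypass applies regardless of whether the tracking store is reachable.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DRY_RUN = "DRY_RUN"
    ENFORCE = "ENFORCE"
    BREAK_GLASS = "BREAK_GLASS"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    alert: bool  # True when an SNS alert and audit entry should be emitted


def decide(mode: Mode, check_passed: bool, store_error: bool) -> Decision:
    """Map (mode, check result, tracking-store health) to an allow/deny decision."""
    if mode is Mode.BREAK_GLASS:
        # Assumption: bypass every check, but always alert and audit.
        return Decision(allowed=True, alert=True)
    if mode is Mode.DRY_RUN:
        # Fail-open: violations are only logged, never blocked.
        return Decision(allowed=True, alert=False)
    # ENFORCE: fail-closed on a store error, otherwise follow the check result.
    if store_error:
        return Decision(allowed=False, alert=False)
    return Decision(allowed=check_passed, alert=False)
```

&lt;p&gt;Keeping the decision pure (no I/O) makes the fail-open and fail-closed paths easy to cover with unit or property tests.&lt;/p&gt;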

&lt;h3&gt;
  
  
  Core implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.guardrails&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CapacityGuardrail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GuardrailMode&lt;/span&gt;

&lt;span class="n"&gt;guardrail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CapacityGuardrail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Mode from GUARDRAIL_MODE env var
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guardrail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_and_execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;action_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume_grow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requested_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;execute_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_grow_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;volume_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vol-abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action executed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action denied: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Three safety checks (ENFORCE mode)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit&lt;/strong&gt;: Max 10 actions per day per action type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily cap&lt;/strong&gt;: Max 500 GB cumulative expansion per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooldown&lt;/strong&gt;: 300-second minimum interval between actions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All thresholds are configurable via environment variables (&lt;code&gt;GUARDRAIL_RATE_LIMIT&lt;/code&gt;, &lt;code&gt;GUARDRAIL_DAILY_CAP_GB&lt;/code&gt;, &lt;code&gt;GUARDRAIL_COOLDOWN_SECONDS&lt;/code&gt;).&lt;/p&gt;
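
&lt;p&gt;As a rough illustration of how the three checks could be evaluated, here is a sketch under stated assumptions. The function name &lt;code&gt;first_violation&lt;/code&gt;, the constant names, and the order in which the checks run are hypothetical; only the default values and the reason strings come from this article.&lt;/p&gt;

```python
from datetime import datetime
from typing import Optional

# Defaults taken from the list above; the constant names are hypothetical.
RATE_LIMIT = 10
DAILY_CAP_GB = 500.0
COOLDOWN_SECONDS = 300


def first_violation(
    action_count: int,
    daily_total_gb: float,
    last_action: Optional[datetime],
    requested_gb: float,
    now: datetime,
) -> Optional[str]:
    """Return the first violated check, or None when the action may proceed."""
    if action_count >= RATE_LIMIT:
        return "rate_limit_exceeded"
    if daily_total_gb + requested_gb > DAILY_CAP_GB:
        return "daily_cap_exceeded"
    if last_action is not None:
        elapsed = (now - last_action).total_seconds()
        if COOLDOWN_SECONDS > elapsed:
            return "cooldown_active"
    return None
```

&lt;p&gt;In this sketch the daily cap is checked against the total after the requested expansion, so a request that would cross the cap is rejected before it runs.&lt;/p&gt;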

&lt;h3&gt;
  
  
  DynamoDB tracking schema
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Action type (e.g., &lt;code&gt;volume_grow&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Date (&lt;code&gt;YYYY-MM-DD&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;daily_total_gb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Cumulative GB expanded today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;action_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Number of actions today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;last_action_ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;ISO timestamp of last action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;actions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List&lt;/td&gt;
&lt;td&gt;Audit trail of all actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ttl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;30-day auto-expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-dynamodb-guardrails-table.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-dynamodb-guardrails-table.png" alt="DynamoDB Guardrails Table" width="800" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BREAK_GLASS production considerations
&lt;/h3&gt;

&lt;p&gt;In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.&lt;/p&gt;
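
&lt;p&gt;The automatic-revert option could look roughly like the following. This is a sketch of one possible hardening step and is not part of the Phase 12 code as described; &lt;code&gt;effective_mode&lt;/code&gt; and its parameters are hypothetical names.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def effective_mode(
    configured_mode: str,
    break_glass_since: datetime,
    max_duration: timedelta,
    now: datetime,
) -> str:
    """Return ENFORCE once a BREAK_GLASS window has outlived its allowed duration."""
    if configured_mode != "BREAK_GLASS":
        return configured_mode
    if now - break_glass_since >= max_duration:
        return "ENFORCE"
    return "BREAK_GLASS"
```

&lt;p&gt;Evaluating the expiry on every call avoids depending on a separate scheduled job to switch the mode back.&lt;/p&gt;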




&lt;h2&gt;
  
  
  2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM-&amp;gt;&amp;gt;Lambda: Step 1: createSecret
    Lambda-&amp;gt;&amp;gt;SM: Generate new password, store as AWSPENDING

    SM-&amp;gt;&amp;gt;Lambda: Step 2: setSecret
    Lambda-&amp;gt;&amp;gt;ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP--&amp;gt;&amp;gt;Lambda: 200 OK

    SM-&amp;gt;&amp;gt;Lambda: Step 3: testSecret
    Lambda-&amp;gt;&amp;gt;ONTAP: GET /api/cluster (using new password)
    ONTAP--&amp;gt;&amp;gt;Lambda: 200 OK (cluster UUID returned)

    SM-&amp;gt;&amp;gt;Lambda: Step 4: finishSecret
    Lambda-&amp;gt;&amp;gt;SM: Promote AWSPENDING → AWSCURRENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
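
&lt;p&gt;Secrets Manager invokes the rotation Lambda once per step and passes &lt;code&gt;Step&lt;/code&gt;, &lt;code&gt;SecretId&lt;/code&gt;, and &lt;code&gt;ClientRequestToken&lt;/code&gt; in the event. A minimal dispatcher for that contract might look like this; the &lt;code&gt;dispatch&lt;/code&gt; function and the handler mapping are hypothetical, and the actual ONTAP and Secrets Manager calls are deliberately left to the injected handlers.&lt;/p&gt;

```python
from typing import Callable, Dict

Handler = Callable[[str, str], None]  # (secret_id, client_request_token)

STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def dispatch(event: dict, handlers: Dict[str, Handler]) -> str:
    """Route one Secrets Manager rotation invocation to the handler for its step."""
    step = event["Step"]
    if step not in STEPS:
        raise ValueError(f"Unknown rotation step: {step}")
    handlers[step](event["SecretId"], event["ClientRequestToken"])
    return step
```

&lt;p&gt;Injecting the handlers keeps the routing testable without network access to ONTAP or AWS.&lt;/p&gt;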



&lt;h3&gt;
  
  
  Key design decisions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VPC deployment&lt;/strong&gt;: Lambda must be in the same VPC as the ONTAP management LIF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90-day interval&lt;/strong&gt;: Configurable via CloudFormation parameter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Step 3 (&lt;code&gt;testSecret&lt;/code&gt;) verifies the new password works by calling the ONTAP cluster API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback safety&lt;/strong&gt;: If &lt;code&gt;testSecret&lt;/code&gt; fails, the old password remains as AWSCURRENT&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bugs discovered during live testing
&lt;/h3&gt;

&lt;p&gt;Three bugs were found and fixed during the actual rotation execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWSPENDING empty check&lt;/strong&gt;: &lt;code&gt;createSecret&lt;/code&gt; must handle the case where &lt;code&gt;get_secret_value(VersionStage='AWSPENDING')&lt;/code&gt; raises &lt;code&gt;ResourceNotFoundException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;management_ip fallback&lt;/strong&gt;: The Lambda must support both &lt;code&gt;management_ip&lt;/code&gt; (new) and &lt;code&gt;ontap_mgmt_ip&lt;/code&gt; (legacy) keys in the secret JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster UUID validation&lt;/strong&gt;: &lt;code&gt;testSecret&lt;/code&gt; now validates the response contains a valid &lt;code&gt;uuid&lt;/code&gt; field, not just HTTP 200&lt;/li&gt;
&lt;/ol&gt;
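
&lt;p&gt;The second and third fixes can be sketched as two small helpers. The function names are hypothetical and the real Lambda may structure this differently; the key names &lt;code&gt;management_ip&lt;/code&gt;, &lt;code&gt;ontap_mgmt_ip&lt;/code&gt;, and &lt;code&gt;uuid&lt;/code&gt; are the ones mentioned in the list.&lt;/p&gt;

```python
import uuid
from typing import Any, Mapping


def resolve_management_ip(secret: Mapping[str, Any]) -> str:
    """Prefer the new key and fall back to the legacy one."""
    ip = secret.get("management_ip") or secret.get("ontap_mgmt_ip")
    if not ip:
        raise KeyError("secret has neither management_ip nor ontap_mgmt_ip")
    return str(ip)


def is_valid_cluster_response(status: int, body: Mapping[str, Any]) -> bool:
    """Accept only HTTP 200 with a parseable UUID in the response body."""
    if status != 200:
        return False
    try:
        uuid.UUID(str(body.get("uuid", "")))
    except ValueError:
        return False
    return True
```

&lt;p&gt;Checking the UUID guards against a proxy or error page that answers 200 without being the cluster API.&lt;/p&gt;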

&lt;h3&gt;
  
  
  Verification result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operational note
&lt;/h3&gt;

&lt;p&gt;Rotating &lt;code&gt;fsxadmin&lt;/code&gt; affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default, so ensure that the rotation Lambda's &lt;code&gt;urllib3&lt;/code&gt; or &lt;code&gt;requests&lt;/code&gt; configuration handles certificate verification appropriately (see &lt;code&gt;shared/ontap_client.py&lt;/code&gt; for the pattern used in this project).&lt;/p&gt;

&lt;p&gt;For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing &lt;code&gt;fsxadmin&lt;/code&gt; across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Synthetic Monitoring — CloudWatch Synthetics Canary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP Health Check&lt;/strong&gt;: REST API call to the management endpoint (VPC-internal)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point Check&lt;/strong&gt;: ListObjectsV2 against the S3AP alias&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Critical finding: network-origin and endpoint configuration matter
&lt;/h3&gt;

&lt;p&gt;During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.&lt;/p&gt;

&lt;p&gt;This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.&lt;/p&gt;

&lt;p&gt;In this Phase 12 environment (Internet-origin S3 AP), the two checks behaved differently from the VPC-internal Canary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Observed requirement in this environment&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP REST API&lt;/td&gt;
&lt;td&gt;VPC-internal access to management LIF&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3AP health check&lt;/td&gt;
&lt;td&gt;Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy&lt;/td&gt;
&lt;td&gt;⚠️ Timed out from the initial VPC Canary configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Split into two monitoring paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONTAP health: VPC-internal Canary (confirmed working, 88ms response)&lt;/li&gt;
&lt;li&gt;S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is documented as a critical constraint in &lt;code&gt;docs/guides/s3ap-fsxn-specification.md&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canary runtime version lesson
&lt;/h3&gt;

&lt;p&gt;The template initially specified &lt;code&gt;syn-python-selenium-3.0&lt;/code&gt;, which was deprecated on 2026-02-03, so it now specifies &lt;code&gt;syn-python-selenium-11.0&lt;/code&gt;. CloudWatch Synthetics runtimes are deprecated frequently; parameterize the version or keep defaults current.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS builder lesson: VPC placement is a design choice
&lt;/h3&gt;

&lt;p&gt;A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-internet.html" rel="noopener noreferrer"&gt;connected to a VPC&lt;/a&gt;, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ztvwmwi58ki19r00lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ztvwmwi58ki19r00lr.png" alt="Synthetics Canary" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Capacity Forecasting — Linear Regression with stdlib Only
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A Lambda function running on a daily EventBridge schedule:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetches 30 days of FSx &lt;code&gt;StorageUsed&lt;/code&gt; metrics from CloudWatch&lt;/li&gt;
&lt;li&gt;Performs linear regression using only the Python standard library (zero external dependencies)&lt;/li&gt;
&lt;li&gt;Publishes &lt;code&gt;DaysUntilFull&lt;/code&gt; as a CloudWatch custom metric&lt;/li&gt;
&lt;li&gt;Sends SNS alert when forecast drops below threshold (default: 30 days)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Linear regression implementation (stdlib only)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;linear_regression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Least-squares linear regression using only math module.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Need at least 2 data points for regression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
        &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

    &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;denominator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1e-10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
    &lt;span class="n"&gt;intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intercept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Edge cases handled
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;DaysUntilFull&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 2 data points&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Insufficient data, no prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;slope ≤ 0 (shrinking/flat)&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Never fills up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already over capacity&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Immediate alert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very low usage (0.03%)&lt;/td&gt;
&lt;td&gt;169,374&lt;/td&gt;
&lt;td&gt;Normal — far future prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
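
&lt;p&gt;The edge cases in the table can be expressed as a single function. The following is a sketch under assumptions: the name &lt;code&gt;days_until_full&lt;/code&gt;, the use of the latest sample (rather than the fitted line) as the current usage, rounding up with &lt;code&gt;math.ceil&lt;/code&gt;, and checking "already over capacity" before the slope are all choices made here for illustration.&lt;/p&gt;

```python
import math
from typing import Sequence, Tuple


def days_until_full(points: Sequence[Tuple[float, float]], capacity_gb: float) -> int:
    """points are (day_index, used_gb) pairs; returns -1 when no forecast applies."""
    n = len(points)
    if 2 > n:
        return -1  # insufficient data
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    denominator = n * sum_x2 - sum_x * sum_x
    if 1e-10 > abs(denominator):
        return -1  # all samples share one x value, so no slope can be fitted
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    latest_used = points[-1][1]
    if latest_used >= capacity_gb:
        return 0  # already at or over capacity
    if 0 >= slope:
        return -1  # flat or shrinking usage never fills the volume
    return math.ceil((capacity_gb - latest_used) / slope)
```

&lt;p&gt;For example, samples of 100, 110, and 120 GB on consecutive days against a 200 GB capacity give a slope of 10 GB per day and a forecast of 8 days.&lt;/p&gt;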

&lt;h3&gt;
  
  
  Live verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"days_until_full"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;169374&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_usage_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_capacity_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1024.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"growth_rate_gb_per_day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"forecast_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2490-02-06T06:26:42Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test environment sits at 0.03% usage and grows by roughly 0.006 GB per day, so a forecast of 169,374 days is the expected output of the linear extrapolation rather than a bug. The 30-day alert threshold keeps notifications from firing until action is genuinely needed.&lt;/p&gt;

&lt;p&gt;This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat &lt;code&gt;DaysUntilFull&lt;/code&gt; as an early-warning signal, not an exact prediction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-lambda-capacity-forecast.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-lambda-capacity-forecast.png" alt="Capacity Forecast Lambda" width="800" height="1105"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Data Lineage Tracking — DynamoDB with GSI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A DynamoDB table with one Global Secondary Index (GSI) supports three query patterns: a primary-key lookup, a GSI query, and a scan with a filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key&amp;lt;br/&amp;gt;SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index&amp;lt;br/&amp;gt;PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] --&amp;gt;|PK lookup| PK
    Q2[Query by UC + time range] --&amp;gt;|GSI query| GSI
    Q3[Query by execution ARN] --&amp;gt;|Scan + filter| PK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For high-volume environments, consider adding a dedicated GSI on &lt;code&gt;step_functions_execution_arn&lt;/code&gt;. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.&lt;/p&gt;
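
&lt;p&gt;As a rough sketch, the three access patterns map to request parameters like the ones below. The table, index, and attribute names come from the diagram; the function names are hypothetical, and the sketch assumes every key attribute is stored as a DynamoDB string.&lt;/p&gt;

```python
TABLE_NAME = "fsxn-s3ap-data-lineage"  # from the diagram above
GSI_NAME = "uc_id-timestamp-index"     # from the diagram above


def by_file(source_file_key: str) -> dict:
    """Q1: primary-key lookup for every processing run of one file."""
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "source_file_key = :key",
        "ExpressionAttributeValues": {":key": {"S": source_file_key}},
    }


def by_uc_and_time(uc_id: str, start_iso: str, end_iso: str) -> dict:
    """Q2: GSI query for one UC within an inclusive timestamp range."""
    return {
        "TableName": TABLE_NAME,
        "IndexName": GSI_NAME,
        "KeyConditionExpression": (
            "uc_id = :uc AND processing_timestamp BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":uc": {"S": uc_id},
            ":start": {"S": start_iso},
            ":end": {"S": end_iso},
        },
    }


def by_execution_arn(execution_arn: str) -> dict:
    """Q3: scan + filter; every item is read, then filtered."""
    return {
        "TableName": TABLE_NAME,
        "FilterExpression": "step_functions_execution_arn = :arn",
        "ExpressionAttributeValues": {":arn": {"S": execution_arn}},
    }
```

&lt;p&gt;The first two dictionaries can be passed to &lt;code&gt;boto3.client("dynamodb").query(**params)&lt;/code&gt; and the third to &lt;code&gt;scan(**params)&lt;/code&gt;. Because a scan reads the whole table and returns results in pages, the third pattern is the one that degrades first as the table grows.&lt;/p&gt;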

&lt;h3&gt;
  
  
  Integration helper (opt-in)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.lineage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LineageTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LineageRecord&lt;/span&gt;

&lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LineageTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LineageRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_file_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/vol1/legal/contracts/deal-001.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;processing_timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-16T14:30:45.123Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_functions_execution_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:states:...:execution:...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uc_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;legal-compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://output-bucket/legal/reports/deal-001-analysis.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4523&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lineage_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-blocking&lt;/strong&gt;: Write failures emit a warning log but never interrupt the main processing pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL&lt;/strong&gt;: 365-day auto-expiry via DynamoDB TTL, configurable through the &lt;code&gt;LINEAGE_TTL_DAYS&lt;/code&gt; environment variable. Regulated environments may require 7+ years of retention; in that case, disable TTL and use S3 export for long-term storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opt-in&lt;/strong&gt;: UCs integrate by importing the helper — no mandatory coupling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PAY_PER_REQUEST&lt;/strong&gt;: No capacity planning needed for variable workloads&lt;/li&gt;
&lt;/ul&gt;
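
&lt;p&gt;The non-blocking and TTL principles can be illustrated with a small sketch. The helper names &lt;code&gt;ttl_epoch&lt;/code&gt; and &lt;code&gt;write_non_blocking&lt;/code&gt; are hypothetical and are not part of &lt;code&gt;shared.lineage&lt;/code&gt;; the only names taken from the article are &lt;code&gt;LINEAGE_TTL_DAYS&lt;/code&gt; and the 365-day default. DynamoDB TTL expects the expiry as a Unix epoch timestamp in seconds.&lt;/p&gt;

```python
import logging
import os
import time
from typing import Callable, Optional

logger = logging.getLogger("lineage")
SECONDS_PER_DAY = 86400


def ttl_epoch(now: Optional[float] = None, ttl_days: Optional[int] = None) -> int:
    """Return the expiry time as epoch seconds, the format DynamoDB TTL reads."""
    if ttl_days is None:
        ttl_days = int(os.environ.get("LINEAGE_TTL_DAYS", "365"))
    if now is None:
        now = time.time()
    return int(now) + ttl_days * SECONDS_PER_DAY


def write_non_blocking(put_item: Callable[[dict], None], item: dict) -> bool:
    """Attempt a lineage write; log and swallow any failure."""
    try:
        put_item(item)
        return True
    except Exception as exc:  # deliberately broad: lineage must not break the pipeline
        logger.warning(
            "lineage write failed for %s: %s", item.get("source_file_key"), exc
        )
        return False
```

&lt;p&gt;Returning a boolean instead of raising lets a caller count dropped lineage records as a metric without changing the processing path.&lt;/p&gt;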

&lt;h3&gt;
  
  
  Future: compliance-grade lineage (v2)
&lt;/h3&gt;

&lt;p&gt;For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future &lt;code&gt;LineageRecord&lt;/code&gt; v2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input_checksum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 of source file for integrity verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;output_checksum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 of generated output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fpolicy_sequence_number&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ONTAP-assigned sequence for ordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;policy_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FPolicy policy configuration version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uc_template_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UC CloudFormation template version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;guardrail_mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Active guardrail mode at processing time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retention_profile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retention class for compliance tiering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Protobuf TCP Framing — Adaptive Reader
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing &lt;code&gt;read_fpolicy_message()&lt;/code&gt; assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;An adaptive &lt;code&gt;ProtobufFrameReader&lt;/code&gt; that supports three framing modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    A[Incoming TCP Stream] --&amp;gt; B{FramingMode}
    B --&amp;gt;|AUTO_DETECT| C[Probe first 4 bytes]
    C --&amp;gt;|Valid uint32 length| D[LENGTH_PREFIXED]
    C --&amp;gt;|Otherwise| E[FRAMELESS]
    B --&amp;gt;|LENGTH_PREFIXED| D
    B --&amp;gt;|FRAMELESS| E
    D --&amp;gt; F[4-byte big-endian header → payload]
    E --&amp;gt; G[varint-delimited → payload]
    F --&amp;gt; H[Decoded Message]
    G --&amp;gt; H
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Three modes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Wire Format&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LENGTH_PREFIXED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-byte big-endian length + payload&lt;/td&gt;
&lt;td&gt;XML mode (legacy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FRAMELESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;varint-delimited protobuf&lt;/td&gt;
&lt;td&gt;Protobuf mode (ONTAP 9.15.1+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AUTO_DETECT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Probe first bytes, then lock mode&lt;/td&gt;
&lt;td&gt;Unknown/mixed environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Auto-detection heuristic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_auto_detect_and_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Probe first 4 bytes to determine framing mode.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readexactly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peek&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;candidate_length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max_message_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Valid length header → LENGTH_PREFIXED
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detected_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LENGTH_PREFIXED&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readexactly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Not a valid length → FRAMELESS (varint-delimited)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detected_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FRAMELESS&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;peek&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read_varint_delimited&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
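
&lt;p&gt;For readers unfamiliar with varint-delimited framing, here is a minimal, synchronous sketch of what the &lt;code&gt;FRAMELESS&lt;/code&gt; branch has to do with buffered bytes. It is not the body of &lt;code&gt;_read_varint_delimited()&lt;/code&gt;; the function names are hypothetical. It uses modulo and multiplication where implementations usually use bit masks and shifts; the two forms are equivalent.&lt;/p&gt;

```python
MAX_VARINT_BYTES = 10  # a 64-bit protobuf varint never needs more


def decode_varint(buf: bytes, offset: int = 0):
    """Return (value, next_offset), or None if buf ends mid-varint."""
    chunk = buf[offset:offset + MAX_VARINT_BYTES]
    value = 0
    for i, byte in enumerate(chunk):
        value += (byte % 128) * (128 ** i)  # low 7 bits, least significant group first
        if 128 > byte:  # high bit clear marks the final byte
            return value, offset + i + 1
    if len(chunk) == MAX_VARINT_BYTES:
        raise ValueError("varint longer than 10 bytes")
    return None  # need more data


def split_varint_delimited(buf: bytes, max_message_size: int = 1024 * 1024):
    """Split buf into complete payloads plus the unconsumed remainder."""
    messages = []
    offset = 0
    while True:
        header = decode_varint(buf, offset)
        if header is None:
            break
        length, body_start = header
        if length > max_message_size:
            raise ValueError("declared length exceeds max_message_size")
        body_end = body_start + length
        if body_end > len(buf):
            break  # payload not fully received yet
        messages.append(buf[body_start:body_end])
        offset = body_end
    return messages, buf[offset:]
```

&lt;p&gt;A stream reader would append newly received bytes to the remainder and call &lt;code&gt;split_varint_delimited&lt;/code&gt; again, which is why the 4 probed bytes have to be pushed back into the buffer in the auto-detect path.&lt;/p&gt;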



&lt;h3&gt;
  
  
  Safety features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max message size enforcement&lt;/strong&gt; (default 1 MB): Prevents DoS via malformed length headers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FramingError exception&lt;/strong&gt;: Structured error with offset and raw data for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful EOF handling&lt;/strong&gt;: Returns &lt;code&gt;None&lt;/code&gt; on connection close without raising&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration with existing FPolicy server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.integrations.protobuf_integration&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_fpolicy_reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_fpolicy_message_v2&lt;/span&gt;

&lt;span class="c1"&gt;# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_fpolicy_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;read_fpolicy_message_v2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 13 protobuf validation scope
&lt;/h3&gt;

&lt;p&gt;The following points will be confirmed with NetApp support during live wire validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)&lt;/li&gt;
&lt;li&gt;Message boundary behavior under high throughput&lt;/li&gt;
&lt;li&gt;Keep-alive behavior in protobuf mode vs XML mode&lt;/li&gt;
&lt;li&gt;Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?&lt;/li&gt;
&lt;li&gt;Mixed-mode migration path (XML → protobuf transition without event loss)&lt;/li&gt;
&lt;li&gt;Maximum message size guidance from ONTAP side&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. SLO Definition — 4 Targets with CloudWatch Dashboard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Four SLO targets covering the critical path of the event-driven pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLO&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;SLO met when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Event Ingestion Latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EventIngestionLatency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P99 &amp;lt; 5,000 ms&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Success Rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ProcessingSuccessRate_pct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt; 99.5%&lt;/td&gt;
&lt;td&gt;GreaterThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reconnect Time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FPolicyReconnectTime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 sec&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay Completion Time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReplayCompletionTime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 300 sec (5 min)&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For success rate, the CloudWatch Alarm fires when the metric drops &lt;em&gt;below&lt;/em&gt; 99.5% (ComparisonOperator: &lt;code&gt;LessThanThreshold&lt;/code&gt;), even though the SLO target is expressed as "&amp;gt; 99.5%".&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch Dashboard
&lt;/h3&gt;

&lt;p&gt;The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.slo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SLO_TARGETS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate_slos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_dashboard_widgets&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate all SLOs programmatically
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_slos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cloudwatch_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;met&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VIOLATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slo_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (value=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, threshold=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate dashboard widget JSON for CloudFormation
&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_dashboard_widgets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Alarm-based violation detection
&lt;/h3&gt;

&lt;p&gt;Each SLO has a corresponding CloudWatch Alarm:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alarm Name&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-ingestion-latency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-success-rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-reconnect-time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-replay-completion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.&lt;/p&gt;
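
&lt;p&gt;The alarm definitions can be sketched as parameters for CloudWatch's &lt;code&gt;put_metric_alarm&lt;/code&gt; call. Metric names, thresholds, alarm names, and the three evaluation periods come from the tables above; the pairing of alarm names to metrics is inferred from their order. The namespace, the period, the &lt;code&gt;Average&lt;/code&gt; statistic, and the mirrored operator for the "less than" SLOs are assumptions, since the article only states the operator for success rate.&lt;/p&gt;

```python
SLO_ALARMS = [
    # (alarm name, metric name, threshold, "SLO met when" operator, statistic)
    ("fsxn-s3ap-slo-ingestion-latency", "EventIngestionLatency_ms", 5000.0, "LessThanThreshold", "p99"),
    ("fsxn-s3ap-slo-success-rate", "ProcessingSuccessRate_pct", 99.5, "GreaterThanThreshold", "Average"),
    ("fsxn-s3ap-slo-reconnect-time", "FPolicyReconnectTime_sec", 30.0, "LessThanThreshold", "Average"),
    ("fsxn-s3ap-slo-replay-completion", "ReplayCompletionTime_sec", 300.0, "LessThanThreshold", "Average"),
]

# An alarm fires on the violation, so its operator mirrors the SLO's.
VIOLATION_OPERATOR = {
    "GreaterThanThreshold": "LessThanThreshold",
    "LessThanThreshold": "GreaterThanThreshold",  # assumed by symmetry
}


def alarm_params(name, metric, threshold, met_when, statistic, sns_topic_arn,
                 namespace="FSxN-S3AP/SLO", period_seconds=300):
    """Build keyword arguments for one SLO alarm (namespace and period are placeholders)."""
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Period": period_seconds,
        "EvaluationPeriods": 3,  # "3 consecutive periods"
        "Threshold": threshold,
        "ComparisonOperator": VIOLATION_OPERATOR[met_when],
        "AlarmActions": [sns_topic_arn],
    }
    if statistic.startswith("p"):
        params["ExtendedStatistic"] = statistic  # percentiles use ExtendedStatistic
    else:
        params["Statistic"] = statistic
    return params
```

&lt;p&gt;Each dictionary can be passed to &lt;code&gt;boto3.client("cloudwatch").put_metric_alarm(**params)&lt;/code&gt;, with &lt;code&gt;sns_topic_arn&lt;/code&gt; set to the aggregated topic.&lt;/p&gt;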

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw6nv9al1lm7hzzq8exu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw6nv9al1lm7hzzq8exu.png" alt="SLO Dashboard" width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. FPolicy Pipeline E2E Verification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  The verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS-&amp;gt;&amp;gt;ONTAP: echo "test" &amp;gt; /mnt/fpolicy_vol/test_fpolicy_event.txt
    ONTAP-&amp;gt;&amp;gt;FP: NOTI_REQ (FILE_CREATE event)
    FP-&amp;gt;&amp;gt;FP: Parse event, extract metadata
    FP-&amp;gt;&amp;gt;SQS: SendMessage (JSON payload)
    SQS--&amp;gt;&amp;gt;SQS: Message available for consumers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Timeline (actual observed)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T+0s&lt;/td&gt;
&lt;td&gt;TCP connection test&lt;/td&gt;
&lt;td&gt;ONTAP → Fargate IP (10.0.128.98:9898)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+10s&lt;/td&gt;
&lt;td&gt;Session established&lt;/td&gt;
&lt;td&gt;NEGO_REQ → NEGO_RESP handshake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+12s&lt;/td&gt;
&lt;td&gt;KEEP_ALIVE starts&lt;/td&gt;
&lt;td&gt;2-minute interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+30s&lt;/td&gt;
&lt;td&gt;NFS file created&lt;/td&gt;
&lt;td&gt;&lt;code&gt;echo "test" &amp;gt; /mnt/fpolicy_vol/test_fpolicy_event.txt&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+31s&lt;/td&gt;
&lt;td&gt;NOTI_REQ received&lt;/td&gt;
&lt;td&gt;FPolicy server receives file creation event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+32s&lt;/td&gt;
&lt;td&gt;SQS delivery&lt;/td&gt;
&lt;td&gt;Event sent to SQS queue (FPolicy_Q)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  SQS message format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FILE_CREATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"svm_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FSxN_OnPre"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volume_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vol1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/vol1/test_fpolicy_event.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.0.128.98"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T08:45:32Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequence_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IAM issue discovered and fixed
&lt;/h3&gt;

&lt;p&gt;The ECS task role's SQS policy used a Resource ARN pattern &lt;code&gt;arn:aws:sqs:...:fsxn-fpolicy-*&lt;/code&gt; that didn't match the actual queue name &lt;code&gt;FPolicy_Q&lt;/code&gt;, so the task role had no permission to send to that queue. Fix: use the explicit queue ARN or a &lt;code&gt;*&lt;/code&gt; wildcard in the template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: When a queue name doesn't match the template's resource pattern, delivery fails without any deployment-time error. Parameterizing the queue ARN keeps the policy scoped to the real queue; a broader resource pattern also works but grants more than the task needs.&lt;/p&gt;
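
&lt;p&gt;A mismatch like this can be caught before deployment with a small check. The sketch below is hypothetical and not part of the repository: it uses Python's &lt;code&gt;fnmatchcase&lt;/code&gt; as an approximation of IAM's &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;?&lt;/code&gt; wildcards (unlike IAM, it also treats square brackets as special), and the region and account ID in the example ARNs are placeholders.&lt;/p&gt;

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def uncovered_queues(resource_patterns: Iterable[str], queue_arns: Iterable[str]) -> List[str]:
    """Return the queue ARNs that no policy Resource pattern matches."""
    patterns = list(resource_patterns)
    return [
        arn for arn in queue_arns
        if not any(fnmatchcase(arn, pattern) for pattern in patterns)
    ]


# Placeholder ARNs that mirror the mismatch described above.
policy_resources = ["arn:aws:sqs:*:*:fsxn-fpolicy-*"]
actual_queues = ["arn:aws:sqs:ap-northeast-1:123456789012:FPolicy_Q"]
print(uncovered_queues(policy_resources, actual_queues))
```

&lt;p&gt;An empty list means every queue the server writes to is covered by at least one pattern.&lt;/p&gt;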

&lt;h3&gt;
  
  
  Event contract assumptions
&lt;/h3&gt;

&lt;p&gt;The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate events can occur (especially during Persistent Store replay)&lt;/li&gt;
&lt;li&gt;Delivery order is not guaranteed (confirmed in Section 9)&lt;/li&gt;
&lt;li&gt;Consumers must be idempotent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file_path + timestamp + sequence_number&lt;/code&gt; serves as an idempotency key candidate&lt;/li&gt;
&lt;li&gt;Replay events may arrive after newer events&lt;/li&gt;
&lt;li&gt;Schema versioning should be introduced before multi-UC production rollout&lt;/li&gt;
&lt;/ul&gt;
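
&lt;p&gt;A consumer that follows these assumptions might look like the sketch below. The field names come from the SQS message format shown earlier; the function names are hypothetical, and the ordering step assumes every timestamp uses the same ISO 8601 UTC format, so that string order equals time order.&lt;/p&gt;

```python
import hashlib
from typing import Iterable, List

KEY_FIELDS = ("file_path", "timestamp", "sequence_number")


def idempotency_key(event: dict) -> str:
    """Hash the candidate key fields into a fixed-length identifier."""
    raw = "|".join(str(event[field]) for field in KEY_FIELDS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def dedupe_and_order(events: Iterable[dict]) -> List[dict]:
    """Drop duplicate deliveries, then sort by event time instead of arrival order."""
    seen = set()
    unique = []
    for event in events:
        key = idempotency_key(event)
        if key in seen:
            continue  # duplicate delivery, e.g. from Persistent Store replay
        seen.add(key)
        unique.append(event)
    return sorted(unique, key=lambda e: (e["timestamp"], e["sequence_number"]))
```

&lt;p&gt;In a long-running consumer the &lt;code&gt;seen&lt;/code&gt; set would live in a durable store rather than in memory, because duplicates from a replay can arrive well after the original event.&lt;/p&gt;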




&lt;h2&gt;
  
  
  9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important prerequisite&lt;/strong&gt;: FPolicy Persistent Store is available for &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;asynchronous non-mandatory policies&lt;/a&gt; only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have &lt;a href="https://docs.netapp.com/us-en/ontap-restapi/protocols_fpolicy_svm.uuid_persistent-stores_endpoint_overview.html" rel="noopener noreferrer"&gt;only one Persistent Store&lt;/a&gt;, and the same store can be used by multiple policies within that SVM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The test procedure
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Stop Fargate task (ECS &lt;code&gt;stop-task&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Create 5 files via NFS during downtime (&lt;code&gt;replay-test-1.txt&lt;/code&gt; through &lt;code&gt;replay-test-5.txt&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Wait for ECS service auto-recovery (new task launch)&lt;/li&gt;
&lt;li&gt;Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)&lt;/li&gt;
&lt;li&gt;Verify all 5 events arrive in SQS&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Events generated during downtime&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events replayed to SQS&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lost events&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay delivery order&lt;/td&gt;
&lt;td&gt;3, 1, 2, 5, 4 (non-sequential)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay completion time&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key observation: Out-of-order replay
&lt;/h3&gt;

&lt;p&gt;Persistent Store replays events in a &lt;strong&gt;non-sequential order&lt;/strong&gt; — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency&lt;/strong&gt;: Deduplicate by file path + timestamp (plus the sequence number when the event carries one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp-based ordering&lt;/strong&gt;: Sort by event timestamp, not arrival order&lt;/li&gt;
&lt;/ul&gt;
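
&lt;p&gt;Timestamp-based ordering can be sketched as a "newest event wins" reduction per file path. The dictionary keys (&lt;code&gt;file_path&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;operation&lt;/code&gt;) are hypothetical, and the sketch assumes timestamps are ISO 8601 UTC strings in one consistent format, so that string comparison matches time order.&lt;/p&gt;

```python
def latest_state_by_path(events):
    """Keep, for each file path, the event with the newest event timestamp.

    Arrival order is ignored on purpose: an older event that arrives late
    never overwrites a newer one.
    """
    latest = {}
    for event in events:
        current = latest.get(event["file_path"])
        if current is None or event["timestamp"] > current["timestamp"]:
            latest[event["file_path"]] = event
    return latest


def in_event_order(events):
    """Return events sorted by event timestamp rather than arrival order."""
    return sorted(events, key=lambda event: event["timestamp"])


# A delete that arrives before the (older) create for the same file.
arrived = [
    {"file_path": "/vol1/a.txt", "timestamp": "2026-05-20T10:00:05Z", "operation": "delete"},
    {"file_path": "/vol1/a.txt", "timestamp": "2026-05-20T10:00:01Z", "operation": "create"},
]
state = latest_state_by_path(arrived)
```

&lt;p&gt;Here &lt;code&gt;state&lt;/code&gt; keeps the delete as the current state of the file even though the create arrived last.&lt;/p&gt;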

&lt;h3&gt;
  
  
  20-file burst validation
&lt;/h3&gt;

&lt;p&gt;A separate 20-file burst test also showed zero event loss under a somewhat higher load:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Files Created&lt;/th&gt;
&lt;th&gt;Events Delivered&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Replay (5 files)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Burst (20 files)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 replay storm metrics
&lt;/h3&gt;

&lt;p&gt;The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store volume usage before/after replay&lt;/td&gt;
&lt;td&gt;Capacity planning for the store volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events queued vs events replayed&lt;/td&gt;
&lt;td&gt;Completeness verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay throughput (events/sec)&lt;/td&gt;
&lt;td&gt;Performance baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay duration&lt;/td&gt;
&lt;td&gt;SLO calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-order distance&lt;/td&gt;
&lt;td&gt;Downstream buffer sizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate events&lt;/td&gt;
&lt;td&gt;Idempotency requirement validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP EMS logs around disconnect/reconnect&lt;/td&gt;
&lt;td&gt;Root cause correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.&lt;/p&gt;
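
&lt;p&gt;The "out-of-order distance" metric is not defined precisely above. One possible definition, used here purely as an assumption, is the largest gap between the position at which an event arrived and the position at which it was created. Applied to the replay order observed in the 5-file test:&lt;/p&gt;

```python
def out_of_order_distance(arrival_order):
    """Largest gap between an event's arrival position and its creation position.

    arrival_order lists 1-based creation sequence numbers in the order the
    events arrived. A fully ordered replay returns 0.
    """
    return max(
        (abs(position - created) for position, created in enumerate(arrival_order, start=1)),
        default=0,
    )


observed = [3, 1, 2, 5, 4]  # replay delivery order from the results table
distance = out_of_order_distance(observed)
print(distance)  # prints 2
```

&lt;p&gt;Under this definition, a downstream reordering buffer would need to hold at least &lt;code&gt;distance&lt;/code&gt; events to restore creation order for this particular replay. Larger replays may behave differently.&lt;/p&gt;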

&lt;h3&gt;
  
  
  Operational framing: event durability as RPO/RTO
&lt;/h3&gt;

&lt;p&gt;Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while &lt;code&gt;ReplayCompletionTime_sec&lt;/code&gt; provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 12 validation scope
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Phase 12 Assumption&lt;/th&gt;
&lt;th&gt;Production Consideration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;Single SVM validation&lt;/td&gt;
&lt;td&gt;Multi-SVM needs per-SVM policy and Persistent Store planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Test volume&lt;/td&gt;
&lt;td&gt;Production volumes should be grouped by UC/event profile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;NFS-based E2E test&lt;/td&gt;
&lt;td&gt;NFSv3/NFSv4.1/SMB replay validation remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event types&lt;/td&gt;
&lt;td&gt;File create&lt;/td&gt;
&lt;td&gt;Modify/delete/rename validation remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy mode&lt;/td&gt;
&lt;td&gt;Async non-mandatory&lt;/td&gt;
&lt;td&gt;Required for Persistent Store (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;NetApp docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need a much broader, generated exploration of the input space than hand-written examples provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  The approach
&lt;/h3&gt;

&lt;p&gt;Using Python's &lt;a href="https://hypothesis.readthedocs.io/" rel="noopener noreferrer"&gt;Hypothesis&lt;/a&gt; library, we defined 16 properties across the Phase 12 modules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property Group&lt;/th&gt;
&lt;th&gt;Properties&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;th&gt;Bugs Found&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Protobuf Frame Reader&lt;/td&gt;
&lt;td&gt;5 (round-trip, max size, EOF, multi-message, auto-detect)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Guardrails&lt;/td&gt;
&lt;td&gt;4 (mode behavior, rate limit, daily cap, cooldown)&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;3 (record/query round-trip, GSI consistency, TTL)&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO Evaluation&lt;/td&gt;
&lt;td&gt;2 (threshold comparison, no-data handling)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Forecast&lt;/td&gt;
&lt;td&gt;2 (regression accuracy, edge cases)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Bugs discovered
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Protobuf reader&lt;/strong&gt;: &lt;code&gt;AUTO_DETECT&lt;/code&gt; mode failed when the first 4 bytes happened to form a valid-looking length that exceeded &lt;code&gt;max_message_size&lt;/code&gt;. Fix: treat an oversized candidate length as an indicator of &lt;code&gt;FRAMELESS&lt;/code&gt; mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;: &lt;code&gt;BREAK_GLASS&lt;/code&gt; mode didn't emit the &lt;code&gt;GuardrailBypass&lt;/code&gt; metric when the DynamoDB tracking update failed. Fix: emit the metric before the tracking update call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLO evaluation&lt;/strong&gt;: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), &lt;code&gt;max(datapoints, key=lambda dp: dp["Timestamp"])&lt;/code&gt; could select a different datapoint depending on the order in which the tied datapoints were returned. Fix: add the datapoint value as a secondary sort key.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
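
&lt;p&gt;The SLO evaluation fix can be illustrated with a compound sort key. This is a sketch rather than the project's code: the function name &lt;code&gt;latest_datapoint&lt;/code&gt; and the choice of the &lt;code&gt;Average&lt;/code&gt; statistic are assumptions, and preferring the larger value on a tie is one possible policy.&lt;/p&gt;

```python
from datetime import datetime, timezone


def latest_datapoint(datapoints, statistic="Average"):
    """Pick the newest datapoint; break timestamp ties by the larger statistic value.

    Using (timestamp, value) as the key makes the result independent of the
    order in which tied datapoints appear in the input list.
    """
    if not datapoints:
        return None
    return max(datapoints, key=lambda dp: (dp["Timestamp"], dp[statistic]))


ts = datetime(2026, 5, 20, 10, 0, tzinfo=timezone.utc)
tied = [
    {"Timestamp": ts, "Average": 1.5},
    {"Timestamp": ts, "Average": 4.0},
]
```

&lt;p&gt;With this key, &lt;code&gt;latest_datapoint(tied)&lt;/code&gt; and &lt;code&gt;latest_datapoint(list(reversed(tied)))&lt;/code&gt; return the same datapoint.&lt;/p&gt;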

&lt;h3&gt;
  
  
  Example property test
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
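&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hypothesis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategies&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;

&lt;span class="c1"&gt;# ProtobufFrameReader, FramingMode, and the _make_* helpers come from the project under test.&lt;/span&gt;
&lt;span class="c1"&gt;# The test below is a method of a test class, hence the self parameter.&lt;/span&gt;
&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;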
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nd"&gt;@settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_length_prefixed_round_trip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Property: LENGTH_PREFIXED encode → decode preserves all messages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;stream_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_length_prefixed_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_stream_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;frame_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtobufFrameReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LENGTH_PREFIXED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_message_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_message&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;  &lt;span class="c1"&gt;# Round-trip property
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
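
&lt;p&gt;The test above relies on a stream-building helper that is not shown. A hypothetical stand-in might look like the sketch below. It assumes a 4-byte big-endian unsigned length prefix: the 4-byte width matches the bug description in the previous section, while the byte order and every function name here are assumptions. It also shows the oversized-length check described in the first bug fix.&lt;/p&gt;

```python
import struct

# Assumption: each message is preceded by a 4-byte big-endian unsigned length.
LENGTH_PREFIX = struct.Struct(">I")


def make_length_prefixed_stream(messages):
    """Encode messages as length-prefixed frames, concatenated into one byte string."""
    return b"".join(LENGTH_PREFIX.pack(len(message)) + message for message in messages)


def split_length_prefixed_stream(data, max_message_size):
    """Decode a complete length-prefixed byte string back into its messages."""
    messages = []
    offset = 0
    while len(data) > offset:
        (length,) = LENGTH_PREFIX.unpack_from(data, offset)
        if length > max_message_size:
            raise ValueError(f"frame length {length} exceeds limit {max_message_size}")
        offset += LENGTH_PREFIX.size
        messages.append(data[offset:offset + length])
        offset += length
    return messages


def looks_frameless(first_bytes, max_message_size):
    """Return True when the first 4 bytes decode to a length above the limit.

    This mirrors the fix described earlier: an oversized candidate length is
    taken as a sign that the stream is not length-prefixed.
    """
    (candidate_length,) = LENGTH_PREFIX.unpack(first_bytes[:LENGTH_PREFIX.size])
    return candidate_length > max_message_size


stream = make_length_prefixed_stream([b"abc", b"\x00\x01"])
decoded = split_length_prefixed_stream(stream, max_message_size=1000)
```

&lt;p&gt;The synchronous decoder exists only to make the round-trip property visible without the asynchronous reader. It does not handle truncated input.&lt;/p&gt;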






&lt;h2&gt;
  
  
  11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The critical finding
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points are &lt;strong&gt;not standard S3 endpoints&lt;/strong&gt;. They use the FSx data plane, which has different network routing characteristics than standard S3.&lt;/p&gt;

&lt;p&gt;In this pattern library, FSx for ONTAP S3 Access Points serve as an &lt;strong&gt;AWS service integration boundary&lt;/strong&gt;: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-layer authorization model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    Client[S3 API Client] --&amp;gt; IAM{Layer 1: IAM Policy}
    IAM --&amp;gt;|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP --&amp;gt;|resource policy| FS{Layer 3: File System Identity}
    FS --&amp;gt;|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.-&amp;gt;|❌ Denied| Block1[Access Denied]
    AP -.-&amp;gt;|❌ Denied| Block2[Access Denied]
    FS -.-&amp;gt;|❌ No permission| Block3[Access Denied]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-manage-access-fsxn.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correct IAM ARN format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;ACCOUNT_ID&amp;gt;:accesspoint/fsxn-eda-s3ap"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;ACCOUNT_ID&amp;gt;:accesspoint/fsxn-eda-s3ap/object/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Common mistake&lt;/strong&gt;: Using the S3AP alias (&lt;code&gt;xxx-ext-s3alias&lt;/code&gt;) as a bucket ARN. The alias is valid only as the bucket name in S3 API calls (for example, the &lt;code&gt;Bucket&lt;/code&gt; parameter in boto3). IAM policies require the full access point ARN.&lt;/p&gt;
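
&lt;p&gt;To keep the two ARN forms consistent across templates, they can be generated rather than typed by hand. The helper names below are hypothetical. The ARN layout follows the policy example above, with the bare access point ARN for &lt;code&gt;s3:ListBucket&lt;/code&gt; and the &lt;code&gt;/object/*&lt;/code&gt; suffix for &lt;code&gt;s3:GetObject&lt;/code&gt;.&lt;/p&gt;

```python
def access_point_arn(region, account_id, access_point_name):
    """ARN of the access point itself (used for list-style actions)."""
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point_name}"


def access_point_object_arn(region, account_id, access_point_name, key_pattern="*"):
    """ARN pattern for objects reached through the access point."""
    base = access_point_arn(region, account_id, access_point_name)
    return f"{base}/object/{key_pattern}"


def read_only_statements(region, account_id, access_point_name):
    """Build the two Allow statements for read-only access via the access point."""
    return [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": access_point_arn(region, account_id, access_point_name),
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": access_point_object_arn(region, account_id, access_point_name),
        },
    ]


# 123456789012 is a placeholder account ID.
statements = read_only_statements("ap-northeast-1", "123456789012", "fsxn-eda-s3ap")
```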

&lt;h3&gt;
  
  
  VPC network constraint (environment-specific observation)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Observed Result&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint)&lt;/td&gt;
&lt;td&gt;⚠️ Timeout in this config&lt;/td&gt;
&lt;td&gt;Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet → S3 AP (NetworkOrigin=Internet)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Routes correctly with valid IAM credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC)&lt;/td&gt;
&lt;td&gt;Supported per AWS docs; not verified in Phase 12&lt;/td&gt;
&lt;td&gt;Requires VPC-origin AP and matching endpoint policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → ONTAP REST API&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Direct management LIF access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural implication for this pattern&lt;/strong&gt;: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run outside VPC (with Internet access)&lt;/li&gt;
&lt;li&gt;Use NAT Gateway for outbound routing&lt;/li&gt;
&lt;li&gt;Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Write support and practical constraints
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points support &lt;code&gt;PutObject&lt;/code&gt;, &lt;code&gt;DeleteObject&lt;/code&gt;, multipart uploads (&lt;code&gt;CreateMultipartUpload&lt;/code&gt;, &lt;code&gt;UploadPart&lt;/code&gt;, &lt;code&gt;CompleteMultipartUpload&lt;/code&gt;), and other write operations, so they are not read-only. The &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;access point compatibility table&lt;/a&gt; documents the full list of supported S3 API operations.&lt;/p&gt;

&lt;p&gt;However, these access points are not full S3 buckets. Key constraints include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum upload size: 5 GB&lt;/li&gt;
&lt;li&gt;Only &lt;code&gt;FSX_ONTAP&lt;/code&gt; storage class&lt;/li&gt;
&lt;li&gt;Only SSE-FSX encryption&lt;/li&gt;
&lt;li&gt;No ACLs (except &lt;code&gt;bucket-owner-full-control&lt;/code&gt;), no Object Versioning, no Object Lock, no presigned URLs&lt;/li&gt;
&lt;/ul&gt;
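
&lt;p&gt;A producer can check some of these constraints before it issues a write. The sketch below is a hypothetical pre-flight helper, not part of any AWS SDK. It covers only the size, storage class, and ACL constraints listed above, and it interprets "5 GB" as 5 × 1024³ bytes, which is an assumption.&lt;/p&gt;

```python
# Assumption: the 5 GB limit is interpreted as 5 * 1024**3 bytes.
MAX_UPLOAD_BYTES = 5 * 1024**3


def check_upload(size_bytes, storage_class=None, acl=None):
    """Return a list of human-readable problems; an empty list means no known conflict."""
    problems = []
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"object size {size_bytes} exceeds the 5 GB upload limit")
    if storage_class not in (None, "FSX_ONTAP"):
        problems.append(f"storage class {storage_class} is not supported; only FSX_ONTAP is")
    if acl not in (None, "bucket-owner-full-control"):
        problems.append(f"ACL {acl} is not supported; only bucket-owner-full-control is")
    return problems


ok = check_upload(10 * 1024**2)
rejected = check_upload(6 * 1024**3, storage_class="STANDARD", acl="public-read")
```

&lt;p&gt;An empty result does not guarantee that the write succeeds, because IAM, the access point policy, and ONTAP file-system permissions still apply.&lt;/p&gt;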

&lt;p&gt;All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Cross-Project Feedback — Template Hardening
&lt;/h2&gt;

&lt;p&gt;During Phase 12, the companion project &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;fsxn-observability-integrations&lt;/a&gt; reviewed our CloudFormation templates and provided actionable feedback. All items were applied:&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Group: SourceSecurityGroupId over CIDR
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; (broad):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9898&lt;/span&gt;
    &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9898&lt;/span&gt;
    &lt;span class="na"&gt;CidrIp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; (precise):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FPolicyPort&lt;/span&gt;
    &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FPolicyPort&lt;/span&gt;
    &lt;span class="na"&gt;SourceSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FsxnSvmSecurityGroupId&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FPolicy TCP from FSxN SVM Security Group&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This limits inbound traffic to the FSxN SVM's security group rather than the whole &lt;code&gt;10.0.0.0/8&lt;/code&gt; range, which is a significant security improvement for production deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  ONTAP CLI: Deprecated &lt;code&gt;vserver&lt;/code&gt; prefix
&lt;/h3&gt;

&lt;p&gt;ONTAP 9.11+ deprecates the &lt;code&gt;vserver&lt;/code&gt; prefix on FPolicy commands. We updated all templates and documentation (8 languages) to use the recommended format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deprecated (still works for backward compatibility)&lt;/span&gt;
vserver fpolicy policy external-engine create &lt;span class="nt"&gt;-vserver&lt;/span&gt; FSxN_OnPre ...

&lt;span class="c"&gt;# Recommended (ONTAP 9.11+)&lt;/span&gt;
fpolicy policy external-engine create &lt;span class="nt"&gt;-vserver&lt;/span&gt; FSxN_OnPre ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  KMS Decrypt: When it's needed (and when it's not)
&lt;/h3&gt;

&lt;p&gt;Added documentation clarifying SQS encryption behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SqsManagedSseEnabled: true&lt;/code&gt; → kms:Decrypt is &lt;strong&gt;NOT&lt;/strong&gt; needed (transparent)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KmsMasterKeyId: alias/aws/sqs&lt;/code&gt; → kms:Decrypt &lt;strong&gt;IS&lt;/strong&gt; needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our templates use &lt;code&gt;SqsManagedSseEnabled: true&lt;/code&gt;, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.&lt;/p&gt;
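
&lt;p&gt;The rule above can be encoded as a small check when generating the consumer policy. This is a sketch under the assumption that the queue is described by a dictionary of &lt;code&gt;AWS::SQS::Queue&lt;/code&gt; properties. The helper names are hypothetical, and the three SQS actions listed are the ones a Lambda SQS event source typically needs.&lt;/p&gt;

```python
def needs_kms_decrypt(queue_properties):
    """Return True when the queue uses a KMS key (KmsMasterKeyId is set).

    With SqsManagedSseEnabled and no KmsMasterKeyId, encryption is handled
    transparently by SQS and the consumer needs no KMS permission.
    """
    return bool(queue_properties.get("KmsMasterKeyId"))


def consumer_actions(queue_properties):
    """List the IAM actions for a queue consumer, adding kms:Decrypt only when required."""
    actions = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
    if needs_kms_decrypt(queue_properties):
        actions.append("kms:Decrypt")
    return actions


sse_sqs_queue = {"SqsManagedSseEnabled": True}
sse_kms_queue = {"KmsMasterKeyId": "alias/aws/sqs"}
```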

&lt;h3&gt;
  
  
  EC2 AMI: Removed redundant Docker install
&lt;/h3&gt;

&lt;p&gt;ECS-optimized AMIs (&lt;code&gt;{{resolve:ssm:/aws/service/ecs/optimized-ami/...}}&lt;/code&gt;) already include Docker. Removed the unnecessary &lt;code&gt;yum install -y docker&lt;/code&gt; from UserData scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cpu/Memory: String type is intentional
&lt;/h3&gt;

&lt;p&gt;Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with &lt;code&gt;AllowedValues&lt;/code&gt; provides better validation than Number type for this constrained parameter space.&lt;/p&gt;
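
&lt;p&gt;The same constraint can be checked in deployment tooling before a stack update is submitted. The table below is a deliberately partial sketch: the 256-CPU row comes from the example above, and the 512 and 1024 rows reflect the Fargate task size documentation as I understand it, so verify them against the current AWS documentation before relying on them. Values are strings to mirror the String-typed template parameters.&lt;/p&gt;

```python
# Partial table of Fargate CPU -> allowed memory (MiB), as strings.
# Only the "256" row is taken from this article; the others are assumptions to verify.
FARGATE_MEMORY_BY_CPU = {
    "256": {"512", "1024", "2048"},
    "512": {str(mib) for mib in range(1024, 4096 + 1, 1024)},
    "1024": {str(mib) for mib in range(2048, 8192 + 1, 1024)},
}


def is_valid_fargate_size(cpu, memory):
    """Return True when the (cpu, memory) pair appears in the partial table above."""
    return memory in FARGATE_MEMORY_BY_CPU.get(cpu, set())
```

&lt;p&gt;For example, &lt;code&gt;is_valid_fargate_size("256", "4096")&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt;, which a Number-typed parameter with independent ranges would not catch.&lt;/p&gt;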




&lt;h2&gt;
  
  
  13. What's Next — Phase 13 Outlook
&lt;/h2&gt;

&lt;p&gt;Phase 12 completes the operational hardening layer. The pipeline now has a production-hardening baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Capacity guardrails preventing runaway auto-scaling&lt;/li&gt;
&lt;li&gt;✅ Automated secrets rotation on 90-day cycle&lt;/li&gt;
&lt;li&gt;✅ Proactive capacity forecasting with daily predictions&lt;/li&gt;
&lt;li&gt;✅ SLO-based observability with alarm-driven alerting&lt;/li&gt;
&lt;li&gt;✅ Data lineage tracking for audit and debugging&lt;/li&gt;
&lt;li&gt;✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios&lt;/li&gt;
&lt;li&gt;✅ Property-based testing catching real bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ownership boundary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Owner&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shared event platform&lt;/td&gt;
&lt;td&gt;Platform / storage team&lt;/td&gt;
&lt;td&gt;FPolicy server, SQS queue, EventBridge bus, Persistent Store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP operations&lt;/td&gt;
&lt;td&gt;Storage team&lt;/td&gt;
&lt;td&gt;SVM, volume, FPolicy policy, Persistent Store capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security operations&lt;/td&gt;
&lt;td&gt;Security / platform team&lt;/td&gt;
&lt;td&gt;Secrets rotation, BREAK_GLASS approval, IAM policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload UC&lt;/td&gt;
&lt;td&gt;Application / data team&lt;/td&gt;
&lt;td&gt;Step Functions, UC routing rules, output destinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Platform + workload teams&lt;/td&gt;
&lt;td&gt;SLO dashboard, UC-specific alarms, runbooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Production Readiness Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Phase 12 Status&lt;/th&gt;
&lt;th&gt;Remaining Work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Guardrails&lt;/td&gt;
&lt;td&gt;Verified (DRY_RUN/ENFORCE/BREAK_GLASS)&lt;/td&gt;
&lt;td&gt;Approval workflow optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Rotation&lt;/td&gt;
&lt;td&gt;4-step rotation verified&lt;/td&gt;
&lt;td&gt;Ensure all clients read from Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO Dashboard&lt;/td&gt;
&lt;td&gt;Deployed, 4 alarms active&lt;/td&gt;
&lt;td&gt;Runbooks and alarm response automation in Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store Replay&lt;/td&gt;
&lt;td&gt;5-event + 20-event scenarios verified&lt;/td&gt;
&lt;td&gt;1000+ replay storm testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3AP Monitoring&lt;/td&gt;
&lt;td&gt;ONTAP health path verified&lt;/td&gt;
&lt;td&gt;Split S3AP health check (VPC-external)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protobuf Framing&lt;/td&gt;
&lt;td&gt;Property/integration tested&lt;/td&gt;
&lt;td&gt;Live ONTAP protobuf wire validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-account OAM&lt;/td&gt;
&lt;td&gt;Stack deployed conditionally&lt;/td&gt;
&lt;td&gt;Second-account validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production UC E2E&lt;/td&gt;
&lt;td&gt;Pipeline verified to SQS delivery&lt;/td&gt;
&lt;td&gt;Full TriggerMode=EVENT_DRIVEN UC flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Dashboard&lt;/td&gt;
&lt;td&gt;Not yet deployed&lt;/td&gt;
&lt;td&gt;Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 candidates
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Operational readiness&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Canary S3AP check separation&lt;/strong&gt;: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO violation runbooks&lt;/strong&gt;: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay storm testing&lt;/strong&gt;: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Enterprise deployment&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account OAM validation&lt;/strong&gt;: Deploy workload-account-oam-link.yaml in a second AWS account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared platform vs workload boundary&lt;/strong&gt;: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production UC end-to-end&lt;/strong&gt;: Deploy a UC template with &lt;code&gt;TriggerMode=EVENT_DRIVEN&lt;/code&gt; and verify the complete flow from NFS file creation through Step Functions execution to output generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Protocol and cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf live wire validation&lt;/strong&gt;: Confirm protobuf TCP framing with NetApp support and validate &lt;code&gt;AUTO_DETECT&lt;/code&gt; mode against real ONTAP protobuf traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization dashboard&lt;/strong&gt;: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Decision trees and operational guides&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decision trees&lt;/strong&gt;: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetApp Partner Delivery Checklist&lt;/strong&gt;: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Cost model awareness
&lt;/h3&gt;

&lt;p&gt;While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Cost Type&lt;/th&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy server (Fargate/EC2)&lt;/td&gt;
&lt;td&gt;Fixed baseline&lt;/td&gt;
&lt;td&gt;Always-on listener&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;Fixed + per-GB&lt;/td&gt;
&lt;td&gt;Required if VPC Lambda needs Internet-origin S3AP access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Synthetics&lt;/td&gt;
&lt;td&gt;Per-canary-run&lt;/td&gt;
&lt;td&gt;5-minute interval = 8,640 runs/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch custom metrics + Logs&lt;/td&gt;
&lt;td&gt;Per-metric + per-GB ingested&lt;/td&gt;
&lt;td&gt;SLO metrics, FPolicy server logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (lineage + guardrails)&lt;/td&gt;
&lt;td&gt;Per-request (PAY_PER_REQUEST)&lt;/td&gt;
&lt;td&gt;Event volume dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS / EventBridge&lt;/td&gt;
&lt;td&gt;Per-message / per-event&lt;/td&gt;
&lt;td&gt;Event volume dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store volume&lt;/td&gt;
&lt;td&gt;Per-GB provisioned&lt;/td&gt;
&lt;td&gt;Sized for max queued events during downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
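
&lt;p&gt;The Synthetics figure in the table follows from simple schedule arithmetic. The sketch below reproduces it, assuming a 30-day month and an interval that divides a day evenly; the helper name is illustrative.&lt;/p&gt;

```python
# Worked arithmetic for the Synthetics row: runs per month at a fixed interval.
# Assumption: a 30-day month; the function name is illustrative only.
MINUTES_PER_DAY = 24 * 60


def canary_runs_per_month(interval_minutes, days=30):
    """Return how many canary runs a fixed schedule produces in `days` days."""
    valid = interval_minutes in range(1, MINUTES_PER_DAY + 1)
    if not valid or MINUTES_PER_DAY % interval_minutes != 0:
        raise ValueError("interval must be a positive divisor of 1440 minutes")
    return (MINUTES_PER_DAY // interval_minutes) * days


print(canary_runs_per_month(5))  # 8640
```

&lt;p&gt;Multiplying the result by the per-run price for your region gives the monthly Synthetics baseline; the price itself is deliberately left out here because it varies by region and over time.&lt;/p&gt;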

&lt;p&gt;&lt;strong&gt;Design decision for new deployments&lt;/strong&gt;: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).&lt;/p&gt;

&lt;h3&gt;
  
  
  NetworkOrigin decision table
&lt;/h3&gt;

&lt;p&gt;Based on &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;, the following decision criteria apply:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose VPC-origin when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All consumers are Lambda/ECS/EC2 inside the same VPC&lt;/li&gt;
&lt;li&gt;Private connectivity is mandatory (no internet-routed path allowed)&lt;/li&gt;
&lt;li&gt;VPC endpoint policy is part of the security boundary&lt;/li&gt;
&lt;li&gt;The network restriction must be built into the access point, so that a later policy change cannot accidentally widen it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Internet-origin when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External accounts or on-premises clients need access&lt;/li&gt;
&lt;li&gt;Consumers are outside the bound VPC&lt;/li&gt;
&lt;li&gt;Internet-routed access with IAM controls is acceptable&lt;/li&gt;
&lt;li&gt;Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;VPC-origin&lt;/th&gt;
&lt;th&gt;Internet-origin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network enforcement&lt;/td&gt;
&lt;td&gt;Built-in explicit Deny for non-VPC traffic&lt;/td&gt;
&lt;td&gt;Policy-based only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC endpoint required&lt;/td&gt;
&lt;td&gt;Yes (Gateway or Interface in bound VPC)&lt;/td&gt;
&lt;td&gt;Only if using &lt;code&gt;aws:SourceVpc&lt;/code&gt; conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-VPC access&lt;/td&gt;
&lt;td&gt;Via Interface endpoint + peering/TGW to bound VPC&lt;/td&gt;
&lt;td&gt;Via policy conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change access scope&lt;/td&gt;
&lt;td&gt;Must recreate access point&lt;/td&gt;
&lt;td&gt;Update policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises access&lt;/td&gt;
&lt;td&gt;Via Interface endpoint in bound VPC&lt;/td&gt;
&lt;td&gt;Direct with IAM credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost implication&lt;/td&gt;
&lt;td&gt;VPC endpoint (Gateway=free, Interface=hourly)&lt;/td&gt;
&lt;td&gt;NAT Gateway if VPC Lambda needs access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.&lt;/p&gt;
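
&lt;p&gt;Because the choice is irreversible, it can help to write the criteria down as an explicit pre-deployment check. The following is a sketch under stated assumptions: the class, field names, and precedence rules are hypothetical and simply encode the lists and table above, and the return values mirror the two NetworkOrigin settings (&lt;code&gt;VPC&lt;/code&gt; and &lt;code&gt;Internet&lt;/code&gt;).&lt;/p&gt;

```python
# Hypothetical helper that encodes the decision criteria listed above.
# The names are illustrative and are not part of any AWS API.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerProfile:
    all_consumers_in_bound_vpc: bool    # Lambda/ECS/EC2 inside the same VPC
    private_path_mandatory: bool        # no internet-routed path allowed
    external_accounts_or_on_prem: bool  # consumers outside the bound VPC


def choose_network_origin(profile):
    """Return "VPC" or "Internet" for a new access point."""
    if profile.private_path_mandatory:
        # Outside consumers must then come in through an Interface endpoint
        # in the bound VPC (peering, Transit Gateway, or a private link).
        return "VPC"
    if profile.external_accounts_or_on_prem:
        return "Internet"
    if not profile.all_consumers_in_bound_vpc:
        return "Internet"
    return "VPC"


print(choose_network_origin(ConsumerProfile(True, False, False)))  # VPC
```

&lt;p&gt;The precedence (mandatory private connectivity wins over convenience for outside consumers) is a design assumption of this sketch. Adjust it if your security baseline weighs the factors differently.&lt;/p&gt;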

&lt;h3&gt;
  
  
  Phase 12 readiness by workload type
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Phase 12 Ready?&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Controlled PoC / single-account&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;All core components verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low/moderate event volume (&amp;lt; 100 events/day)&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;20-event burst validated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DRY_RUN guardrail validation&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;Safe to deploy immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets rotation validation&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;4-step rotation verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-volume replay storm (1000+ events)&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Throughput curve and store capacity not yet measured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-account production&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;OAM link deployed but second-account validation pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strict SLO operations requiring runbooks&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Dashboard deployed, runbooks not yet written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live protobuf production mode&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Wire validation with NetApp support pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full EVENT_DRIVEN UC end-to-end&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Pipeline verified to SQS, Step Functions flow pending&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 runbook scope: first-response diagnostic bundle
&lt;/h3&gt;

&lt;p&gt;For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# FPolicy status&lt;/span&gt;
fpolicy show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt; &lt;span class="nt"&gt;-fields&lt;/span&gt; policy-name,status
fpolicy policy external-engine show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;
fpolicy persistent-store show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;

&lt;span class="c"&gt;# Connection and event state&lt;/span&gt;
fpolicy show-engine &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;
fpolicy show-passthrough-read-connection &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;

&lt;span class="c"&gt;# EMS logs for FPolicy events&lt;/span&gt;
event log show &lt;span class="nt"&gt;-messagename&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;fpolicy&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.&lt;/p&gt;
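
&lt;p&gt;The AWS-side half of the bundle can be collected in one call. The following is a minimal sketch, not the repository's implementation: the function name and output keys are hypothetical, and the clients are passed in so the function can be exercised with stubs. With boto3, the clients would be &lt;code&gt;boto3.client("sqs")&lt;/code&gt; and &lt;code&gt;boto3.client("cloudwatch")&lt;/code&gt;.&lt;/p&gt;

```python
# Sketch of AWS-side first-response diagnostics.
# Assumptions: queue_url and alarm_names are supplied by the operator;
# the output keys are illustrative, not a format defined by the project.
def collect_aws_diagnostics(sqs_client, cloudwatch_client, queue_url, alarm_names):
    """Gather SQS backlog counts and CloudWatch alarm states into one dict."""
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    alarms = cloudwatch_client.describe_alarms(AlarmNames=list(alarm_names))
    return {
        # SQS returns attribute values as strings, so convert them.
        "sqs_visible": int(attrs["ApproximateNumberOfMessages"]),
        "sqs_in_flight": int(attrs["ApproximateNumberOfMessagesNotVisible"]),
        "alarm_states": {
            alarm["AlarmName"]: alarm["StateValue"]
            for alarm in alarms.get("MetricAlarms", [])
        },
    }
```

&lt;p&gt;Serializing the returned dict alongside the ONTAP command output gives support a single snapshot taken at roughly the same moment. CloudWatch Logs excerpts are omitted from the sketch because the relevant log group names depend on the deployment.&lt;/p&gt;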




&lt;h2&gt;
  
  
  Deployed Infrastructure
&lt;/h2&gt;

&lt;p&gt;7 CloudFormation stacks deployed and verified:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-guardrails-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;DynamoDB tracking table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-lineage-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Data lineage DynamoDB + GSI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-slo-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;CloudWatch dashboard + 4 alarms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-oam-link&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-capacity-forecast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Lambda + EventBridge schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-secrets-rotation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;VPC Lambda + rotation config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-synthetic-monitoring&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcihlbv8m756gbdrr1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcihlbv8m756gbdrr1o.png" alt="CloudFormation Stacks" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit Tests&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property Tests (Hypothesis)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFormation Deployments&lt;/td&gt;
&lt;td&gt;7 stacks&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;td&gt;✅ All CREATE_COMPLETE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Invocations&lt;/td&gt;
&lt;td&gt;2 (forecast + rotation)&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;td&gt;✅ Successful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy E2E&lt;/td&gt;
&lt;td&gt;1 pipeline test&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Event delivered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay E2E&lt;/td&gt;
&lt;td&gt;5 events&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Zero loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-file burst&lt;/td&gt;
&lt;td&gt;20 events&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Zero loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs found (property testing)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  NetApp-Specific Takeaways
&lt;/h2&gt;

&lt;p&gt;For NetApp users and partners evaluating this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Persistent Store&lt;/strong&gt; works as the durability layer for asynchronous non-mandatory FPolicy policies (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;NetApp docs&lt;/a&gt;), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Points for FSx for ONTAP&lt;/strong&gt; are not standard S3 buckets: they support &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;selected S3 API operations&lt;/a&gt; including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetworkOrigin is a design-time decision&lt;/strong&gt;. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP-common vs AWS-specific&lt;/strong&gt;: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational readiness&lt;/strong&gt; requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.&lt;/p&gt;

&lt;p&gt;The property-based testing investment paid immediate dividends: the 53 property tests uncovered 3 real bugs that the example-based unit tests had missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.&lt;/p&gt;

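&lt;p&gt;With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests (116 unit and 53 property), and validated event delivery through to SQS, Phase 12 establishes the operational baseline for enterprise production workloads on FSx for ONTAP. The items marked Phase 13 in the readiness table (replay-storm testing, second-account OAM validation, runbooks, protobuf wire validation, and the full EVENT_DRIVEN flow) remain open.&lt;/p&gt;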




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Previous phases&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;Phase 1&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/public-sector-use-cases-unified-output-destination-and-a-localization-batch-fsx-for-ontap-s3-2hmo"&gt;Phase 7&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/operational-hardening-ci-grade-validation-and-pattern-c-b-hybrid-fsx-for-ontap-s3-access-587h"&gt;Phase 8&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/production-rollout-vpc-endpoint-auto-detection-and-the-cdk-no-go-fsx-for-ontap-s3-access-3lni"&gt;Phase 9&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Phase 11&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>amazonfsxfornetappontap</category>
      <category>s3accesspoints</category>
    </item>
    <item>
      <title>Event-Driven Ransomware Detection with ONTAP ARP + Datadog</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 09:16:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda</link>
      <guid>https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;ONTAP's Autonomous Ransomware Protection (ARP) detects encryption patterns at the storage layer. When ARP fires, an EMS event is pushed via webhook to API Gateway → Lambda → Datadog. In my validation environment, end-to-end latency was around 30 seconds. This post shows how to wire it up, what the alert looks like, and how to respond.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Threat Model
&lt;/h2&gt;

&lt;p&gt;Ransomware can encrypt hundreds or thousands of files per minute. Traditional detection (antivirus signatures, host-based EDR) often catches it only after significant damage is done.&lt;/p&gt;

&lt;p&gt;What if your &lt;em&gt;storage&lt;/em&gt; could detect the encryption pattern before the host-based tools react?&lt;/p&gt;

&lt;p&gt;That's exactly what ONTAP Autonomous Ransomware Protection (ARP) does. It runs ML-based entropy analysis at the storage layer, detecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sudden spikes in file entropy (encryption)&lt;/li&gt;
&lt;li&gt;Mass file extension changes (&lt;code&gt;.docx&lt;/code&gt; → &lt;code&gt;.encrypted&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Abnormal write patterns inconsistent with normal workload behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When ARP detects an attack, it changes the volume state to &lt;code&gt;attack-detected&lt;/code&gt; and fires an EMS event. Our job is to get that event to the security team in seconds, not hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Detection Pipeline
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt;, we built the audit log pipeline and showed Datadog search queries for file access events. Now we turn those patterns into event-driven security alerting — starting with ONTAP's most powerful detection signal: Autonomous Ransomware Protection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ONTAP ARP detects encryption behavior
    │
    ▼ EMS event: arw.volume.state (severity: alert)
ONTAP EMS Webhook (HTTPS POST)
    │
    ▼
API Gateway (REST endpoint)
    │
    ▼
Lambda (EMS handler)
    │
    ▼ normalize → format → ship
Datadog Logs API v2 (source:fsxn-ems)
    │
    ▼
Datadog Monitor → PagerDuty / Slack / Email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end latency: &lt;strong&gt;around 30 seconds&lt;/strong&gt; in my validation environment (ap-northeast-1). Your latency will vary depending on ONTAP event delivery, API Gateway/Lambda behavior, Datadog ingest latency, and notification routing.&lt;/p&gt;

&lt;p&gt;Compare this to the audit log path (Part 2), whose latency depends on the log rotation interval plus the scheduler frequency. EMS webhooks are event-driven rather than scheduled, so alerts arrive in tens of seconds rather than minutes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production security note&lt;/strong&gt;: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Before production use, review API Gateway authorization, source IP restrictions, WAF, resource policies, IAM authorization, or a Lambda authorizer. Use the repository’s Security Review Checklist and Webhook Security Guide for the production baseline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Deploying the EMS Integration
&lt;/h2&gt;

&lt;p&gt;The EMS Lambda is deployed alongside the FPolicy shipping Lambda in a single stack. Note that the FPolicy TCP listener itself remains a separate ECS Fargate-based path (as described in Part 1) because ONTAP FPolicy requires a persistent TCP connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template-ems-fpolicy.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-ems-fpolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Gets Created
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EMS Lambda&lt;/td&gt;
&lt;td&gt;Receives EMS webhooks, normalizes, ships to Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy Lambda&lt;/td&gt;
&lt;td&gt;Receives FPolicy events from SQS, ships to Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway (from shared EMS webhook stack)&lt;/td&gt;
&lt;td&gt;HTTPS endpoint for ONTAP EMS webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Roles&lt;/td&gt;
&lt;td&gt;Least-privilege for each Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Log Groups&lt;/td&gt;
&lt;td&gt;Execution logs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Webhook Security
&lt;/h3&gt;

&lt;p&gt;For production, do not expose an unauthenticated webhook endpoint. ONTAP EMS webhook destinations support HTTPS and mutual authentication options. Use HTTPS for the API Gateway endpoint, restrict access where possible, and consider validating a shared secret or header in the Lambda handler.&lt;/p&gt;
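
&lt;p&gt;The shared-secret check mentioned above could look roughly like the following. This is a sketch under assumptions, not the repository's handler: the header name is hypothetical, the expected secret is passed in (in practice it would come from Secrets Manager or similar), and whether your ONTAP webhook destination or an intermediate layer can attach such a header must be verified for your environment.&lt;/p&gt;

```python
# Sketch only: the header name and the secret source are assumptions.
import hmac

SECRET_HEADER = "x-ems-webhook-token"  # hypothetical header name


def is_authorized(event, expected_secret):
    """Check a shared-secret header on an API Gateway proxy event."""
    if not expected_secret:
        return False  # fail closed when no secret is configured
    headers = event.get("headers") or {}
    supplied = ""
    # HTTP header names are case-insensitive, so compare on lowercased keys.
    for name, value in headers.items():
        if name.lower() == SECRET_HEADER:
            supplied = value or ""
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

&lt;p&gt;A handler would call this before parsing the body and return a 401 or 403 response when it fails. This complements, rather than replaces, API Gateway authorization, resource policies, and WAF rules.&lt;/p&gt;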

&lt;h3&gt;
  
  
  ONTAP EMS Configuration
&lt;/h3&gt;

&lt;p&gt;After deployment, configure ONTAP EMS to forward ARP-related events to the API Gateway endpoint. At minimum, include &lt;code&gt;arw.volume.state&lt;/code&gt; and other &lt;code&gt;arw.*&lt;/code&gt; events you want to monitor. Refer to the &lt;a href="https://docs.netapp.com/us-en/ontap/error-messages/configure-webhooks-event-notifications-task.html" rel="noopener noreferrer"&gt;NetApp EMS webhook documentation&lt;/a&gt; for destination and filter configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The EMS Lambda Handler
&lt;/h2&gt;

&lt;p&gt;The handler receives an API Gateway proxy event containing the EMS webhook payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Process EMS webhook from ONTAP via API Gateway.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_request_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMS handler invoked: requestId=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract EMS events from webhook body
&lt;/span&gt;    &lt;span class="n"&gt;ems_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_ems_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parsed %d EMS event(s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ems_events&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Normalize to common schema
&lt;/span&gt;    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_ems_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ems_events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Format for Datadog
&lt;/span&gt;    &lt;span class="n"&gt;dd_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_format_for_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Ship to Datadog
&lt;/span&gt;    &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_ship_to_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dd_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_api_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMS events processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ems_events&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shipped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EMS Event Normalization
&lt;/h3&gt;

&lt;p&gt;ONTAP EMS events arrive with fields like &lt;code&gt;messageName&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;node&lt;/code&gt;, &lt;code&gt;svmName&lt;/code&gt;, &lt;code&gt;parameters&lt;/code&gt;. The handler normalizes them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_ems_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Normalize raw EMS events to internal schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messageName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svmName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Datadog Formatting (source:fsxn-ems)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_format_for_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Format normalized EMS events for Datadog Logs API v2.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;dd_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dd_logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ddsource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fsxn-ems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ddtags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source:fsxn-ems,service:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DD_SERVICE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,env:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DD_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hostname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DD_SERVICE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attributes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dd_logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
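
&lt;p&gt;The formatted list is what gets posted to Datadog. The shipping code is not shown in this article, so the following is only a minimal sketch of that step using the standard library. The helper name &lt;code&gt;build_logs_request&lt;/code&gt;, the &lt;code&gt;env:dev&lt;/code&gt; tag, and the dummy API key are assumptions for illustration; the intake hostname pattern should be checked against the documentation for your Datadog site.&lt;/p&gt;

```python
import json
import urllib.request


def build_logs_request(dd_logs, api_key, site="datadoghq.com"):
    """Build (but do not send) a POST request for the Datadog Logs API v2.

    Hypothetical helper: the repository's own shipping code may differ.
    """
    # Assumption: the intake host follows the pattern http-intake.logs.{site}.
    url = f"https://http-intake.logs.{site}/api/v2/logs"
    body = json.dumps(dd_logs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
        },
    )


# One entry shaped like the output of _format_for_datadog() (values are samples).
sample = [{
    "ddsource": "fsxn-ems",
    "ddtags": "source:fsxn-ems,service:fsxn-ontap,env:dev",
    "hostname": "fsxn-node-01",
    "service": "fsxn-ontap",
    "message": "Anti-ransomware: Volume vol_data state changed to attack-detected",
}]
request = build_logs_request(sample, api_key="dummy-key", site="ap1.datadoghq.com")
print(request.full_url)

# Sending is left to the caller, for example:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

&lt;p&gt;Building the request separately from sending it keeps the payload easy to unit test without network access or a real API key.&lt;/p&gt;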



&lt;h2&gt;
  
  
  ARP Event Payload (Normalized by Lambda)
&lt;/h2&gt;

&lt;p&gt;ONTAP EMS webhooks deliver event notifications to the API Gateway endpoint. The Lambda's &lt;code&gt;_extract_ems_events()&lt;/code&gt; function parses the incoming API Gateway proxy event body, then &lt;code&gt;_normalize_ems_events()&lt;/code&gt; produces the following internal schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arw.volume.state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_node"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fsxn-node-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"svm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"svm-prod-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-17T01:04:22Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Anti-ransomware: Volume vol_data state changed to attack-detected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"volume_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vol_data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"attack-detected"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Datadog, the normalized event arrives with these fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-ems&lt;/span&gt;
&lt;span class="py"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-node-01&lt;/span&gt;
&lt;span class="py"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-ontap&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.event_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;arw.volume.state&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.svm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;svm-prod-01&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.parameters.volume_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;vol_data&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.parameters.state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;attack-detected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6938t9vha04oxhtypb5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6938t9vha04oxhtypb5q.png" alt="ARP event in Datadog Log Explorer" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffptforpw8q8ts2uiuyw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffptforpw8q8ts2uiuyw3.png" alt="ARP event detail panel" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Datadog Monitor
&lt;/h2&gt;

&lt;p&gt;Create a Monitor that triggers whenever ARP reports an &lt;code&gt;attack-detected&lt;/code&gt; state:&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Log Explorer search query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Datadog Monitor API JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"🚨 FSx for ONTAP: Ransomware Detected (ARP)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"log alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"logs(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).index(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).rollup(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).last(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;5m&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;) &amp;gt; 0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"🚨 ONTAP Autonomous Ransomware Protection detected suspicious activity.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;**Volume**: {{attributes.parameters.volume_name}}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;**SVM**: {{attributes.svm}}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;**Node**: {{host}}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;**Time**: {{date}}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;## Recommended Actions&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;1. Verify the ARP event in ONTAP and Datadog.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;2. Check FPolicy/audit logs for user/client IP correlation.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;3. Follow the approved storage incident response runbook for snapshot, access restriction, or recovery actions.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;@pagerduty @slack-security-alerts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"thresholds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notify_no_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_delay"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
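
&lt;p&gt;If you prefer to keep the monitor definition in code, the same JSON can be assembled programmatically. This is a minimal sketch; &lt;code&gt;build_arp_monitor&lt;/code&gt; is a hypothetical helper, and the shortened &lt;code&gt;message&lt;/code&gt; is a placeholder for the full notification text above. The resulting dict is the kind of body you would submit to the Datadog Monitors API, authenticated with both an API key and an application key.&lt;/p&gt;

```python
import json


def build_arp_monitor(window="5m", index="*"):
    """Assemble a log-alert monitor definition like the JSON above."""
    search = (
        "source:fsxn-ems "
        "@attributes.event_name:arw.volume.state "
        "@attributes.parameters.state:attack-detected"
    )
    query = (
        f'logs("{search}").index("{index}")'
        f'.rollup("count").last("{window}") > 0'
    )
    return {
        "name": "FSx for ONTAP: Ransomware Detected (ARP)",
        "type": "log alert",
        "query": query,
        # Placeholder: reuse the full notification text from the JSON above.
        "message": "ONTAP Autonomous Ransomware Protection detected suspicious activity.",
        "options": {
            "thresholds": {"critical": 0},
            "notify_no_data": False,
            "evaluation_delay": 0,
        },
    }


monitor = build_arp_monitor()
print(json.dumps(monitor, indent=2))
```

&lt;p&gt;Keeping the evaluation window as a parameter makes it easy to experiment with a longer window in a test environment without editing the query string by hand.&lt;/p&gt;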



&lt;h3&gt;
  
  
  What This Monitor Does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggers on&lt;/strong&gt;: Any &lt;code&gt;arw.volume.state&lt;/code&gt; event with &lt;code&gt;state:attack-detected&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold&lt;/strong&gt;: Critical when count &amp;gt; 0 in a 5-minute window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification&lt;/strong&gt;: PagerDuty + Slack with volume name, SVM, and response steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-data handling&lt;/strong&gt;: Disabled (absence of ARP events is normal)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Adjust template variables (&lt;code&gt;{{attributes.*}}&lt;/code&gt;, &lt;code&gt;{{host}}&lt;/code&gt;, &lt;code&gt;{{date}}&lt;/code&gt;) based on how your Datadog site renders log attributes in monitor notifications. Test with a simulated event before relying on production alerts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  FPolicy: The Complementary Signal
&lt;/h2&gt;

&lt;p&gt;While ARP detects the encryption &lt;em&gt;pattern&lt;/em&gt;, FPolicy provides the file-level &lt;em&gt;detail&lt;/em&gt;. Together they answer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is ransomware active?&lt;/td&gt;
&lt;td&gt;ARP (EMS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which files are affected?&lt;/td&gt;
&lt;td&gt;FPolicy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who is doing it?&lt;/td&gt;
&lt;td&gt;FPolicy (&lt;code&gt;user&lt;/code&gt; field)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;From where?&lt;/td&gt;
&lt;td&gt;FPolicy (&lt;code&gt;client_ip&lt;/code&gt; field)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What operations?&lt;/td&gt;
&lt;td&gt;FPolicy (&lt;code&gt;operation&lt;/code&gt;: create, write, rename, delete)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  FPolicy Event in Datadog
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-fpolicy&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;create&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;/vol/data/finance/confidential_report.xlsx&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;suspicious_user@corp.local&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.client_ip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;10.0.1.55&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;cifs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda2xm6pmhu8bqw5883pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda2xm6pmhu8bqw5883pj.png" alt="FPolicy events in Datadog" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlation Query
&lt;/h3&gt;

&lt;p&gt;After an ARP alert, investigate with FPolicy data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source:fsxn-fpolicy @attributes.svm:svm-prod-01 @attributes.operation:(create OR write OR rename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



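&lt;p&gt;This shows file create, write, and rename operations on the affected SVM, which helps identify the responsible user and client. Add &lt;code&gt;delete&lt;/code&gt; to the operation list if removals matter for your investigation, and narrow the time range to the minutes around the ARP alert.&lt;/p&gt;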
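
&lt;p&gt;The correlation query can also be generated from the ARP event itself, so the SVM and time range always match the alert. The sketch below builds a search body in the shape accepted by the Datadog Logs Search API (a &lt;code&gt;filter&lt;/code&gt; with &lt;code&gt;query&lt;/code&gt;, &lt;code&gt;from&lt;/code&gt;, and &lt;code&gt;to&lt;/code&gt;). The helper name and the 15-minute/5-minute window are assumptions chosen for illustration.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def build_fpolicy_search(arp_event, minutes_before=15, minutes_after=5):
    """Build a log search body scoped to the SVM and time of an ARP event."""
    # Python versions before 3.11 do not parse a trailing "Z", so normalize it.
    detected_at = datetime.fromisoformat(arp_event["timestamp"].replace("Z", "+00:00"))
    query = (
        "source:fsxn-fpolicy "
        f"@attributes.svm:{arp_event['svm']} "
        "@attributes.operation:(create OR write OR rename)"
    )
    return {
        "filter": {
            "query": query,
            "from": (detected_at - timedelta(minutes=minutes_before)).isoformat(),
            "to": (detected_at + timedelta(minutes=minutes_after)).isoformat(),
        }
    }


# Fields taken from the normalized ARP event shown earlier.
arp_event = {"svm": "svm-prod-01", "timestamp": "2026-05-17T01:04:22Z"}
search_body = build_fpolicy_search(arp_event)
print(search_body["filter"]["query"])
```

&lt;p&gt;Looking back further than forward reflects that file activity usually precedes the ARP detection; tune both values to your environment.&lt;/p&gt;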

&lt;h2&gt;
  
  
  Incident Response Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; ARP fires → EMS webhook → Datadog alert (around 30 seconds)
     │
&lt;span class="p"&gt;2.&lt;/span&gt; Responder receives PagerDuty/Slack notification
     │
&lt;span class="p"&gt;3.&lt;/span&gt; Verify in Datadog and ONTAP:
&lt;span class="p"&gt;   -&lt;/span&gt; source:fsxn-ems → confirm ARP event details
&lt;span class="p"&gt;   -&lt;/span&gt; source:fsxn-fpolicy → identify user, IP, affected files
&lt;span class="p"&gt;   -&lt;/span&gt; ONTAP: security anti-ransomware volume show
     │
&lt;span class="p"&gt;4.&lt;/span&gt; Correlate and assess:
&lt;span class="p"&gt;   -&lt;/span&gt; Is this a true positive or legitimate bulk operation?
&lt;span class="p"&gt;   -&lt;/span&gt; What is the blast radius (volumes, files, users)?
     │
&lt;span class="p"&gt;5.&lt;/span&gt; Containment (only after verification, per approved runbook):
&lt;span class="p"&gt;   -&lt;/span&gt; Create snapshot (preserve recovery point)
&lt;span class="p"&gt;   -&lt;/span&gt; Restrict volume access if confirmed malicious
&lt;span class="p"&gt;   -&lt;/span&gt; Review ARP suspect list
     │
&lt;span class="p"&gt;6.&lt;/span&gt; Recovery:
&lt;span class="p"&gt;   -&lt;/span&gt; Restore from snapshot (pre-attack state)
&lt;span class="p"&gt;   -&lt;/span&gt; Re-enable access after containment
&lt;span class="p"&gt;   -&lt;/span&gt; Update audit policies if gaps found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: ARP alerts are high-confidence signals, but false positives can occur (e.g., legitimate backup encryption, bulk file operations). Always verify before applying disruptive containment actions such as restricting volume access. Follow your organization's incident response process.&lt;/p&gt;
&lt;/blockquote&gt;
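
&lt;p&gt;For step 4, a quick summary of the FPolicy events can help judge blast radius before any containment decision. The following is a small sketch that assumes you have already exported the matching events as dictionaries with the &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;client_ip&lt;/code&gt;, &lt;code&gt;operation&lt;/code&gt;, and &lt;code&gt;file_path&lt;/code&gt; fields shown earlier. The function name and the sample records are illustrative only.&lt;/p&gt;

```python
from collections import Counter


def summarize_fpolicy_events(events):
    """Count activity per user, client, and operation for triage."""
    users = Counter(e.get("user", "unknown") for e in events)
    clients = Counter(e.get("client_ip", "unknown") for e in events)
    operations = Counter(e.get("operation", "unknown") for e in events)
    files = {e.get("file_path") for e in events if e.get("file_path")}
    return {
        "top_users": users.most_common(3),
        "top_clients": clients.most_common(3),
        "operations": dict(operations),
        "distinct_files": len(files),
    }


# Illustrative sample records, not real validation data.
sample_events = [
    {"user": "suspicious_user@corp.local", "client_ip": "10.0.1.55",
     "operation": "create", "file_path": "/vol/data/finance/confidential_report.xlsx"},
    {"user": "suspicious_user@corp.local", "client_ip": "10.0.1.55",
     "operation": "rename", "file_path": "/vol/data/finance/confidential_report.xlsx"},
    {"user": "other_user@corp.local", "client_ip": "10.0.1.77",
     "operation": "write", "file_path": "/vol/data/hr/notes.docx"},
]
summary = summarize_fpolicy_events(sample_events)
print(summary["top_users"])
```

&lt;p&gt;A single user or client dominating the counts points toward a compromised account or host, while activity spread evenly across many users is more consistent with a legitimate bulk operation.&lt;/p&gt;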

&lt;p&gt;For a more detailed role-based runbook, see the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/arp-incident-response-guide.md" rel="noopener noreferrer"&gt;ARP Incident Response Guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond ARP: Other EMS Use Cases
&lt;/h2&gt;

&lt;p&gt;The same EMS webhook pipeline handles other critical ONTAP events:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EMS Event&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arw.volume.state&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;alert&lt;/td&gt;
&lt;td&gt;Ransomware detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wafl.quota.softlimit.exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warning&lt;/td&gt;
&lt;td&gt;Capacity planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wafl.quota.hardlimit.exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;Immediate capacity action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cf.fsm.takeover&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;alert&lt;/td&gt;
&lt;td&gt;HA failover notification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sms.vol.full&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;Volume full — data at risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;net.linkDown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warning&lt;/td&gt;
&lt;td&gt;Network connectivity issue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of these events arrive in Datadog as &lt;code&gt;source:fsxn-ems&lt;/code&gt; with the event name in &lt;code&gt;@attributes.event_name&lt;/code&gt;, so you can build a targeted Monitor for each scenario. For the full cross-vendor field mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/normalized-event-schema.md" rel="noopener noreferrer"&gt;Normalized Event Schema&lt;/a&gt;.&lt;/p&gt;
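
&lt;p&gt;Because every event shares the same source and attribute layout, per-event monitor queries can be generated from a lookup table. This is a sketch under that assumption; the dictionary below simply mirrors the table above, and &lt;code&gt;monitor_query&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Event names and use cases copied from the table above.
EMS_USE_CASES = {
    "arw.volume.state": "Ransomware detection",
    "wafl.quota.softlimit.exceeded": "Capacity planning",
    "wafl.quota.hardlimit.exceeded": "Immediate capacity action",
    "cf.fsm.takeover": "HA failover notification",
    "sms.vol.full": "Volume full",
    "net.linkDown": "Network connectivity issue",
}


def monitor_query(event_name, window="5m"):
    """Return a log-alert query that fires on any event with this name."""
    search = f"source:fsxn-ems @attributes.event_name:{event_name}"
    return f'logs("{search}").index("*").rollup("count").last("{window}") > 0'


for name, use_case in EMS_USE_CASES.items():
    print(f"{use_case}: {monitor_query(name)}")
```

&lt;p&gt;For noisy events such as quota warnings, a longer window or a higher threshold than the ARP monitor uses is probably more appropriate.&lt;/p&gt;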

&lt;h2&gt;
  
  
  Validation Results
&lt;/h2&gt;

&lt;p&gt;This integration was validated end-to-end:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ARP event → Datadog&lt;/td&gt;
&lt;td&gt;✅ Arrived&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quota exceeded → Datadog&lt;/td&gt;
&lt;td&gt;✅ Arrived&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy file create → Datadog&lt;/td&gt;
&lt;td&gt;✅ Arrived (via SQS → Lambda path)&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda error handling&lt;/td&gt;
&lt;td&gt;✅ DLQ capture&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key from Secrets Manager&lt;/td&gt;
&lt;td&gt;✅ Cached&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Validation performed in ap-northeast-1 with the deployed &lt;code&gt;fsxn-datadog-ems-fpolicy&lt;/code&gt; stack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaw8jtsuq80xpwmutw7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaw8jtsuq80xpwmutw7y.png" alt="Lambda execution logs" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Considerations for Security Teams
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Webhook security&lt;/strong&gt;: Use HTTPS for EMS webhook delivery. Do not expose an unauthenticated API Gateway endpoint in production. Validate a shared secret, header, or mTLS identity where possible.&lt;/p&gt;
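
&lt;p&gt;As one way to validate a shared secret inside the Lambda, the check below compares a request header against the expected value in constant time. It is a minimal sketch: the header name &lt;code&gt;x-webhook-secret&lt;/code&gt; and the function name are assumptions, and it presumes your webhook sender (or a proxy in front of it) can attach a custom header.&lt;/p&gt;

```python
import hmac


def is_authorized(headers, expected_secret, header_name="x-webhook-secret"):
    """Return True when the request carries the expected shared secret."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {str(k).lower(): str(v) for k, v in (headers or {}).items()}
    supplied = lowered.get(header_name.lower(), "")
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_secret.encode("utf-8")
    )


# The secret would normally come from Secrets Manager, not a literal.
print(is_authorized({"X-Webhook-Secret": "s3cret"}, "s3cret"))
print(is_authorized({"X-Webhook-Secret": "wrong"}, "s3cret"))
```

&lt;p&gt;Rejecting the request before parsing the body keeps unauthenticated payloads out of the normalization and shipping path.&lt;/p&gt;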

&lt;p&gt;&lt;strong&gt;Detection latency&lt;/strong&gt;: EMS webhooks are event-driven. ARP detection itself depends on ONTAP's ML model: it typically fires within seconds of detecting the pattern, not after a fixed interval. End-to-end latency from ARP detection to Datadog visibility depends on webhook delivery, Lambda processing, and Datadog ingest; in the validation above, events reached Datadog in about 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False positives&lt;/strong&gt;: ARP can trigger on legitimate bulk encryption operations (e.g., backup software encrypting files). Design your response workflow to include a verification step before disruptive actions like restricting volume access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage&lt;/strong&gt;: ARP behavior depends on your ONTAP version, volume type, and whether ARP/AI is available. Older NAS FlexVol configurations may start in learning mode before active detection, while newer ONTAP versions (9.16.1+ with ARP/AI) can become active immediately for supported volumes. Always verify &lt;code&gt;security anti-ransomware volume show&lt;/code&gt; before relying on alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;: The EMS event in Datadog serves as the detection timestamp for incident timelines. FPolicy events provide the forensic detail. Together they form a complete audit trail from detection to response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost profile&lt;/strong&gt;: EMS events are usually low-volume and alert-oriented, while FPolicy can be high-volume depending on policy scope. Treat their Datadog ingest and alerting cost profiles separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;If you want the shortest path to a first successful ARP alert test, see the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/quick-start-minimum.md" rel="noopener noreferrer"&gt;minimum quick start&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following simulated event exercises the Lambda normalization and Datadog shipping path. Your actual ONTAP EMS webhook payload may differ depending on the EMS webhook configuration, so validate with a real EMS event before production use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy EMS + FPolicy integration&lt;/span&gt;
aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template-ems-fpolicy.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-ems-fpolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-secret-arn&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM

&lt;span class="c"&gt;# Create a test event file&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; arp-test-event.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "body": "{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;messageName&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;arw.volume.state&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;severity&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;alert&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;node&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;fsxn-node-01&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;svmName&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;svm-prod-01&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;time&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;Anti-ransomware: Volume vol_data state changed to attack-detected&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;parameters&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;volume_name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;vol_data&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;state&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;attack-detected&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;}}",
  "requestContext": {"requestId": "test"}
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Invoke Lambda with the test event&lt;/span&gt;
aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; fsxn-datadog-ems-fpolicy-ems &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; file://arp-test-event.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-binary-format&lt;/span&gt; raw-in-base64-out &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1 &lt;span class="se"&gt;\&lt;/span&gt;
  arp-test-output.json

&lt;span class="c"&gt;# Check Datadog: source:fsxn-ems @attributes.event_name:arw.volume.state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
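&lt;p&gt;Because the &lt;code&gt;body&lt;/code&gt; field is a JSON document encoded as a string inside another JSON document, hand-escaping it is easy to get wrong. The sketch below builds the same simulated event programmatically. The function name &lt;code&gt;build_arp_test_event&lt;/code&gt; is hypothetical, and the field values simply mirror the simulated event above rather than a captured ONTAP payload.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from datetime import datetime, timezone


def build_arp_test_event(volume: str = "vol_data", svm: str = "svm-prod-01") -> dict:
    """Build an API Gateway-style event whose body is a JSON-encoded EMS payload."""
    ems_payload = {
        "messageName": "arw.volume.state",
        "severity": "alert",
        "node": "fsxn-node-01",
        "svmName": svm,
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": f"Anti-ransomware: Volume {volume} state changed to attack-detected",
        "parameters": {"volume_name": volume, "state": "attack-detected"},
    }
    # The body must be a string, so the inner payload is serialized on its own
    # and json.dumps takes care of escaping the nested quotes.
    return {"body": json.dumps(ems_payload), "requestContext": {"requestId": "test"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Writing &lt;code&gt;json.dumps(build_arp_test_event())&lt;/code&gt; to &lt;code&gt;arp-test-event.json&lt;/code&gt; produces a file that can be passed to &lt;code&gt;aws lambda invoke&lt;/code&gt; in the same way.&lt;/p&gt;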



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This completes the Datadog series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: Architecture and project introduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: Audit log pipeline implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Event-driven ransomware detection (this post)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coming up next in the series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Splunk&lt;/strong&gt;: Replacing EC2 + Universal Forwarder with Lambda + HEC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: The vendor-neutral escape hatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Cloud&lt;/strong&gt;: Loki Push API with label cardinality guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each will follow the same pattern: deploy, validate, document the gotchas.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions about ARP detection or the EMS pipeline? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2 — Shipping FSx for ONTAP Logs to Datadog, The Serverless Way&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>datadog</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>Shipping FSx for ONTAP Logs to Datadog — The Serverless Way</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 09:16:31 +0000</pubDate>
      <link>https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c</link>
      <guid>https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Deploy a CloudFormation stack, configure ONTAP audit logging, and see structured file access events in Datadog Log Explorer within minutes — no EC2, no NFS mounts, no agents. This post walks through the full implementation: CloudFormation template, Lambda handler code, Datadog field mapping, and operational validation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Part 1&lt;/a&gt;, I introduced the architecture: FSx for ONTAP audit volume → S3 Access Point → EventBridge Scheduler → Lambda → Datadog. Now let's build it.&lt;/p&gt;

&lt;p&gt;By the end of this post, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A deployed CloudFormation stack with Lambda, Scheduler, DLQ, and alarms&lt;/li&gt;
&lt;li&gt;ONTAP audit events flowing into Datadog Log Explorer&lt;/li&gt;
&lt;li&gt;Structured attributes (&lt;code&gt;@attributes.svm&lt;/code&gt;, &lt;code&gt;@attributes.user&lt;/code&gt;, &lt;code&gt;@attributes.operation&lt;/code&gt;, &lt;code&gt;@attributes.path&lt;/code&gt;, &lt;code&gt;@attributes.client_ip&lt;/code&gt;, &lt;code&gt;@attributes.result&lt;/code&gt;) ready for search, filtering, and Datadog facet creation&lt;/li&gt;
&lt;li&gt;An operational CloudWatch dashboard monitoring pipeline health&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before deploying, you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP file system&lt;/strong&gt; with an SVM configured for audit logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP S3 Access Point&lt;/strong&gt; attached to the audit volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog account&lt;/strong&gt; (free trial works) with an API Key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key in Secrets Manager&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; fsxn-datadog-api-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"api_key":"&amp;lt;your-dd-api-key&amp;gt;"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP audit logging enabled&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Time-based rotation for quick validation&lt;/span&gt;
vserver audit create &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;svm-name&amp;gt; &lt;span class="nt"&gt;-destination&lt;/span&gt; /audit_log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-events&lt;/span&gt; file-ops &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-format&lt;/span&gt; evtx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-rotate-schedule-minute&lt;/span&gt; 0,5,10,15,20,25,30,35,40,45,50,55
vserver audit &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;svm-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;For quick validation, use time-based rotation. If you only use &lt;code&gt;-rotate-size&lt;/code&gt;, low-volume environments may not produce rotated audit files within the expected validation window. Adjust the &lt;code&gt;-events&lt;/code&gt; list based on what you want to audit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Enabling &lt;code&gt;vserver audit&lt;/code&gt; is only one part of file access auditing. Make sure the target SMB folders have SACLs configured, or NFSv4 ACL audit flags are set for NFS workloads. Otherwise, the audit pipeline may be healthy but no file access events will be generated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For detailed ONTAP-side setup, including audit volume sizing, SACL/NFSv4 ACL examples, and source health checks, see the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/ontap-audit-setup.md" rel="noopener noreferrer"&gt;ONTAP Audit Setup Guide&lt;/a&gt; and &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/operational-guide.md" rel="noopener noreferrer"&gt;Operational Guide&lt;/a&gt;.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;
&lt;strong&gt;Verify how audit files appear via S3 API&lt;/strong&gt; (to set &lt;code&gt;AuditLogPrefix&lt;/code&gt; correctly):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api list-objects-v2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; &amp;lt;fsx-s3-access-point-arn-or-alias&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-keys&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;AuditLogPrefix&lt;/code&gt; to match the key prefix you see. If the access point is attached directly to the audit volume root, this may be empty.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;code&gt;/audit_log&lt;/code&gt; is the ONTAP namespace path. The S3 object key prefix can differ depending on the access point attachment, so always verify with &lt;code&gt;list-objects-v2&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The CloudFormation Stack
&lt;/h2&gt;

&lt;p&gt;The Datadog integration deploys as a single self-contained stack:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;FsxS3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;AuditLogPrefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;prefix-from-list-objects-v2&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ScheduleRate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"rate(5 minutes)"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Gets Created
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Function&lt;/td&gt;
&lt;td&gt;Reads audit logs from S3 AP, parses EVTX/XML, ships to Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge Scheduler&lt;/td&gt;
&lt;td&gt;Invokes Lambda on the &lt;code&gt;ScheduleRate&lt;/code&gt; schedule (default: every 5 minutes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler IAM Role&lt;/td&gt;
&lt;td&gt;Allows Scheduler to invoke Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Execution Role&lt;/td&gt;
&lt;td&gt;S3 AP read, Secrets Manager read, CloudWatch Logs, DLQ send permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dead Letter Queue (SQS)&lt;/td&gt;
&lt;td&gt;Captures failed events for replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Alarms (3)&lt;/td&gt;
&lt;td&gt;Errors, throttles, DLQ depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Dashboard&lt;/td&gt;
&lt;td&gt;Operational health: errors, duration, invocations, DLQ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Log Group&lt;/td&gt;
&lt;td&gt;Lambda execution logs (30-day retention)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Parameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FsxS3AccessPointArn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;FSx for ONTAP S3 Access Point ARN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DatadogApiKeySecretArn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Secrets Manager ARN for the API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DatadogSite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Datadog site (default: &lt;code&gt;ap1.datadoghq.com&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ScheduleRate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Processing frequency (default: &lt;code&gt;rate(5 minutes)&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AuditLogPrefix&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Object key prefix as seen via S3 API. Leave empty if audit files appear at the access point root.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VpcEnabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Enable VPC config — requires NAT Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Lambda Handler
&lt;/h2&gt;

&lt;p&gt;The handler follows a straightforward flow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scheduled invocation
  → List objects from FSx for ONTAP S3 AP (via S3 ListObjectsV2)
  → Filter by checkpoint (skip already-processed files)
  → For each new file:
      → Read via S3 GetObject
      → Detect format (EVTX magic bytes or XML declaration)
      → Parse into normalized events
      → Format for Datadog Logs API v2
      → Batch (≤5MB, ≤1000 items per request)
      → Ship with exponential backoff (max 3 attempts)
  → Update checkpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Datadog API Limits
&lt;/h3&gt;

&lt;p&gt;The Datadog Logs API v2 enforces the following per-request limits (&lt;a href="https://docs.datadoghq.com/api/latest/logs/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum payload size (uncompressed): &lt;strong&gt;5MB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Maximum size for a single log: &lt;strong&gt;1MB&lt;/strong&gt; (larger logs are truncated, not rejected)&lt;/li&gt;
&lt;li&gt;Maximum array size: &lt;strong&gt;1000 entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shipper batches conservatively below these limits.&lt;/p&gt;
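&lt;p&gt;The repository's batching helper is not reproduced in this post, so the following is only a sketch of how batching under both limits could work. The name &lt;code&gt;create_batches_sketch&lt;/code&gt; and the specific thresholds (4MB, 500 entries) are assumptions chosen for illustration, not the values the shipper uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from typing import Iterator

# Illustrative thresholds below the documented 5MB / 1000-entry maximums.
MAX_BATCH_BYTES = 4 * 1024 * 1024
MAX_BATCH_ITEMS = 500


def create_batches_sketch(
    logs: list[dict],
    max_bytes: int = MAX_BATCH_BYTES,
    max_items: int = MAX_BATCH_ITEMS,
) -> Iterator[list[dict]]:
    """Yield batches that stay under both the byte limit and the item limit."""
    batch: list[dict] = []
    batch_bytes = 2  # the enclosing "[" and "]" of the JSON array
    for log in logs:
        # Serialized size of this entry, plus two bytes for the ", "
        # separator that json.dumps emits by default.
        entry_bytes = len(json.dumps(log).encode("utf-8")) + 2
        if batch and (len(batch) >= max_items or batch_bytes + entry_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 2
        batch.append(log)
        batch_bytes += entry_bytes
    if batch:
        yield batch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The size estimate slightly overcounts, which keeps it on the safe side. A single entry larger than &lt;code&gt;max_bytes&lt;/code&gt; still goes out alone in its own batch, so oversized logs need separate handling if they can occur.&lt;/p&gt;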

&lt;h3&gt;
  
  
  Core Shipping Logic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_ship_to_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Ship normalized logs to Datadog Logs Intake API v2.

    If any batch fails after retries, raise an exception so the Lambda
    invocation is treated as failed and the checkpoint is not advanced.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;failed_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_create_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_send_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;failed_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failed_batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed_batches&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; batch(es) failed after retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shipped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Checkpoint Semantics
&lt;/h3&gt;

&lt;p&gt;The checkpoint is advanced only after all batches for an audit log file are successfully delivered to Datadog. If any batch fails after retries, the Lambda invocation fails (raises an exception) and the checkpoint is not updated.&lt;/p&gt;

&lt;p&gt;This makes the pipeline &lt;strong&gt;at-least-once&lt;/strong&gt;: the same audit file may be retried on the next scheduled invocation, so downstream queries should tolerate duplicate events. For production, consider adding a deterministic event ID derived from the audit file key and event record offset to support deduplication where your observability platform supports it.&lt;/p&gt;
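&lt;p&gt;One way to derive such an ID is to hash the file key together with the record offset. This is a sketch under assumptions: the function names and the &lt;code&gt;event_id&lt;/code&gt; attribute are hypothetical, and whether duplicates can actually be collapsed depends on how your queries or downstream tooling use the field.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib


def event_id(file_key: str, record_offset: int) -> str:
    """Derive a stable ID from the audit file key and the record's offset."""
    # The same (key, offset) pair always hashes to the same ID, so an event
    # that is shipped again after a retry can be recognized as a duplicate.
    raw = f"{file_key}:{record_offset}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def attach_event_id(log: dict, file_key: str, record_offset: int) -> dict:
    """Return a copy of the log with the ID stored under its attributes."""
    attributes = dict(log.get("attributes", {}))
    attributes["event_id"] = event_id(file_key, record_offset)  # hypothetical field name
    return {**log, "attributes": attributes}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the ID depends only on the source file and position, it stays the same across Lambda retries and across scheduled invocations that reprocess the same file.&lt;/p&gt;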

&lt;blockquote&gt;
&lt;p&gt;Because EventBridge Scheduler invokes Lambda asynchronously, a failed invocation (unhandled exception) triggers Lambda's built-in retry behavior (up to 2 retries by default). After all retries are exhausted, the event payload is sent to the configured DLQ.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Retry with Exponential Backoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_send_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a single batch with retry on 429/5xx, up to MAX_RETRIES attempts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;DATADOG_LOGS_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DD-API-KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# jitter
&lt;/span&gt;            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="c1"&gt;# Client error (4xx) — don't retry
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation uses exponential backoff with jitter (&lt;code&gt;2^attempt + random&lt;/code&gt;) to avoid synchronized retries when multiple Lambda invocations hit vendor-side throttling simultaneously. Because &lt;code&gt;attempt&lt;/code&gt; starts at 0, the base delay doubles from 1 second (1, 2, 4, ...), with up to one second of jitter added each time. Note that &lt;code&gt;MAX_RETRIES&lt;/code&gt; in the code represents the total number of attempts, not retries after an initial attempt.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Key Caching
&lt;/h3&gt;

&lt;p&gt;The API key is fetched from Secrets Manager once per Lambda execution context (on cold start) and cached in a module-level variable, so warm invocations skip the Secrets Manager call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_api_key_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_KEY_SECRET_ARN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;_api_key_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dd_api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Datadog Field Mapping
&lt;/h2&gt;

&lt;p&gt;Every audit event arrives in Datadog with structured attributes. The Lambda sends these via the Datadog Logs API v2 payload fields (&lt;code&gt;ddsource&lt;/code&gt;, &lt;code&gt;hostname&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;) and custom attributes nested under &lt;code&gt;attributes&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Datadog Log Explorer&lt;/th&gt;
&lt;th&gt;Payload Field&lt;/th&gt;
&lt;th&gt;ONTAP Source&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;source&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ddsource&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configured&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configured&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn-ontap&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;host&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hostname&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SVM name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.svm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.svm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SVMName / Computer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UserName / SubjectUserName&lt;/td&gt;
&lt;td&gt;&lt;code&gt;admin@corp.local&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.client_ip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.client_ip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ClientIP / IpAddress&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.50&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operation / ObjectType&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReadData&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ObjectName&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/vol/data/reports/q4.xlsx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Result / Keywords&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Success&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.event_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.event_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;EventID&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4663&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes._pipeline.processed_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes._pipeline.processed_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lambda timestamp&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-05-17T01:30:00Z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes._pipeline.source_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes._pipeline.source_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;S3 object key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;audit_log/audit_svm_20260517.evtx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Set &lt;code&gt;DatadogSite&lt;/code&gt; to your Datadog site, such as &lt;code&gt;datadoghq.com&lt;/code&gt; (US1), &lt;code&gt;datadoghq.eu&lt;/code&gt; (EU1), or &lt;code&gt;ap1.datadoghq.com&lt;/code&gt; (AP1/Tokyo). The site determines the API endpoint.&lt;/p&gt;
&lt;/blockquote&gt;
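
&lt;p&gt;To make the mapping concrete, here is a minimal sketch of how one parsed audit record could be turned into a Logs API v2 entry. It is not the repository's code: &lt;code&gt;FIELD_SOURCES&lt;/code&gt;, &lt;code&gt;first_present&lt;/code&gt;, &lt;code&gt;build_log_entry&lt;/code&gt;, and &lt;code&gt;intake_url&lt;/code&gt; are hypothetical names, the input is assumed to be a flat dict keyed by the ONTAP field names in the table, and serializing the raw record into &lt;code&gt;message&lt;/code&gt; and defaulting every missing field to "unknown" are assumptions.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Fallback order per normalized field, following the "ONTAP Source" column.
FIELD_SOURCES = {
    "svm": ("SVMName", "Computer"),
    "user": ("UserName", "SubjectUserName"),
    "client_ip": ("ClientIP", "IpAddress"),
    "operation": ("Operation", "ObjectType"),
    "path": ("ObjectName",),
    "result": ("Result", "Keywords"),
    "event_type": ("EventID",),
}


def first_present(raw, keys, default="unknown"):
    """Return the first non-empty value among keys, else the default (assumed)."""
    for key in keys:
        value = raw.get(key)
        if value not in (None, ""):
            return str(value)
    return default


def build_log_entry(raw, source_file, now=None):
    """Map one parsed audit record (a dict) to a Datadog log entry."""
    now = now or datetime.now(timezone.utc)
    attributes = {name: first_present(raw, keys) for name, keys in FIELD_SOURCES.items()}
    attributes["_pipeline"] = {
        "processed_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source_file": source_file,
    }
    return {
        "ddsource": "fsxn",
        "service": "fsxn-ontap",
        "hostname": attributes["svm"],
        # Assumption: the raw record is kept as the human-readable message.
        "message": json.dumps(raw, sort_keys=True),
        "attributes": attributes,
    }


def intake_url(site):
    """Build the Logs API v2 intake URL for a site such as ap1.datadoghq.com."""
    return f"https://http-intake.logs.{site}/api/v2/logs"
```

&lt;p&gt;Keeping the fallback order in one table means a new ONTAP field variant only needs a new tuple entry, not a change to the shipping code.&lt;/p&gt;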

&lt;p&gt;For the full cross-vendor mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/normalized-event-schema.md" rel="noopener noreferrer"&gt;Normalized Event Schema&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datadog Search Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All FSx for ONTAP audit events
source:fsxn

# Failed access attempts
source:fsxn @attributes.result:Failure

# Specific user activity
source:fsxn @attributes.user:"admin@corp.local"

# Delete operations on sensitive paths
source:fsxn @attributes.operation:delete @attributes.path:/vol/data/confidential/*

# Pipeline processing metadata
source:fsxn @attributes._pipeline.source_file:*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Part 3, we'll turn these queries into Datadog Monitors for ARP ransomware detection and suspicious file activity alerting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Investigation Query Starters
&lt;/h3&gt;

&lt;p&gt;When investigating an incident, start with these patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Search query&lt;/th&gt;
&lt;th&gt;Then group by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What did this user do?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.user:"suspect@corp.local"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@attributes.operation&lt;/code&gt; or &lt;code&gt;@attributes.path&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who accessed this file?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.path:"/vol/data/secret.pdf"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@attributes.user&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which clients generated failures?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.result:Failure&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@attributes.client_ip&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where are deletes concentrated?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.operation:delete&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@attributes.path&lt;/code&gt; or a path prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What happened on this SVM in the last hour?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.svm:svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@attributes.operation&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;For high-volume environments, avoid grouping by full file path unless needed. Consider deriving a lower-cardinality field such as a path prefix or data area classification.&lt;/p&gt;
&lt;/blockquote&gt;
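
&lt;p&gt;One way to derive such a lower-cardinality field is to truncate the path to its first few components before shipping. The sketch below is illustrative only: &lt;code&gt;path_prefix&lt;/code&gt; and the default depth of 3 are hypothetical, and it assumes forward-slash paths like the examples in this post.&lt;/p&gt;

```python
def path_prefix(path, depth=3):
    """Keep only the first `depth` components of an absolute path."""
    parts = [part for part in path.split("/") if part]
    return "/" + "/".join(parts[:depth])


# Example: both files collapse into the same group.
report = path_prefix("/vol/data/reports/q4.xlsx")
archive = path_prefix("/vol/data/reports/2025/q3.xlsx")
```

&lt;p&gt;Sending the result as an extra attribute next to &lt;code&gt;attributes.path&lt;/code&gt; keeps the full path available for search while giving group-by queries a bounded set of values.&lt;/p&gt;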

&lt;h2&gt;
  
  
  Operational Validation
&lt;/h2&gt;

&lt;p&gt;For a structured PoC sign-off, use the repository’s PoC Success Criteria document. It defines minimum success, operational success, and production-readiness gates across audit logs, EMS, FPolicy, and multi-backend patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Validation (5–10 minutes)
&lt;/h3&gt;

&lt;p&gt;With a 5-minute audit rotation and 5-minute Scheduler interval, the first events typically appear within a few minutes, but allow up to 10 minutes depending on timing.&lt;/p&gt;

&lt;p&gt;Before waiting for logs, generate a test file operation on the audited SMB/NFS share — such as creating and deleting a small test file — to ensure ONTAP produces an audit event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 0. Get stack outputs (log group name, DLQ URL, etc.)&lt;/span&gt;
aws cloudformation describe-stacks &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Stacks[0].Outputs'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1

&lt;span class="c"&gt;# 1. Confirm Scheduler is invoking Lambda&lt;/span&gt;
aws logs filter-log-events &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &amp;lt;LambdaLogGroupName from outputs&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import time; print(int((time.time()-300)*1000))"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1

&lt;span class="c"&gt;# 2. Confirm DLQ is empty&lt;/span&gt;
aws sqs get-queue-attributes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--queue-url&lt;/span&gt; &amp;lt;dlq-url&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-names&lt;/span&gt; All &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Attributes.ApproximateNumberOfMessages'&lt;/span&gt;

&lt;span class="c"&gt;# 3. Search in Datadog&lt;/span&gt;
&lt;span class="c"&gt;#    source:fsxn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CloudWatch Dashboard
&lt;/h3&gt;

&lt;p&gt;The stack includes a pre-built dashboard (&lt;code&gt;fsxn-datadog-integration-health&lt;/code&gt;) with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda Errors &amp;amp; Throttles&lt;/li&gt;
&lt;li&gt;Lambda Duration (avg/max)&lt;/li&gt;
&lt;li&gt;Lambda Invocations&lt;/li&gt;
&lt;li&gt;DLQ Depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production, consider publishing custom metrics such as files processed, events shipped, batch failures, and checkpoint lag to gain deeper pipeline observability beyond Lambda-level metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Watch For
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No logs in Datadog&lt;/td&gt;
&lt;td&gt;Scheduler not running, or no new audit files&lt;/td&gt;
&lt;td&gt;Check CloudWatch Logs for Lambda invocations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs arrive but fields are empty&lt;/td&gt;
&lt;td&gt;EVTX/XML parsing issue&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;@attributes.event_type&lt;/code&gt;; if it is "unknown", the parser needs tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DLQ messages appearing&lt;/td&gt;
&lt;td&gt;Datadog API rejection&lt;/td&gt;
&lt;td&gt;Check API key validity, site configuration, timestamp age&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda timeout&lt;/td&gt;
&lt;td&gt;Lambda in a VPC with only an S3 Gateway Endpoint cannot reach the FSx for ONTAP S3 Access Point&lt;/td&gt;
&lt;td&gt;Add NAT Gateway egress, or deploy Lambda outside the VPC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Old Timestamps May Not Appear in Log Explorer
&lt;/h3&gt;

&lt;p&gt;The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. If your audit files are rotated or processed too late, older events may not appear as expected in Log Explorer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Use a time-based ONTAP audit rotation schedule and a Scheduler frequency that keeps processing well within the 18-hour window.&lt;/p&gt;
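
&lt;p&gt;A processor can also guard against this at send time by checking each event's age. The following is a rough sketch, not the repository's implementation: &lt;code&gt;is_within_intake_window&lt;/code&gt; is a hypothetical helper, and the one-hour safety margin is an assumption chosen to leave room for retries.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(hours=18)  # window described above
SAFETY_MARGIN = timedelta(hours=1)   # assumption: buffer for retries and delays


def is_within_intake_window(event_time, now=None):
    """Return True if a timezone-aware event timestamp is still safely inside the window."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    return MAX_EVENT_AGE - SAFETY_MARGIN >= age
```

&lt;p&gt;Events that fail the check could be counted and logged so late processing shows up as a metric instead of as silently missing logs.&lt;/p&gt;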

&lt;h3&gt;
  
  
  Gzip Compression Issue (AP1 Site)
&lt;/h3&gt;

&lt;p&gt;During E2E validation, gzip-compressed payloads were accepted (HTTP 202) but not indexed on the AP1 site. The &lt;code&gt;ENABLE_GZIP&lt;/code&gt; parameter defaults to &lt;code&gt;false&lt;/code&gt; for this reason.&lt;/p&gt;
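
&lt;p&gt;For illustration, a compression toggle of this kind might look like the sketch below. Only the &lt;code&gt;ENABLE_GZIP&lt;/code&gt; name comes from the stack; &lt;code&gt;encode_body&lt;/code&gt; and the way the flag is parsed are assumptions.&lt;/p&gt;

```python
import gzip
import json
import os


def encode_body(entries, enable_gzip=None):
    """Serialize log entries and compress them only when gzip is enabled."""
    if enable_gzip is None:
        # Assumption: the flag is read as a "true"/"false" string, default off.
        enable_gzip = os.environ.get("ENABLE_GZIP", "false").lower() == "true"
    body = json.dumps(entries).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if enable_gzip:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return body, headers
```

&lt;p&gt;Returning the headers together with the body keeps &lt;code&gt;Content-Encoding&lt;/code&gt; in sync with what was actually sent, which helps when comparing behavior between sites.&lt;/p&gt;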

&lt;h3&gt;
  
  
  S3 Access Point Timeout in VPC
&lt;/h3&gt;

&lt;p&gt;If Lambda runs in a VPC with only an S3 Gateway Endpoint, reads from FSx for ONTAP S3 Access Points time out rather than returning an error. Add NAT Gateway egress, or deploy Lambda outside the VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day-2 Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DLQ Replay
&lt;/h3&gt;

&lt;p&gt;This stack uses an SQS queue as the Lambda asynchronous invocation DLQ. Because the DLQ is attached to Lambda (not an SQS source queue), &lt;code&gt;sqs start-message-move-task&lt;/code&gt; cannot redrive messages automatically.&lt;/p&gt;

&lt;p&gt;For replay, inspect the DLQ message, identify the failed invocation payload, and re-invoke Lambda manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inspect failed messages&lt;/span&gt;
aws sqs receive-message &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--queue-url&lt;/span&gt; &amp;lt;dlq-url&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-number-of-messages&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-names&lt;/span&gt; All &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--message-attribute-names&lt;/span&gt; All
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fixing the root cause (e.g., expired API key, Datadog site misconfiguration), re-run the scheduled processor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; &amp;lt;lambda-function-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-binary-format&lt;/span&gt; raw-in-base64-out &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1 &lt;span class="se"&gt;\&lt;/span&gt;
  replay-output.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this pattern, replay means re-running the scheduled processor, not re-submitting the DLQ message itself. Because the checkpoint is not advanced on failed delivery, audit files that failed remain eligible and are picked up again on the next invocation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For production, consider adding a dedicated replay Lambda that reads DLQ messages, validates the payload, and re-submits failed processing requests in a controlled way.&lt;/p&gt;
&lt;/blockquote&gt;
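
&lt;p&gt;A dedicated replay function along those lines could be sketched as follows. This is an assumption-heavy outline, not part of the stack: &lt;code&gt;replay_dlq_batch&lt;/code&gt; is a hypothetical name, the clients are expected to be boto3 SQS and Lambda clients passed in by the caller, and "validation" here only checks that the message body is a JSON object.&lt;/p&gt;

```python
import json


def replay_dlq_batch(sqs, lambda_client, queue_url, function_name, max_messages=10):
    """Re-submit DLQ messages to the processor; return (replayed, skipped)."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=1,
    )
    replayed, skipped = 0, 0
    for message in response.get("Messages", []):
        try:
            payload = json.loads(message["Body"])
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            skipped += 1  # leave malformed messages in the DLQ for inspection
            continue
        # Asynchronous invoke: a repeated failure lands in the DLQ again.
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",
            Payload=json.dumps(payload).encode("utf-8"),
        )
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        replayed += 1
    return replayed, skipped
```

&lt;p&gt;In a real deployment the clients would come from &lt;code&gt;boto3.client("sqs")&lt;/code&gt; and &lt;code&gt;boto3.client("lambda")&lt;/code&gt;. Passing them in as arguments keeps the function testable with stand-in objects.&lt;/p&gt;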

&lt;h3&gt;
  
  
  Checkpoint Reset (Reprocess All Files)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Warning&lt;/strong&gt;: Resetting the checkpoint makes previously processed audit files eligible for reprocessing, which can generate duplicate logs in Datadog. Use it only for controlled replay or testing.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws dynamodb delete-item &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; fsxn-observability-audit-checkpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s1"&gt;'{"svm_name": {"S": "svm-prod-01"}, "file_key": {"S": "LATEST"}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Teardown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation delete-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Deleting the stack does not affect ONTAP audit logging or data on the FSx for ONTAP volume.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cost Estimate
&lt;/h2&gt;

&lt;p&gt;For a typical deployment (1 SVM, 100MB audit logs/day, 5-minute schedule):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda (288 invocations/day × 5s avg)&lt;/td&gt;
&lt;td&gt;~$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge Scheduler&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (checkpoint)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager&lt;/td&gt;
&lt;td&gt;~$0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Logs (30-day)&lt;/td&gt;
&lt;td&gt;~$1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway (if VPC)&lt;/td&gt;
&lt;td&gt;Region-dependent hourly + per-GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (no VPC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (with VPC/NAT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$30–50+/month depending on Region&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Cost numbers are illustrative and assume a 5-minute schedule, a 5-second average runtime, and 100 MB/day of audit logs. NAT Gateway pricing varies by Region and combines an hourly charge with a per-GB data processing charge. Check the &lt;a href="https://calculator.aws/" rel="noopener noreferrer"&gt;AWS Pricing Calculator&lt;/a&gt; for your target Region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Datadog ingest and retention costs are not included in this AWS-side estimate and can become the dominant cost driver for high-volume audit policies, especially when read auditing is enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence retention&lt;/strong&gt;: This pipeline optimizes search and alerting via normalized events in Datadog. If you need audit evidence retention for compliance, design raw EVTX/XML retention separately on the audit volume or in an archive path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost control&lt;/strong&gt;: For high-volume environments, consider a tiered strategy: send security-relevant operations such as deletes, permission changes, and failed access to indexed logs; reduce, archive, or exclude noisy read events only if your audit and compliance requirements allow it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Compare this to an always-on EC2 collector instance, plus EBS, patching labor, and agent licensing.&lt;/p&gt;
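
&lt;p&gt;The Lambda row in the table can be sanity-checked with a few lines of arithmetic. The memory size and unit prices below are assumptions (the post does not state them, and the free tier is ignored), so treat the result as an order-of-magnitude check and confirm current pricing for your Region.&lt;/p&gt;

```python
INVOCATIONS_PER_DAY = 288      # one run every 5 minutes
DAYS_PER_MONTH = 30
AVG_DURATION_SECONDS = 5

# Assumptions, not values from this post.
MEMORY_GB = 0.5
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def lambda_monthly_cost(memory_gb=MEMORY_GB):
    """Estimate monthly Lambda cost as compute (GB-seconds) plus request charges."""
    invocations = INVOCATIONS_PER_DAY * DAYS_PER_MONTH
    compute = invocations * AVG_DURATION_SECONDS * memory_gb * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests
```

&lt;p&gt;Under these assumed rates the estimate is about $0.36 per month at 512 MB and about $0.54 at 768 MB, which is in the same range as the figure in the table.&lt;/p&gt;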

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3&lt;/a&gt;, we'll add event-driven security alerting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONTAP Autonomous Ransomware Protection (ARP) detection&lt;/li&gt;
&lt;li&gt;EMS webhook → API Gateway → Lambda → Datadog&lt;/li&gt;
&lt;li&gt;Datadog Monitor configuration for instant alerts&lt;/li&gt;
&lt;li&gt;Incident response workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Datadog is the first E2E-verified integration in this pattern library; the same structure will be used for the remaining vendor integrations as they are validated.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about the Datadog integration? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Part 1 — Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Next: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>datadog</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 09:14:20 +0000</pubDate>
      <link>https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod</link>
      <guid>https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;FSx for ONTAP file access audit logs are usually consumed through EC2-based patterns — mounted audit volumes and agent-based forwarders such as Splunk Universal Forwarder. This series explores an EC2-free alternative: configure ONTAP to write audit logs to an audit volume, expose that volume through an FSx for ONTAP S3 Access Point, use EventBridge Scheduler to invoke Lambda, and ship normalized events to observability platforms such as Datadog, Splunk, New Relic, Grafana Cloud, Elastic, and OpenTelemetry-compatible backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Post Covers
&lt;/h2&gt;

&lt;p&gt;This post introduces the architecture and the open-source pattern library. It does not yet cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full Datadog deployment walkthrough (Part 2)&lt;/li&gt;
&lt;li&gt;Vendor-specific field mappings&lt;/li&gt;
&lt;li&gt;Cost/performance benchmarking&lt;/li&gt;
&lt;li&gt;ARP + EMS webhook + Datadog alerting (Part 3)&lt;/li&gt;
&lt;li&gt;FPolicy binary protocol internals (future post)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;You're running Amazon FSx for NetApp ONTAP. You've enabled file access auditing because compliance requires it — or because you genuinely want to know who's accessing what on your file shares.&lt;/p&gt;

&lt;p&gt;But where do those audit logs go?&lt;/p&gt;

&lt;p&gt;If you followed the &lt;a href="https://aws.amazon.com/blogs/storage/auditing-user-and-administrative-actions-on-amazon-fsx-for-netapp-ontap-using-splunk/" rel="noopener noreferrer"&gt;official AWS blog post&lt;/a&gt;, you likely ended up with EC2-based collectors: syslog-ng for cluster/admin audit forwarding, and a mounted audit volume plus Splunk Universal Forwarder for file access audit logs. It works. But now you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 instances to patch and maintain&lt;/li&gt;
&lt;li&gt;NFS mounts to the audit volume&lt;/li&gt;
&lt;li&gt;syslog-ng configuration for admin audit forwarding&lt;/li&gt;
&lt;li&gt;Splunk Universal Forwarder configuration for file access logs&lt;/li&gt;
&lt;li&gt;A single point of failure unless you build your own HA pattern&lt;/li&gt;
&lt;li&gt;Vendor lock-in to Splunk's agent-based model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What if you could replace that EC2-based collector pattern with managed services — Lambda reads audit logs via S3 APIs, no NFS mount required — and ship to &lt;em&gt;any&lt;/em&gt; observability platform?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the goal of this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Distinction: Two Types of ONTAP Audit
&lt;/h2&gt;

&lt;p&gt;Before diving in, a clarification. FSx for ONTAP has two distinct audit mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cluster/admin activity audit logs&lt;/strong&gt;: administrative operations (CLI/API commands). These are forwarded via syslog to a log destination, as described in the &lt;a href="https://aws.amazon.com/blogs/storage/auditing-user-and-administrative-actions-on-amazon-fsx-for-netapp-ontap-using-splunk/" rel="noopener noreferrer"&gt;AWS blog&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File access audit logs&lt;/strong&gt;: SMB/NFS file operations (open, read, write, delete, permission changes). These are recorded according to ONTAP audit policies and SACLs/NFSv4 ACLs, and stored on an audit volume inside the SVM in EVTX or XML format, depending on your ONTAP audit configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In this series, "audit logs" refers to file access audit logs (type 2).&lt;/strong&gt; The cluster admin audit forwarding via syslog is a separate concern.&lt;/p&gt;

&lt;h2&gt;
  
  
  The EC2-Free Alternative
&lt;/h2&gt;

&lt;p&gt;I'm building an open-source pattern library that targets 9 observability vendors using Lambda, EventBridge Scheduler, and ECS Fargate — eliminating the need for self-managed EC2 instances.&lt;/p&gt;

&lt;p&gt;This is EC2-free, not necessarily Lambda-only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;audit-log and EMS paths&lt;/strong&gt; are Lambda patterns (scheduled and event-driven respectively).&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;FPolicy path&lt;/strong&gt; uses ECS Fargate because ONTAP FPolicy requires a persistent TCP listener.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Audit Logs Flow
&lt;/h2&gt;

&lt;p&gt;ONTAP's file access auditing writes rotated audit log files to a configured destination path inside the SVM. In this project, that destination is an audit volume exposed through an FSx for ONTAP S3 Access Point. Lambda does not mount NFS or SMB; it reads the rotated audit log files through S3 APIs.&lt;/p&gt;

&lt;p&gt;Because this pattern does not rely on S3 ObjectCreated events from FSx for ONTAP S3 Access Points, the audit processor is invoked on a schedule and uses checkpointing to process only newly rotated log files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FSx for ONTAP audit configuration (`vserver audit`)
    │
    ▼ audit logs written to /audit volume
Audit volume exposed via FSx for ONTAP S3 Access Point
    │
    ▼ EventBridge Scheduler (periodic invocation)
Lambda audit processor (Python 3.12)
    │
    ▼ parse EVTX/XML → normalize → vendor API
Datadog / Splunk / New Relic / Grafana / Elastic / ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key shift from the EC2 pattern: &lt;strong&gt;Lambda does not mount the audit volume over NFS or SMB. It reads rotated ONTAP audit log files through an FSx for ONTAP S3 Access Point using S3 APIs, while the data itself remains on the FSx for ONTAP file system.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A note on FSx for ONTAP S3 Access Points&lt;/strong&gt;: FSx for ONTAP S3 Access Points let applications use S3 APIs to access data that still resides on FSx for ONTAP volumes. They are excellent as a serverless access boundary, but they are not the same as standard S3 buckets. In particular, you should not rely on S3 ObjectCreated notifications from an FSx for ONTAP S3 Access Point. Instead, this project uses EventBridge Scheduler plus checkpointing to discover and process newly rotated audit log files.&lt;/p&gt;
&lt;/blockquote&gt;
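
&lt;p&gt;The scheduler-plus-checkpoint idea can be sketched in a few lines. This is a simplified illustration, not the project's code: &lt;code&gt;select_new_files&lt;/code&gt; and &lt;code&gt;process_new_files&lt;/code&gt; are hypothetical names, the checkpoint is assumed to be the key of the last successfully shipped file, and rotated file names are assumed to sort chronologically.&lt;/p&gt;

```python
def select_new_files(keys, checkpoint):
    """Return keys that sort after the checkpoint, oldest first."""
    return sorted(key for key in keys if checkpoint is None or key > checkpoint)


def process_new_files(keys, checkpoint, ship):
    """Ship new files in order; advance the checkpoint only after each success."""
    for key in select_new_files(keys, checkpoint):
        if not ship(key):
            break  # keep the old checkpoint so this file is retried next run
        checkpoint = key
    return checkpoint
```

&lt;p&gt;Here &lt;code&gt;keys&lt;/code&gt; would come from listing objects through the access point and &lt;code&gt;ship&lt;/code&gt; would parse and deliver one file. Stopping at the first failure keeps ordering simple at the cost of delaying later files until the next scheduled run.&lt;/p&gt;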

&lt;h2&gt;
  
  
  Three Event Sources, One Architecture
&lt;/h2&gt;

&lt;p&gt;FSx for ONTAP generates observability data through three distinct channels:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. File Access Audit Logs (FSx for ONTAP S3 AP)
&lt;/h3&gt;

&lt;p&gt;Depending on your ONTAP audit configuration and SACL/NFSv4 ACL settings, file operations such as create, delete, read, write, and permission changes can be recorded as ONTAP audit logs in EVTX or XML format.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delivery&lt;/strong&gt;: ONTAP writes rotated audit log files to an audit volume inside the SVM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access path&lt;/strong&gt;: Lambda reads those files through an FSx for ONTAP S3 Access Point&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt;: EventBridge Scheduler invokes Lambda periodically; Lambda uses checkpointing to process newly rotated files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt;: Lambda (scheduled, pay-per-invocation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Near-real-time rather than sub-second streaming. End-to-end latency depends on your ONTAP audit log rotation interval and the EventBridge Scheduler frequency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: Compliance auditing, access pattern analysis, data governance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. EMS (Event Management System) Webhooks
&lt;/h3&gt;

&lt;p&gt;ONTAP's built-in event system can push critical alerts via HTTP webhooks. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Ransomware Protection (ARP)&lt;/strong&gt; alerts — ONTAP detects encryption patterns and fires an event&lt;/li&gt;
&lt;li&gt;Quota threshold violations&lt;/li&gt;
&lt;li&gt;Hardware failures&lt;/li&gt;
&lt;li&gt;Replication issues&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delivery&lt;/strong&gt;: ONTAP pushes an HTTPS webhook to API Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt;: API Gateway invocation (event-driven)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt;: Lambda (behind API Gateway)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: Security alerting, operational monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. FPolicy (File Policy) Events
&lt;/h3&gt;

&lt;p&gt;FPolicy intercepts file operations at the protocol level (CIFS/NFS) and forwards them in real time over a proprietary TCP protocol. Unlike the other two sources, FPolicy requires a persistent TCP listener, which is why this path uses &lt;strong&gt;ECS Fargate&lt;/strong&gt; rather than Lambda.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delivery&lt;/strong&gt;: ONTAP connects to Fargate task via TCP:9898&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt;: Fargate receives FPolicy events → enqueues to SQS → Lambda processes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt;: ECS Fargate (TCP listener) + Lambda (vendor shipping)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: File activity monitoring, DLP, suspicious behavior detection&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The FPolicy path is the one exception to the "pure Lambda" model. ONTAP's FPolicy protocol is a proprietary binary format over TCP — it cannot be received by API Gateway or Lambda directly. Fargate handles the protocol translation, then hands off to Lambda via SQS for the vendor-specific shipping. It's still EC2-free, but not entirely serverless in the strictest sense.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Each event source feeds into the same delivery pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    FSx for ONTAP                                │
├──────────────┬──────────────────────┬───────────────────────────┤
│ File Access  │   EMS Webhook        │   FPolicy (TCP:9898)      │
│ Audit Logs   │                      │                           │
└──────┬───────┴──────────┬───────────┴───────────┬───────────────┘
       │                  │                       │
       ▼                  ▼                       ▼
  FSx S3 AP +        API Gateway            ECS Fargate
  Scheduler               │                       │
       │                  ▼                       ▼
       ▼             Lambda (EMS)           SQS → Lambda
  Lambda (parser)         │                       │
       │                  │                       │
       └──────────────────┼───────────────────────┘
                          ▼
              Observability Vendor API
              (Datadog, Splunk, New Relic, ...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each integration packages the parser and vendor shipper together in a single Lambda, but the pattern is the same: normalize ONTAP events, then send them to the vendor API. Swap the integration Lambda, and you switch vendors. Vendor-specific Lambdas are optimized for quick adoption and native API behavior, while the OpenTelemetry integration provides a vendor-neutral path for organizations standardizing on OTLP.&lt;/p&gt;
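
&lt;p&gt;The "normalize, then ship" split can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the &lt;code&gt;AuditEvent&lt;/code&gt; fields, the raw record keys, and the &lt;code&gt;service&lt;/code&gt; and &lt;code&gt;sourcetype&lt;/code&gt; values are assumptions made for the example. Only the payload shapes (a JSON array for the Datadog Logs intake, newline-separated &lt;code&gt;event&lt;/code&gt; objects for Splunk HEC) follow the vendors' documented formats.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEvent:
    """Vendor-neutral view of one audit record (field names are illustrative)."""
    timestamp: str
    user: str
    operation: str
    path: str


def normalize(raw: dict):
    """Map one parsed record to AuditEvent; absent keys fall back to 'unknown'."""
    return AuditEvent(
        timestamp=raw.get("timestamp", "unknown"),
        user=raw.get("user", "unknown"),
        operation=raw.get("operation", "unknown"),
        path=raw.get("path", "unknown"),
    )


def datadog_body(events):
    """Build a JSON array body in the shape the Datadog Logs intake accepts."""
    return json.dumps([
        {"ddsource": "fsxn", "service": "fsxn-audit", "message": json.dumps(asdict(e))}
        for e in events
    ])


def splunk_hec_body(events):
    """Build a Splunk HEC batch: one JSON object per event, newline-separated."""
    return "\n".join(
        json.dumps({"sourcetype": "fsxn:audit", "event": asdict(e)}) for e in events
    )


def ship(raw_records, build_body):
    """Normalize every record, then hand the batch to a vendor-specific builder."""
    return build_body([normalize(r) for r in raw_records])
```

&lt;p&gt;Switching vendors in this sketch means passing a different builder to &lt;code&gt;ship&lt;/code&gt;; the normalization step stays untouched.&lt;/p&gt;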

&lt;h2&gt;
  
  
  The Gotcha That Cost Me a Day
&lt;/h2&gt;

&lt;p&gt;Here's something that isn't immediately obvious from the documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In my validation, a Lambda function placed in a VPC with only an S3 Gateway Endpoint could not read from the FSx for ONTAP S3 Access Point and timed out. Adding NAT Gateway egress resolved the issue.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This gotcha matters because this project intentionally reads audit logs through FSx for ONTAP S3 Access Points rather than mounting the audit volume over NFS/SMB from an EC2 instance.&lt;/p&gt;

&lt;p&gt;Tested with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda in private subnets (ap-northeast-1)&lt;/li&gt;
&lt;li&gt;FSx for ONTAP S3 Access Point attached to an FSx volume&lt;/li&gt;
&lt;li&gt;S3 Gateway VPC Endpoint only&lt;/li&gt;
&lt;li&gt;No NAT Gateway&lt;/li&gt;
&lt;li&gt;Failure mode: timeout (no response, not AccessDenied)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lambda Placement&lt;/th&gt;
&lt;th&gt;FSx for ONTAP S3 AP Access&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Outside VPC&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;td&gt;Simplest for read-only access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC + NAT Gateway&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;td&gt;Production recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC + S3 Gateway EP only&lt;/td&gt;
&lt;td&gt;❌ Timeout&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Not recommended based on this validation&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This is based on my validation environment (ap-northeast-1). Always test the network path in your own account and Region, as AWS may update this behavior.&lt;/p&gt;
&lt;/blockquote&gt;
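
&lt;p&gt;Because the failure shows up as a timeout rather than an authorization error, a quick TCP reachability check from inside the Lambda can help separate network-path problems from IAM problems. The helper below is a hypothetical sketch (the name &lt;code&gt;can_reach&lt;/code&gt; and its defaults are mine); it only tests whether a TCP connection opens and says nothing about permissions.&lt;/p&gt;

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures alike.
        return False
```

&lt;p&gt;Calling it early in the handler with the hostname your SDK resolves for the access point, and logging the result, makes a missing egress path visible within seconds instead of at the Lambda timeout.&lt;/p&gt;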

&lt;h2&gt;
  
  
  Target Vendors
&lt;/h2&gt;

&lt;p&gt;The project targets 9 observability platforms. &lt;strong&gt;Datadog is fully verified end-to-end&lt;/strong&gt; (the subject of Parts 2 and 3 of this series). The remaining vendors have initial implementations that I'll be verifying and writing about in upcoming posts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vendor&lt;/th&gt;
&lt;th&gt;Delivery Method&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logs API v2&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;E2E verified&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Splunk&lt;/td&gt;
&lt;td&gt;HEC (HTTP Event Collector)&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Relic&lt;/td&gt;
&lt;td&gt;Log API v1&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Cloud&lt;/td&gt;
&lt;td&gt;Loki Push API&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elastic&lt;/td&gt;
&lt;td&gt;Bulk API&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynatrace&lt;/td&gt;
&lt;td&gt;Log Ingest API v2&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sumo Logic&lt;/td&gt;
&lt;td&gt;HTTP Source&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honeycomb&lt;/td&gt;
&lt;td&gt;Events Batch API&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenTelemetry&lt;/td&gt;
&lt;td&gt;OTLP/HTTP (vendor-neutral)&lt;/td&gt;
&lt;td&gt;🧪 Implementation ready, verification planned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Status definitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;E2E verified&lt;/strong&gt; — Deployed and validated with real FSx for ONTAP audit logs&lt;/li&gt;
&lt;li&gt;🧪 &lt;strong&gt;Implementation ready&lt;/strong&gt; — Code and CloudFormation available; E2E validation pending&lt;/li&gt;
&lt;li&gt;🚧 &lt;strong&gt;Planned&lt;/strong&gt; — Design exists; implementation pending&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each vendor integration is designed as a self-contained CloudFormation stack with its own Lambda, IAM roles, DLQ, and CloudWatch alarms. As I verify each one, I'll publish a dedicated article with the results and any vendor-specific gotchas I encounter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the Repo
&lt;/h2&gt;

&lt;p&gt;The project is structured for easy adoption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;fsxn-observability-integrations/&lt;/span&gt;
&lt;span class="s"&gt;├── integrations/&lt;/span&gt;
&lt;span class="s"&gt;│   ├── datadog/&lt;/span&gt;           &lt;span class="c1"&gt;# ✅ Verified: Lambda + CFn + tests + docs&lt;/span&gt;
&lt;span class="s"&gt;│   ├── splunk-serverless/&lt;/span&gt; &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   ├── new-relic/&lt;/span&gt;         &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   ├── grafana/&lt;/span&gt;           &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   ├── elastic/&lt;/span&gt;           &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   ├── dynatrace/&lt;/span&gt;         &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   ├── sumo-logic/&lt;/span&gt;        &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   ├── honeycomb/&lt;/span&gt;         &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;│   └── otel-collector/&lt;/span&gt;    &lt;span class="c1"&gt;# 🧪 Implementation ready&lt;/span&gt;
&lt;span class="s"&gt;├── shared/&lt;/span&gt;
&lt;span class="s"&gt;│   ├── lambda-layers/&lt;/span&gt;     &lt;span class="c1"&gt;# Reusable log parser (EVTX/XML) + S3 AP reader&lt;/span&gt;
&lt;span class="s"&gt;│   ├── templates/&lt;/span&gt;         &lt;span class="c1"&gt;# Prerequisites CFn (EventBridge Scheduler, IAM)&lt;/span&gt;
&lt;span class="s"&gt;│   └── scripts/&lt;/span&gt;           &lt;span class="c1"&gt;# Deploy + test utilities&lt;/span&gt;
&lt;span class="s"&gt;└── docs/&lt;/span&gt;                  &lt;span class="c1"&gt;# Bilingual (EN/JA) documentation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shared infrastructure (EventBridge Scheduler, log parser layer, IAM roles) is vendor-agnostic and already proven through the Datadog verification. Each vendor directory follows the same structure, so once you understand one, you understand them all. Each stack is designed to include DLQ, CloudWatch alarms, and operational visibility out of the box; the Datadog stack also includes the verified CloudWatch operational dashboard used during E2E validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;p&gt;If you've been following my FSx for ONTAP S3 Access Points series, this project builds directly on those foundations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;FSx for ONTAP S3 Access Points as a Serverless Automation Boundary&lt;/a&gt; — Where this journey started: using S3 APs as the bridge between ONTAP and serverless&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Production-Ready FPolicy Event Pipeline Across 17 UCs — Phase 11&lt;/a&gt; — The FPolicy pipeline that feeds into this observability project&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/near-real-time-processing-ml-inference-and-observability-for-fsx-for-ontap-s3-access-points--bkd"&gt;Near-Real-Time Processing, ML Inference, and Observability — Phase 3&lt;/a&gt; — Early architecture patterns that evolved into this multi-vendor approach&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This observability integrations project is the natural next step: taking those serverless patterns and applying them specifically to audit log shipping across multiple vendors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Considerations
&lt;/h2&gt;

&lt;p&gt;Based on early feedback, here are key points for different audiences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design philosophy&lt;/strong&gt;: The goal is not just to remove EC2. The goal is to move undifferentiated collector operations into managed services, make failures observable and replayable, and keep the integration layer small enough for customers to operate themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where this pattern matters&lt;/strong&gt;: This pattern is especially useful for enterprise file workloads where auditability matters but EC2-based collectors add operational overhead: departmental file shares; interface directories and file shares adjacent to enterprise applications such as SAP, Oracle, or SQL Server; VDI/EUC home directories; engineering and design repositories; regulated file repositories; and ransomware investigation workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-intrusive by design&lt;/strong&gt;: This pipeline observes audit logs after ONTAP records them; it does not sit in the application data path. NFS/SMB access patterns are unchanged. No application code changes are required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telemetry ownership&lt;/strong&gt;: This pattern treats ONTAP as the authoritative source of file activity telemetry, while AWS managed services provide the event processing and delivery layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance note&lt;/strong&gt;: This pattern helps centralize and analyze audit events, but retention, immutability, and regulatory controls should be designed according to your organization's compliance requirements. This is an audit log &lt;em&gt;delivery&lt;/em&gt; pattern, not a compliance certification. For audit evidence, consider separately how long raw EVTX/XML files should be retained on the audit volume or archived outside the observability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit policy dependency&lt;/strong&gt;: The quality and volume of events depend heavily on your ONTAP audit policy, SACLs, NFSv4 ACLs, and rotation interval. Enabling read auditing on high-traffic volumes can produce significant log volume — design your audit policy carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost variables&lt;/strong&gt;: The biggest cost factors are audit event volume, log rotation frequency, EventBridge Scheduler frequency, Lambda runtime, NAT Gateway usage (if Lambda runs in a VPC), and vendor ingest pricing. Compared to the EC2 pattern, you trade always-on instance cost for pay-per-invocation compute and vendor-ingest-driven cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-account deployment&lt;/strong&gt;: This pattern can be deployed per workload account or centralized into a logging/security account, depending on your organization's landing zone design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: The stack includes a DLQ for failed events, CloudWatch alarms for error and throttle detection, and checkpointing to avoid reprocessing audit log files that have already been completed. Delivery to external vendor APIs should be treated as at-least-once, so downstream queries should tolerate occasional duplicates; DLQ messages can be replayed after resolving the root cause.&lt;/p&gt;
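
&lt;p&gt;To make the checkpointing and at-least-once behavior concrete, here is a minimal sketch. It assumes the checkpoint is a set of already-shipped object keys and that &lt;code&gt;ship&lt;/code&gt; raises on failure; the function name and the in-memory set are illustrative stand-ins, not the stack's actual checkpoint storage.&lt;/p&gt;

```python
def process_new_files(keys, done, ship):
    """Ship each unprocessed key; mark it done only after ship() succeeds.

    Returns a list of (key, error) pairs that a caller could route to a DLQ.
    """
    failed = []
    for key in sorted(keys):
        if key in done:
            continue  # already checkpointed on an earlier run
        try:
            ship(key)
        except Exception as exc:
            # Leave the key out of the checkpoint so a later run retries it.
            failed.append((key, str(exc)))
            continue
        done.add(key)
    return failed
```

&lt;p&gt;Recording the checkpoint only after a successful send is what makes delivery at-least-once: a crash between the send and the checkpoint write results in a resend, never a silent drop.&lt;/p&gt;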

&lt;h2&gt;
  
  
  What's Coming Next
&lt;/h2&gt;

&lt;p&gt;This is Part 1 of a series. In the upcoming posts, I'll deep-dive into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: Implementing the Datadog integration end-to-end — from CloudFormation to seeing logs in the Datadog Log Explorer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Event-driven ransomware detection using ONTAP's Autonomous Ransomware Protection (ARP) + EMS webhooks + Datadog alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond this Datadog series, I'll be verifying and writing about each vendor integration as I go:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replacing the EC2-based Splunk pattern with Lambda + HEC&lt;/li&gt;
&lt;li&gt;OpenTelemetry as the vendor-neutral escape hatch&lt;/li&gt;
&lt;li&gt;Grafana Cloud + Loki for the open-source stack&lt;/li&gt;
&lt;li&gt;And more — each with its own E2E verification and lessons learned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to build a comprehensive, battle-tested pattern library where you can pick your vendor and deploy with confidence. Follow along as I work through each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The Datadog integration is fully verified and ready to deploy. You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An FSx for ONTAP file system with audit logging enabled&lt;/li&gt;
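&lt;li&gt;An FSx for ONTAP S3 Access Point attached to the audit volume&lt;/li&gt;
&lt;li&gt;A secret containing your Datadog API key (its ARN is passed as &lt;code&gt;DatadogApiKeySecretArn&lt;/code&gt;)&lt;/li&gt;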
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
&lt;span class="nb"&gt;cd &lt;/span&gt;fsxn-observability-integrations

&lt;span class="c"&gt;# Deploy Datadog integration&lt;/span&gt;
&lt;span class="c"&gt;# (FsxS3AccessPointArn = your FSx for ONTAP S3 Access Point ARN)&lt;/span&gt;
aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;FsxS3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-fsx-s3-ap-arn&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-secret-arn&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stack deploys the scheduled Lambda processor, IAM permissions for reading from the FSx for ONTAP S3 Access Point, checkpoint storage, DLQ, CloudWatch alarms, and the Datadog shipping logic. The processor keeps track of already-processed audit log files so each scheduled invocation only ships newly rotated logs.&lt;/p&gt;

&lt;p&gt;After deployment, you should see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EventBridge Scheduler invoking the Lambda processor on your configured interval&lt;/li&gt;
&lt;li&gt;Checkpoint storage updated after processing rotated audit logs&lt;/li&gt;
&lt;li&gt;Parsed FSx for ONTAP audit events arriving in Datadog Logs (&lt;code&gt;source:fsxn&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;CloudWatch alarms and DLQ ready for operational visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full setup guide is in the repo's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/prerequisites.md" rel="noopener noreferrer"&gt;Prerequisites doc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are starting from the repository today, begin with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose Your Path&lt;/li&gt;
&lt;li&gt;Recommended first 30 minutes&lt;/li&gt;
&lt;li&gt;Try with Sample Data&lt;/li&gt;
&lt;li&gt;PoC Success Criteria&lt;/li&gt;
&lt;li&gt;Production Readiness Levels&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Have questions or want to see a specific vendor integration verified next? Drop a comment below — it'll help me prioritize the series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Shipping FSx for ONTAP Logs to Datadog — The Serverless Way&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>observability</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>Production-Ready FPolicy Event Pipeline Across 17 UCs — FSx for ONTAP S3 Access Points, Phase 11</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Fri, 15 May 2026 07:52:56 +0000</pubDate>
      <link>https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8</link>
      <guid>https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Phase 11 is the production-integration phase: the Phase 10 FPolicy event-ingestion pipeline is now connected to all 17 use-case (UC) templates, with operational guardrails for persistence, deduplication, observability, and future migration to native S3 Access Point (S3AP) notifications.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Phase 11&lt;/strong&gt; of the FSx for ONTAP S3AP serverless pattern library. Building on &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt;, Phase 11 delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TriggerMode across all 17 UCs&lt;/strong&gt;: Every UC template now supports POLLING / EVENT_DRIVEN / HYBRID switching via a single CloudFormation parameter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UC-specific EventBridge dispatch rules&lt;/strong&gt;: File path prefix + extension filters route FPolicy events to the correct UC's Step Functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf format evaluation&lt;/strong&gt;: Real-world test on ONTAP 9.17.1P6 — confirmed format switching works, discovered TCP framing difference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Account Observability&lt;/strong&gt;: OAM Sink + Dashboard + SNS + X-Ray deployed and verified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store&lt;/strong&gt;: Configured on ONTAP via REST API — closing the tested Fargate restart event-loss window at the configuration layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency Store&lt;/strong&gt;: DynamoDB table + checker Lambda for HYBRID mode deduplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FR-2 migration path&lt;/strong&gt;: Three-phase design for transitioning to S3AP native notifications when available (FR-2 refers to the feature-request track for native bucket-notification-style support on FSx ONTAP S3 Access Points)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production adoption guidance&lt;/strong&gt;: Rollout/rollback, governance, security guardrails, event payload sensitivity, file-readiness patterns, operational alarms, and Persistent Store sizing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 17 UCs span compliance, financial document processing (IDP), manufacturing analytics, healthcare imaging, media/VFX, genomics, logistics, retail, autonomous driving, semiconductor EDA, energy/seismic, education/research, defense/satellite, government archives, smart-city geospatial, insurance claims, and construction BIM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In short&lt;/strong&gt;: Phase 10 built the shared event-ingestion pipeline. Phase 11 wires it into every UC, adds the operational infrastructure for production (Persistent Store, Idempotency, Observability), and documents the forward migration path. Tests: 435 passed, 3 skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. TriggerMode: Three-Mode Integration Across All 17 UCs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 10 introduced &lt;code&gt;TriggerMode&lt;/code&gt; as a reference implementation in UC1 (legal-compliance). The remaining 16 UCs still only supported polling. Operators needed a uniform way to switch any UC between polling, event-driven, and hybrid modes without template surgery.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Every UC template now includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;TriggerMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POLLING"&lt;/span&gt;
    &lt;span class="na"&gt;AllowedValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POLLING"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_DRIVEN"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HYBRID"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;FPolicyEventBusName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fsxn-fpolicy-events"&lt;/span&gt;

&lt;span class="na"&gt;Conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;IsPollingOrHybrid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;!Or&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;!Condition&lt;/span&gt; &lt;span class="nv"&gt;IsPolling&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!Condition&lt;/span&gt; &lt;span class="nv"&gt;IsHybrid&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;IsEventDrivenOrHybrid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;!Or&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;!Condition&lt;/span&gt; &lt;span class="nv"&gt;IsEventDriven&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!Condition&lt;/span&gt; &lt;span class="nv"&gt;IsHybrid&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EventBridge Scheduler and its IAM role use &lt;code&gt;Condition: IsPollingOrHybrid&lt;/code&gt;. The FPolicy EventBridge Rule and its IAM role use &lt;code&gt;Condition: IsEventDrivenOrHybrid&lt;/code&gt;. Default &lt;code&gt;POLLING&lt;/code&gt; means zero impact on existing deployments — the parameter is purely additive.&lt;/p&gt;
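
&lt;p&gt;The effect of the two conditions can be summarized as a small truth table. The Python sketch below mirrors the condition names from the template; the function and the dictionary keys are illustrative only.&lt;/p&gt;

```python
MODES = ("POLLING", "EVENT_DRIVEN", "HYBRID")


def enabled_triggers(mode: str):
    """Return which trigger resources a given TriggerMode would create."""
    if mode not in MODES:
        raise ValueError(f"unknown TriggerMode: {mode}")
    return {
        # Mirrors Condition: IsPollingOrHybrid
        "scheduler": mode in ("POLLING", "HYBRID"),
        # Mirrors Condition: IsEventDrivenOrHybrid
        "fpolicy_rule": mode in ("EVENT_DRIVEN", "HYBRID"),
    }
```

&lt;p&gt;With the default &lt;code&gt;POLLING&lt;/code&gt;, only the scheduler side is created, which is why existing deployments see no change.&lt;/p&gt;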

&lt;h3&gt;
  
  
  Validation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CloudFormation &lt;code&gt;validate-template&lt;/code&gt;: 17/17 PASS&lt;/li&gt;
&lt;li&gt;cfn_yaml parse: 17/17 PASS&lt;/li&gt;
&lt;li&gt;SchedulerRole + Schedule condition alignment: 14/14 verified&lt;/li&gt;
&lt;li&gt;Test suite: 435 passed, 3 skipped, 0 failed&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. UC-Specific EventBridge Dispatch Rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EventBridge Custom Bus (fsxn-fpolicy-events)
  │
  ├── UC1 Rule: prefix=/legal/ OR suffix=.pdf,.docx,.xlsx
  │     → ComplianceStateMachine
  │
  ├── UC2 Rule: prefix=/finance/ OR suffix=.pdf,.tiff,.png,.jpg
  │     → IdpStateMachine
  │
  ├── UC3 Rule: prefix=/manufacturing/ OR suffix=.csv,.json,.parquet
  │     → ManufacturingStateMachine
  │
  │   ... (13 more UCs)
  │
  └── UC17 Rule: prefix=/smartcity/ OR suffix=.geojson,.shp,.tiff,.las
        → DiscoveryFunction (Lambda)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Multiple rules can match the same event; EventBridge fan-out is expected behavior. See the Live E2E verification below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As the number of UCs grows, routing should be treated as configuration data and used to generate both EventBridge rules and routing tests to prevent drift. The routing definitions documented in &lt;code&gt;docs/guides/fpolicy-uc-routing.md&lt;/code&gt; are treated as the source of truth, and &lt;code&gt;scripts/add_eventbridge_rules.py&lt;/code&gt; keeps generated EventBridge rules aligned with that routing model.&lt;/p&gt;

&lt;p&gt;Each UC's EventBridge Rule filters on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;detail.file_path&lt;/code&gt;&lt;/strong&gt;: prefix (directory) and suffix (extension) matchers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;detail.operation_type&lt;/code&gt;&lt;/strong&gt;: create, write, rename, delete (UC-specific subset)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EventBridge evaluates prefix and suffix within the same array as OR — a file matching &lt;em&gt;any&lt;/em&gt; prefix or &lt;em&gt;any&lt;/em&gt; suffix triggers the rule. The relationship between &lt;code&gt;operation_type&lt;/code&gt; and &lt;code&gt;file_path&lt;/code&gt; is AND — both must match.&lt;/p&gt;
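
&lt;p&gt;A short Python sketch makes the OR/AND semantics easier to reason about. &lt;code&gt;build_pattern&lt;/code&gt; produces an event pattern using EventBridge's &lt;code&gt;prefix&lt;/code&gt; and &lt;code&gt;suffix&lt;/code&gt; matchers, and &lt;code&gt;matches&lt;/code&gt; is a local approximation of how such a pattern is evaluated for these two fields only (it is not EventBridge's implementation). The prefixes and suffixes are taken from the diagram above; the operation list is an assumption.&lt;/p&gt;

```python
def build_pattern(prefixes, suffixes, operations):
    """Build an EventBridge event pattern for one UC rule."""
    matchers = [{"prefix": p} for p in prefixes] + [{"suffix": s} for s in suffixes]
    return {"detail": {"file_path": matchers, "operation_type": list(operations)}}


def matches(pattern, detail):
    """Approximate EventBridge matching: OR inside an array, AND across fields."""
    path = detail["file_path"]
    path_ok = any(
        path.startswith(m["prefix"]) if "prefix" in m else path.endswith(m["suffix"])
        for m in pattern["detail"]["file_path"]
    )
    op_ok = detail["operation_type"] in pattern["detail"]["operation_type"]
    return path_ok and op_ok


OPS = ["create", "write"]  # assumed subset of operations
RULES = {
    "legal-compliance": build_pattern(["/legal/"], [".pdf", ".docx", ".xlsx"], OPS),
    "financial-idp": build_pattern(["/finance/"], [".pdf", ".tiff", ".png", ".jpg"], OPS),
    "manufacturing": build_pattern(["/manufacturing/"], [".csv", ".json", ".parquet"], OPS),
}


def matched_rules(detail):
    """Return the names of every rule whose pattern matches the event detail."""
    return [name for name, pattern in RULES.items() if matches(pattern, detail)]
```

&lt;p&gt;For example, an event for &lt;code&gt;/legal/audit/report.pdf&lt;/code&gt; matches both &lt;code&gt;legal-compliance&lt;/code&gt; (by prefix) and &lt;code&gt;financial-idp&lt;/code&gt; (by suffix), while the same path with an operation outside &lt;code&gt;OPS&lt;/code&gt; matches nothing.&lt;/p&gt;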

&lt;h3&gt;
  
  
  Fan-out behavior
&lt;/h3&gt;

&lt;p&gt;When multiple rules match the same event, EventBridge delivers to all matching targets. This is by design — a &lt;code&gt;.json&lt;/code&gt; file in &lt;code&gt;/manufacturing/sensors/&lt;/code&gt; could trigger both UC3 (manufacturing) and UC11 (autonomous-driving) if both monitor &lt;code&gt;.json&lt;/code&gt; files. Prefix design should minimize unintended fan-out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Live E2E verification
&lt;/h3&gt;

&lt;p&gt;We verified dispatch routing by sending test events directly to the custom bus via &lt;code&gt;aws events put-events&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Event&lt;/th&gt;
&lt;th&gt;file_path&lt;/th&gt;
&lt;th&gt;Matched Rules&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-legal-01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/legal/audit/report.pdf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;legal-compliance ✅ + financial-idp ✅&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fan-out: 2 rules matched&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-finance-01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/finance/contracts/deal.tiff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;financial-idp ✅&lt;/td&gt;
&lt;td&gt;1 rule matched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-mfg-01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/manufacturing/iot/sensor-001.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;manufacturing ✅&lt;/td&gt;
&lt;td&gt;1 rule matched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-nomatch-01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/random/path/file.xyz&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Correctly dropped&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding&lt;/strong&gt;: &lt;code&gt;/legal/audit/report.pdf&lt;/code&gt; matched two rules — the legal-compliance rule (prefix &lt;code&gt;/legal/&lt;/code&gt;) AND the financial-idp rule (suffix &lt;code&gt;.pdf&lt;/code&gt;). This confirms the OR evaluation within the &lt;code&gt;file_path&lt;/code&gt; array and demonstrates fan-out behavior in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: The main routing lesson is simple: use path prefix as the primary ownership boundary, and treat suffix filters as supplementary hints. Generic suffixes such as &lt;code&gt;.pdf&lt;/code&gt;, &lt;code&gt;.json&lt;/code&gt;, and &lt;code&gt;.csv&lt;/code&gt; are useful for discovery, but they can create intentional or accidental fan-out across UCs. For strict single-UC routing, rely on prefix alone.&lt;/p&gt;
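
&lt;p&gt;One way to keep suffix-driven fan-out visible is a routing test that lists every suffix claimed by more than one rule. The sketch below uses the suffixes shown in the architecture diagram; the rule name &lt;code&gt;smart-city&lt;/code&gt; for UC17 and the helper itself are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

RULE_SUFFIXES = {
    "legal-compliance": [".pdf", ".docx", ".xlsx"],
    "financial-idp": [".pdf", ".tiff", ".png", ".jpg"],
    "manufacturing": [".csv", ".json", ".parquet"],
    "smart-city": [".geojson", ".shp", ".tiff", ".las"],  # hypothetical rule name
}


def shared_suffixes(rule_suffixes):
    """Return {suffix: [rules]} for every suffix that appears in several rules."""
    owners = defaultdict(list)
    for rule, suffixes in rule_suffixes.items():
        for suffix in suffixes:
            owners[suffix].append(rule)
    # Every suffix has at least one owner, so "not exactly one" means shared.
    return {s: rules for s, rules in owners.items() if len(rules) != 1}
```

&lt;p&gt;On this data the helper reports &lt;code&gt;.pdf&lt;/code&gt; and &lt;code&gt;.tiff&lt;/code&gt; as shared. A test can then assert that the shared set equals an explicitly approved list, so any new overlap has to be a deliberate decision.&lt;/p&gt;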

&lt;h3&gt;
  
  
  Routing documentation
&lt;/h3&gt;

&lt;p&gt;Full routing table with all 17 UCs, their prefixes, suffixes, and target operations is documented in &lt;code&gt;docs/guides/fpolicy-uc-routing.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9yl9vseneg6whek4o4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9yl9vseneg6whek4o4s.png" alt="EventBridge Custom Bus"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Protobuf Format Evaluation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;ONTAP 9.15.1+ supports protobuf as an alternative to XML for FPolicy notifications. The theoretical benefits are significant: ~35% message size reduction and faster parsing (with C extensions).&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;Phase 11 delivers a complete protobuf implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wire-format parser&lt;/strong&gt; (&lt;code&gt;shared/fpolicy-server/protobuf_parser.py&lt;/code&gt;): Pure Python decoder with zero external dependencies. No &lt;code&gt;protobuf&lt;/code&gt; package installation required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proto schema&lt;/strong&gt; (&lt;code&gt;shared/fpolicy-server/proto/fpolicy_notification.proto&lt;/code&gt;): 14-field FileOperationNotification message definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-detection&lt;/strong&gt;: &lt;code&gt;is_protobuf_format()&lt;/code&gt; distinguishes XML from protobuf by inspecting the first byte.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Server integration&lt;/strong&gt;: &lt;code&gt;FPOLICY_FORMAT&lt;/code&gt; environment variable switches between &lt;code&gt;xml&lt;/code&gt; and &lt;code&gt;protobuf&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
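
&lt;p&gt;For readers unfamiliar with the protobuf wire format, the following is a minimal, dependency-free sketch of the decoding technique, not the repository's parser. It handles only varint (wire type 0) and length-delimited (wire type 2) fields. The format check assumes an XML payload starts with an opening angle bracket (byte 0x3C) after optional whitespace; as a protobuf key, 0x3C would decode to field 7 with the deprecated end-group wire type, so a well-formed message is unlikely to begin with it. All function names are hypothetical. Arithmetic is used in place of bit operators so the listing contains no characters that need escaping in this feed.&lt;/p&gt;

```python
def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    value = 0
    factor = 1
    while True:
        byte = buf[pos]
        pos += 1
        value += (byte % 128) * factor  # low 7 bits carry the payload
        if byte // 128 == 0:            # high bit clear: this was the last byte
            return value, pos
        factor *= 128


def decode_fields(buf: bytes):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields = {}
    pos = 0
    while pos != len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = divmod(key, 8)  # key = field_number * 8 + wire_type
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            if len(value) != length:
                raise ValueError("truncated length-delimited field")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[number] = value
    return fields


def looks_like_xml(payload: bytes):
    """Heuristic: XML text begins with an opening angle bracket (0x3C)."""
    return payload.lstrip()[:1] == b"\x3c"
```

&lt;p&gt;The field numbers used when trying this out are arbitrary; mapping them to names requires the actual &lt;code&gt;.proto&lt;/code&gt; schema.&lt;/p&gt;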

&lt;h3&gt;
  
  
  Benchmark results (1000 events)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;XML (regex)&lt;/th&gt;
&lt;th&gt;protobuf (pure Python)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Message size (avg)&lt;/td&gt;
&lt;td&gt;220 bytes&lt;/td&gt;
&lt;td&gt;144 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size reduction&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parse time (1000 events)&lt;/td&gt;
&lt;td&gt;0.15 ms&lt;/td&gt;
&lt;td&gt;0.32 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parse speedup&lt;/td&gt;
&lt;td&gt;1.0x (baseline)&lt;/td&gt;
&lt;td&gt;0.47x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pure Python protobuf parser is slower than Python's C-optimized regex engine. The real benefit is message size reduction — 34.6% fewer bytes through SQS means lower costs and bandwidth. With the C-compiled &lt;code&gt;protobuf&lt;/code&gt; library, parsing speed is expected to improve significantly, but this should be re-benchmarked after the protobuf TCP framing layer is implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-world test: TCP framing discovery
&lt;/h3&gt;

&lt;p&gt;We switched the ONTAP FPolicy engine format to protobuf via REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PATCH /api/protocols/fpolicy/&lt;span class="o"&gt;{&lt;/span&gt;svm&lt;span class="o"&gt;}&lt;/span&gt;/engines/fpolicy_aws_engine
Body: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"format"&lt;/span&gt;: &lt;span class="s2"&gt;"protobuf"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ONTAP immediately sent protobuf NEGO_REQ messages. However, the FPolicy server logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[WARNING] Invalid message length: 53554736
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



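&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;: The value &lt;code&gt;53554736&lt;/code&gt; (0x03312E30) is protobuf field data being misinterpreted as the 4-byte frame length. Its bytes, &lt;code&gt;03 31 2E 30&lt;/code&gt;, decode as a length byte of 3 followed by the ASCII characters "1.0", which is consistent with a length-delimited protobuf string field rather than a frame header. This reveals that &lt;strong&gt;protobuf mode uses different TCP framing than XML mode&lt;/strong&gt;:&lt;/p&gt;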

&lt;ul&gt;
&lt;li&gt;XML mode: &lt;code&gt;"&lt;/code&gt; + 4-byte big-endian length + &lt;code&gt;"&lt;/code&gt; + payload&lt;/li&gt;
&lt;li&gt;protobuf mode: Different framing (possibly raw protobuf without the quote-delimited wrapper)&lt;/li&gt;
&lt;/ul&gt;
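&lt;p&gt;The XML-mode framing listed above can be sketched as a small header reader, which also shows how the logged length arises when the same reader is pointed at protobuf bytes. The function name &lt;code&gt;read_xml_frame&lt;/code&gt;, the &lt;code&gt;MAX_FRAME_BYTES&lt;/code&gt; limit, and the leading &lt;code&gt;0x0A&lt;/code&gt; tag byte in the example are assumptions; only the four bytes that produce 53554736 follow from the log line.&lt;/p&gt;

```python
QUOTE = 0x22  # ASCII double quote that wraps the length field in XML mode
MAX_FRAME_BYTES = 1024 * 1024  # assumed sanity limit, not an ONTAP value


def read_xml_frame(buf):
    """Parse one XML-mode frame: quote + 4-byte big-endian length + quote + payload."""
    if 6 > len(buf):
        raise ValueError("incomplete frame header")
    length = int.from_bytes(buf[1:5], "big")
    if buf[0] != QUOTE or buf[5] != QUOTE:
        raise ValueError(f"not XML-mode framing (length field would be {length})")
    if length > MAX_FRAME_BYTES or length > len(buf) - 6:
        raise ValueError(f"invalid message length: {length}")
    return bytes(buf[6:6 + length])


# A well-formed XML-mode frame round-trips.
payload = b"hello"
frame = b'"' + len(payload).to_bytes(4, "big") + b'"' + payload
print(read_xml_frame(frame))  # b'hello'

# Hypothetical protobuf bytes: tag, then length 3, then "1.0", then the next tag.
protobuf_like = bytes([0x0A, 0x03, 0x31, 0x2E, 0x30, 0x12])
print(int.from_bytes(protobuf_like[1:5], "big"))  # 53554736
```

&lt;p&gt;Checking the quote delimiters before trusting the length field turns a confusing length warning into an explicit framing mismatch.&lt;/p&gt;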

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;: The protobuf field-level parser is validated by the Phase 11 unit tests, and the size-reduction benefit is real. However, the live ONTAP test showed that protobuf mode does not use the same TCP framing path as XML mode. Per NetApp documentation, when the engine format is set to protobuf, "notification messages are encoded in binary form using Google Protobuf" and the FPolicy server must support protobuf deserialization. Phase 12 will focus on confirming the protobuf wire framing with NetApp and adapting the transport reader accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Cross-Account Observability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployed resources
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OAM Sink&lt;/td&gt;
&lt;td&gt;Receives metrics/traces from workload accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Dashboard&lt;/td&gt;
&lt;td&gt;Lambda Invocations/Errors, Step Functions Executions, Processing Latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNS Topic (KMS-encrypted)&lt;/td&gt;
&lt;td&gt;Aggregated alerts from all accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Ray Group&lt;/td&gt;
&lt;td&gt;Cross-account trace filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM MetricDeliveryRole&lt;/td&gt;
&lt;td&gt;Workload accounts assume this to push metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM TroubleshootingRole&lt;/td&gt;
&lt;td&gt;Read-only access for cross-account debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log Group&lt;/td&gt;
&lt;td&gt;Aggregated log destination&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Single-account limitation
&lt;/h3&gt;

&lt;p&gt;OAM Links cannot be created within the same account (AWS design constraint). The deployment was verified as a single-account simulation per the requirements. A &lt;code&gt;workload-account-oam-link.yaml&lt;/code&gt; template is provided for multi-account environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Template fix: LogDestination
&lt;/h3&gt;

&lt;p&gt;During deployment, &lt;code&gt;AWS::Logs::Destination&lt;/code&gt; failed because it requires a Kinesis Data Stream as target, not a Log Group. This clarified that a CloudWatch Logs destination is not a generic alias for another log group; it is a cross-account subscription destination backed by a supported streaming target such as Kinesis Data Streams or Kinesis Data Firehose. The template was fixed to use Log Group + IAM Role directly, with Kinesis Firehose as an optional future addition for high-volume cross-account log aggregation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgofilfe774cxejlx95r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgofilfe774cxejlx95r.png" alt="CloudWatch Cross-Account Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Persistent Store: Closing the Restart Event-Loss Window
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;is-mandatory=false&lt;/code&gt;, ONTAP drops FPolicy notifications when no server is connected. During Fargate task restarts (~30 seconds), events are lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;ONTAP 9.14.1+ Persistent Store queues file access events on the SVM during server disconnection for asynchronous non-mandatory policies. When the external server reconnects, queued events can be replayed. Note that synchronous policies and asynchronous mandatory policies are not supported — Persistent Store is specifically designed for the asynchronous non-mandatory configuration used in this pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration (via Lambda → ONTAP REST API)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Step 1: Create volume (1GB, unix security style)
  POST /api/storage/volumes → 202 Accepted (3s)

Step 2: Create Persistent Store
  POST /api/protocols/fpolicy/{svm}/persistent-stores → 201 Created
  Body: {"name": "fpolicy_aws_store", "volume": "fpolicy_persistent_store"}

Step 3: Attach to policy (disable → attach → re-enable)
  PATCH /api/protocols/fpolicy/{svm}/policies/fpolicy_aws
  Body: {"persistent_store": "fpolicy_aws_store"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
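&lt;p&gt;As a rough illustration of steps 2 and 3, the sketch below builds the ordered list of REST calls without sending them. The helper name &lt;code&gt;persistent_store_plan&lt;/code&gt; is hypothetical, and the &lt;code&gt;enabled&lt;/code&gt; bodies for the disable and re-enable calls are an assumption based on the field returned by the verification request; re-enabling may need additional fields on a real system. The volume creation call is omitted because its body is not shown here.&lt;/p&gt;

```python
import json

POLICY = "fpolicy_aws"
STORE = "fpolicy_aws_store"
VOLUME = "fpolicy_persistent_store"


def persistent_store_plan(svm):
    """Return the ordered (method, path, body) calls for attaching a Persistent Store."""
    base = f"/api/protocols/fpolicy/{svm}"
    policy_path = f"{base}/policies/{POLICY}"
    return [
        ("POST", f"{base}/persistent-stores", {"name": STORE, "volume": VOLUME}),
        ("PATCH", policy_path, {"enabled": False}),           # assumed disable body
        ("PATCH", policy_path, {"persistent_store": STORE}),  # attach the store
        ("PATCH", policy_path, {"enabled": True}),            # assumed re-enable body
    ]


# Print the plan with the same {svm} placeholder used in the listing above.
for method, path, body in persistent_store_plan("{svm}"):
    print(method, path, json.dumps(body))
```

&lt;p&gt;Keeping the plan as data makes the disable, attach, re-enable ordering easy to review before anything is sent to ONTAP.&lt;/p&gt;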



&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/api/protocols/fpolicy/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;/policies/fpolicy_aws?fields=persistent_store,enabled&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"persistent_store"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fpolicy_aws_store"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An ECS task stop → restart test confirmed that ONTAP reconnects to the new task within seconds. With Persistent Store configured, events generated during the tested ~30-second Fargate restart window are expected to be queued by ONTAP and replayed after reconnection. Phase 12 will validate this with real NFS/SMB file operations end to end, including verification of replay ordering and completeness under sustained write load.&lt;/p&gt;

&lt;h3&gt;
  
  
  IP Updater Lambda extension
&lt;/h3&gt;

&lt;p&gt;The IP Updater Lambda was extended with a generic ONTAP API access capability (&lt;code&gt;action: ontap_api&lt;/code&gt;). This enables remote ONTAP configuration without a bastion host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda invoke &lt;span class="nt"&gt;--function-name&lt;/span&gt; fsxn-fpolicy-ip-updater &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{"action": "ontap_api", "method": "GET", "path": "/api/protocols/fpolicy/{svm}/persistent-stores"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /tmp/result.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. HYBRID Mode Idempotency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;In HYBRID mode, both the EventBridge Scheduler (polling) and the FPolicy EventBridge Rule (event-driven) can trigger processing for the same file. Without deduplication, the same file gets processed twice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A DynamoDB-based Idempotency Store with TTL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fsxn-s3ap-idempotency-store&lt;/span&gt;
  &lt;span class="s"&gt;pk&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{uc_name}#{file_path}"&lt;/span&gt;
  &lt;span class="na"&gt;sk&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{operation_type}#{timestamp_bucket}"&lt;/span&gt;
  &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;current_time + 7 days&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;timestamp_bucket&lt;/code&gt; rounds timestamps to 5-minute windows. Two events for the same file within the same 5-minute window are considered duplicates.&lt;/p&gt;
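&lt;p&gt;A minimal sketch of the bucketing and key construction follows. It assumes "rounds" means flooring to the start of the window, which matches the &lt;code&gt;T10:35&lt;/code&gt; style of sort key; the function names and the &lt;code&gt;now&lt;/code&gt; parameter are hypothetical, and the minute arithmetic only works for window sizes that divide 60.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW_MINUTES = 5
TTL_DAYS = 7


def timestamp_bucket(ts, window_minutes=DEDUP_WINDOW_MINUTES):
    """Floor a datetime to the start of its N-minute window (N must divide 60)."""
    floored_minute = ts.minute - ts.minute % window_minutes
    floored = ts.replace(minute=floored_minute, second=0, microsecond=0)
    return floored.strftime("%Y-%m-%dT%H:%M")


def idempotency_key(uc_name, file_path, operation_type, ts, now=None):
    """Build the pk, sk, and ttl attributes described for the idempotency table."""
    now = datetime.now(timezone.utc) if now is None else now
    return {
        "pk": f"{uc_name}#{file_path}",
        "sk": f"{operation_type}#{timestamp_bucket(ts)}",
        "ttl": int((now + timedelta(days=TTL_DAYS)).timestamp()),
    }


event_time = datetime(2026, 5, 15, 10, 37, 42, tzinfo=timezone.utc)
key = idempotency_key("legal-compliance", "/legal/audit/report.pdf", "create", event_time)
print(key["pk"])  # legal-compliance#/legal/audit/report.pdf
print(key["sk"])  # create#2026-05-15T10:35
```

&lt;p&gt;With flooring, events at 10:35:00 and 10:39:59 share a bucket while an event at 10:40:00 starts a new one, so two triggers that straddle a window boundary are not deduplicated.&lt;/p&gt;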

&lt;h3&gt;
  
  
  Step Functions integration
&lt;/h3&gt;

&lt;p&gt;The Idempotency Checker runs as the first step in any UC's Step Functions workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IdempotencyCheck"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"IdempotencyCheck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${IdempotencyCheckerFunction.Arn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CheckDuplicate"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CheckDuplicate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.idempotency.is_duplicate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"BooleanEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SkipDuplicate"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProcessEvent"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Race conditions are handled via DynamoDB conditional writes (&lt;code&gt;attribute_not_exists(pk)&lt;/code&gt;). If two executions race, only one succeeds — the other gets &lt;code&gt;ConditionalCheckFailedException&lt;/code&gt; and skips.&lt;/p&gt;
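&lt;p&gt;The conditional write can be sketched as follows. This is an illustration rather than the project's Idempotency Checker: the function name &lt;code&gt;claim_event&lt;/code&gt; is hypothetical, and the client is passed in so the function works with a boto3 DynamoDB client (for example &lt;code&gt;boto3.client("dynamodb")&lt;/code&gt;) or a test double.&lt;/p&gt;

```python
def claim_event(dynamodb_client, table_name, pk, sk, ttl_epoch):
    """Return True if this execution wins the claim, False if it is a duplicate."""
    try:
        dynamodb_client.put_item(
            TableName=table_name,
            Item={
                "pk": {"S": pk},
                "sk": {"S": sk},
                "ttl": {"N": str(ttl_epoch)},
            },
            # Succeeds only when no item with this pk and sk exists yet.
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except dynamodb_client.exceptions.ConditionalCheckFailedException:
        # Another execution already recorded the same key: treat as duplicate.
        return False
```

&lt;p&gt;A checker Lambda built this way could return something like &lt;code&gt;{"is_duplicate": not claimed}&lt;/code&gt;, which the Choice state then reads; the exact result path depends on how the state machine maps the task output.&lt;/p&gt;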

&lt;h3&gt;
  
  
  Tuning considerations
&lt;/h3&gt;

&lt;p&gt;The 5-minute bucket is intentionally conservative for HYBRID-mode deduplication. UCs that require multiple legitimate updates to the same file within a short interval can tune the bucket size via the &lt;code&gt;DEDUP_WINDOW_MINUTES&lt;/code&gt; environment variable, or include an additional event attribute (such as file size or ONTAP event sequence information) in the sort key to distinguish genuinely distinct events from duplicates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcax9hw30kkp1ypuhx22z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcax9hw30kkp1ypuhx22z.png" alt="DynamoDB Idempotency Store"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Live E2E verification
&lt;/h3&gt;

&lt;p&gt;We verified the deduplication mechanism directly against the deployed DynamoDB table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1st PutItem (pk=legal-compliance#/legal/audit/report.pdf, sk=create#2026-05-15T10:35):
  → Success (new record created)

2nd PutItem (same key, condition: attribute_not_exists(pk)):
  → ConditionalCheckFailedException ✅ (duplicate detected)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This proves the table-level duplicate rejection mechanism used by HYBRID mode. When the Idempotency Checker is the first Step Functions task, the second execution can be rejected before downstream processing starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. FR-2 Migration Path
&lt;/h2&gt;

&lt;p&gt;If/when native S3AP notifications become available through the FR-2 track, the migration is designed to be parameter-change-only for UCs that do not depend on FPolicy-only fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;TriggerMode&lt;/th&gt;
&lt;th&gt;FPolicy&lt;/th&gt;
&lt;th&gt;S3AP Notifications&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A (parallel)&lt;/td&gt;
&lt;td&gt;HYBRID&lt;/td&gt;
&lt;td&gt;Active&lt;/td&gt;
&lt;td&gt;Active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B (cutover)&lt;/td&gt;
&lt;td&gt;EVENT_DRIVEN&lt;/td&gt;
&lt;td&gt;Disabled&lt;/td&gt;
&lt;td&gt;Active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C (cleanup)&lt;/td&gt;
&lt;td&gt;EVENT_DRIVEN&lt;/td&gt;
&lt;td&gt;Removed&lt;/td&gt;
&lt;td&gt;Active&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Schema compatibility challenges
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;FPolicy field&lt;/th&gt;
&lt;th&gt;S3AP equivalent&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;S3AP may not include NTFS user info&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;operation_type: rename&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;S3 events don't have rename&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;protocol&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Always "s3"&lt;/td&gt;
&lt;td&gt;Loss of NFS/SMB distinction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;UCs that depend on &lt;code&gt;user_name&lt;/code&gt; (permission-aware scenarios) may need to maintain FPolicy even after FR-2 GA.&lt;/p&gt;

&lt;p&gt;The full migration path is documented in &lt;code&gt;docs/guides/fr2-migration-path.md&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Test Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Existing tests (Phase 1-10)&lt;/td&gt;
&lt;td&gt;391&lt;/td&gt;
&lt;td&gt;All PASS ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;protobuf parser tests&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;All PASS ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idempotency checker tests&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;All PASS ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy engine tests&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;All PASS ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skipped (handler refactored)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Expected ⏭️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;435 + 3 skipped&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;All PASS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  CloudFormation validation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cfn_yaml parse (all 17 UCs)&lt;/td&gt;
&lt;td&gt;17/17 PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aws cloudformation validate-template&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;17/17 PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shared templates (observability, idempotency, OAM link)&lt;/td&gt;
&lt;td&gt;4/4 PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  9. Deployed AWS Resources
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Resources&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-shared-observability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OAM Sink, Dashboard, SNS, X-Ray Group, IAM Roles&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-idempotency-store&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DynamoDB (PAY_PER_REQUEST, TTL, PITR)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-fpolicy-routing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;EventBridge Bus, Bridge Lambda, Idempotency Table&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-fp-srv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ECS Fargate Cluster, FPolicy Server Service&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-fpolicy-ingestion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SQS Queue, DLQ, IP Updater Lambda&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  ONTAP resources
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy policy &lt;code&gt;fpolicy_aws&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Enabled, persistent_store attached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store &lt;code&gt;fpolicy_aws_store&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Active (1GB volume)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine format&lt;/td&gt;
&lt;td&gt;XML (protobuf tested, reverted due to framing)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Post-deployment health check (2026-05-15)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy Server (ECS Fargate)&lt;/td&gt;
&lt;td&gt;✅ Running&lt;/td&gt;
&lt;td&gt;ONTAP connecting every 10s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS Ingestion Queue&lt;/td&gt;
&lt;td&gt;✅ Empty (0/0/0)&lt;/td&gt;
&lt;td&gt;No stuck messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy Policy&lt;/td&gt;
&lt;td&gt;✅ Enabled&lt;/td&gt;
&lt;td&gt;persistent_store + engine attached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB Idempotency&lt;/td&gt;
&lt;td&gt;✅ Active&lt;/td&gt;
&lt;td&gt;TTL enabled, PITR on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNS Alerts&lt;/td&gt;
&lt;td&gt;⚠️ PendingConfirmation&lt;/td&gt;
&lt;td&gt;Email subscription awaiting confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge Custom Bus&lt;/td&gt;
&lt;td&gt;✅ Operational&lt;/td&gt;
&lt;td&gt;Dispatch routing verified via put-events&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7iz4evyqw2nta28enhwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7iz4evyqw2nta28enhwm.png" alt="CloudFormation Stacks"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12jhhrmk3ewk5jbzwzha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12jhhrmk3ewk5jbzwzha.png" alt="ECS FPolicy Server"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  10. Deployment Learnings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;validate-template&lt;/code&gt; fails for autonomous-driving&lt;/td&gt;
&lt;td&gt;Template exceeds 51,200 byte inline limit&lt;/td&gt;
&lt;td&gt;Use S3 URL for validation; added CI job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;AWS::Logs::Destination&lt;/code&gt; creation fails&lt;/td&gt;
&lt;td&gt;Requires Kinesis target, not Log Group&lt;/td&gt;
&lt;td&gt;Removed LogDestination, use Log Group directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OAM Link same-account error&lt;/td&gt;
&lt;td&gt;AWS design: links only work cross-account&lt;/td&gt;
&lt;td&gt;Documented; provided workload-account template&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SchedulerRole created in EVENT_DRIVEN mode&lt;/td&gt;
&lt;td&gt;Missing Condition on SchedulerRole&lt;/td&gt;
&lt;td&gt;Added &lt;code&gt;Condition: IsPollingOrHybrid&lt;/code&gt; to 14 templates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;protobuf messages rejected as invalid length&lt;/td&gt;
&lt;td&gt;Different TCP framing in protobuf mode&lt;/td&gt;
&lt;td&gt;Documented; XML mode maintained for stability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;test_fpolicy_engine&lt;/code&gt; import errors&lt;/td&gt;
&lt;td&gt;Handler refactored to IP Updater&lt;/td&gt;
&lt;td&gt;Added missing exports; skipped 3 legacy tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store &lt;code&gt;autoflush_enabled&lt;/code&gt; rejected&lt;/td&gt;
&lt;td&gt;Parameter name not supported in REST API&lt;/td&gt;
&lt;td&gt;Removed; ONTAP uses defaults&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policy modification while enabled&lt;/td&gt;
&lt;td&gt;ONTAP rejects PATCH on enabled policy&lt;/td&gt;
&lt;td&gt;Disable → modify → re-enable sequence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.pdf&lt;/code&gt; suffix causes multi-UC fan-out&lt;/td&gt;
&lt;td&gt;EventBridge OR evaluation within array&lt;/td&gt;
&lt;td&gt;Documented; use prefix as the primary filter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge → CloudWatch Logs delivery fails&lt;/td&gt;
&lt;td&gt;Missing resource policy on log group&lt;/td&gt;
&lt;td&gt;Added &lt;code&gt;logs:PutLogEvents&lt;/code&gt; permission for events.amazonaws.com&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  11. Production Adoption Guidance
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Recommended rollout model
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;TriggerMode&lt;/code&gt; is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption. A detailed guide with rollback criteria, UC classification, and CloudFormation behavior matrix is available in &lt;code&gt;docs/guides/triggermode-rollout.md&lt;/code&gt;. The summary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;POLLING&lt;/code&gt; for all UCs to preserve existing behavior.&lt;/li&gt;
&lt;li&gt;Enable the shared FPolicy ingestion pipeline and validate EventBridge routing with &lt;code&gt;put-events&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Move one low-risk UC to &lt;code&gt;HYBRID&lt;/code&gt; and observe duplicate rate, Step Functions success rate, and SQS backlog.&lt;/li&gt;
&lt;li&gt;Move latency-sensitive UCs to &lt;code&gt;EVENT_DRIVEN&lt;/code&gt; after routing and idempotency validation.&lt;/li&gt;
&lt;li&gt;Keep compliance-sensitive UCs in &lt;code&gt;HYBRID&lt;/code&gt; until Persistent Store replay is validated end to end.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Rollback&lt;/strong&gt;: At any stage, reverting &lt;code&gt;TriggerMode&lt;/code&gt; to the previous value via CloudFormation stack update restores the CloudFormation-managed resources for the prior mode. Operators should wait for stack update completion and verify scheduler/rule state, SQS backlog, and Step Functions executions before declaring rollback complete. The sequence is always &lt;code&gt;EVENT_DRIVEN → HYBRID → POLLING&lt;/code&gt; (never skip HYBRID when rolling back from EVENT_DRIVEN in production).&lt;/p&gt;
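&lt;p&gt;The rollback ordering above could be enforced with a small pre-deployment check. This is a sketch under assumptions: the function name &lt;code&gt;is_allowed_rollback&lt;/code&gt; is hypothetical, and where such a check would run (a pipeline step, a change-review script) is left open.&lt;/p&gt;

```python
ROLLBACK_ORDER = ["EVENT_DRIVEN", "HYBRID", "POLLING"]


def is_allowed_rollback(current, target):
    """Allow only a single step back along EVENT_DRIVEN, HYBRID, POLLING."""
    if current not in ROLLBACK_ORDER or target not in ROLLBACK_ORDER:
        raise ValueError(f"unknown TriggerMode: {current!r} or {target!r}")
    # A valid rollback moves exactly one position toward POLLING.
    return ROLLBACK_ORDER.index(target) - ROLLBACK_ORDER.index(current) == 1


print(is_allowed_rollback("EVENT_DRIVEN", "HYBRID"))   # True
print(is_allowed_rollback("EVENT_DRIVEN", "POLLING"))  # False
```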
&lt;h3&gt;
  
  
  Security guardrails for ONTAP API automation
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ontap_api&lt;/code&gt; action is intended for controlled operations automation, not as an unrestricted ONTAP proxy. The handler implementation (&lt;code&gt;shared/lambdas/fpolicy_engine/handler.py&lt;/code&gt;) enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Path allowlist&lt;/strong&gt;: Only &lt;code&gt;/api/protocols/fpolicy/&lt;/code&gt;, &lt;code&gt;/api/storage/volumes&lt;/code&gt;, &lt;code&gt;/api/storage/aggregates&lt;/code&gt;, and &lt;code&gt;/api/cluster/jobs/&lt;/code&gt; are permitted. All other paths return HTTP 403.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DELETE method restriction&lt;/strong&gt;: Disabled by default. Requires explicit &lt;code&gt;ONTAP_API_ALLOW_DELETE=true&lt;/code&gt; environment variable to enable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log redaction&lt;/strong&gt;: Only method and path are logged — request bodies containing credentials are never written to CloudWatch Logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured audit log&lt;/strong&gt;: Each invocation emits a structured log line with &lt;code&gt;method&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;correlation_id&lt;/code&gt;, and request timestamp. Caller identity can be correlated via CloudTrail Lambda Invoke events without logging sensitive request/response bodies.&lt;/li&gt;
&lt;/ul&gt;
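&lt;p&gt;The guardrails listed above can be sketched roughly as follows. This is not the code in &lt;code&gt;handler.py&lt;/code&gt;: the function names are hypothetical, returning 403 for a blocked DELETE and rejecting paths that contain &lt;code&gt;..&lt;/code&gt; are assumptions, and the audit line carries only the fields named in the list.&lt;/p&gt;

```python
import json
import os
import time

ALLOWED_PREFIXES = (
    "/api/protocols/fpolicy/",
    "/api/storage/volumes",
    "/api/storage/aggregates",
    "/api/cluster/jobs/",
)


def authorize(method, path, env=None):
    """Return an HTTP-style status: 200 if the call may proceed, else 403."""
    env = os.environ if env is None else env
    # Assumed extra check: refuse parent-directory segments outright.
    if ".." in path or not path.startswith(ALLOWED_PREFIXES):
        return 403
    delete_allowed = env.get("ONTAP_API_ALLOW_DELETE", "").lower() == "true"
    if method.upper() == "DELETE" and not delete_allowed:
        return 403
    return 200


def audit_line(method, path, status, correlation_id):
    """Build a structured log line without any request or response body."""
    return json.dumps({
        "method": method,
        "path": path,
        "status": status,
        "correlation_id": correlation_id,
        "timestamp": int(time.time()),
    })


status = authorize("GET", "/api/protocols/fpolicy/{svm}/persistent-stores", env={})
print(audit_line("GET", "/api/protocols/fpolicy/{svm}/persistent-stores", status, "demo-1"))
```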

&lt;p&gt;Production deployments should additionally restrict Lambda invoke permissions to deployment automation roles only, and store ONTAP credentials in Secrets Manager with rotation planning.&lt;/p&gt;

&lt;p&gt;Pass &lt;code&gt;correlation_id&lt;/code&gt; in the event payload to trace ONTAP API operations across deployment automation, Lambda logs, and operational runbooks.&lt;/p&gt;
&lt;h3&gt;
  
  
  MSP and multi-customer naming
&lt;/h3&gt;

&lt;p&gt;For MSP or multi-customer deployments, parameterize shared resource names with &lt;code&gt;CustomerId&lt;/code&gt;, &lt;code&gt;EnvironmentName&lt;/code&gt;, and &lt;code&gt;Region&lt;/code&gt; to avoid cross-tenant naming collisions — for example: &lt;code&gt;{customer}-{env}-fsxn-fpolicy-events&lt;/code&gt; and &lt;code&gt;{customer}-{env}-s3ap-idempotency-store&lt;/code&gt;. Full naming guidance with CloudFormation examples is in &lt;code&gt;docs/guides/triggermode-rollout.md&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  TriggerMode governance
&lt;/h3&gt;

&lt;p&gt;For enterprise rollout, treat &lt;code&gt;TriggerMode&lt;/code&gt; as a governed operational control. Changes from POLLING to HYBRID or EVENT_DRIVEN should be reviewed with routing test results, idempotency validation, alarm readiness, and rollback owner assignment. Track TriggerMode changes through your change management process (Change Manager, GitOps PR, or deployment pipeline logs) — not just CloudFormation stack events.&lt;/p&gt;
&lt;h3&gt;
  
  
  Event payload sensitivity
&lt;/h3&gt;

&lt;p&gt;For public-sector or regulated workloads, treat file paths and FPolicy metadata as potentially sensitive: in these environments, metadata is data. Classify file paths, user names, and protocol context, and define which event fields are logged, masked, hashed, or excluded before events are forwarded to cross-account observability systems or long-term audit storage. A data classification guide is available in &lt;code&gt;docs/guides/data-classification.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For regulated workloads, duplicate suppression should not mean audit disappearance; skipped duplicate events should still be recorded with correlation IDs and deduplication decisions. See &lt;code&gt;docs/guides/compliance-audit-ledger.md&lt;/code&gt; for the audit ledger design.&lt;/p&gt;
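
&lt;p&gt;To make the "suppress but still record" idea concrete, here is a minimal in-memory sketch. The set stands in for the Idempotency Store's conditional write and the list stands in for the audit ledger; the field names &lt;code&gt;idempotency_key&lt;/code&gt; and &lt;code&gt;decision&lt;/code&gt; are hypothetical and not taken from the ledger design document.&lt;/p&gt;

```python
def process_event(event, seen_keys, ledger):
    """Return True if the event should be processed; always record the decision."""
    key = event["idempotency_key"]
    duplicate = key in seen_keys
    if not duplicate:
        seen_keys.add(key)
    # The ledger entry is written for both outcomes, so a skipped duplicate
    # stays visible to auditors instead of disappearing silently.
    ledger.append({
        "correlation_id": event["correlation_id"],
        "idempotency_key": key,
        "decision": "skipped_duplicate" if duplicate else "processed",
    })
    return not duplicate
```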
&lt;h3&gt;
  
  
  File readiness for event-driven pipelines
&lt;/h3&gt;

&lt;p&gt;For large files, an FPolicy &lt;code&gt;create&lt;/code&gt; or &lt;code&gt;write&lt;/code&gt; event may arrive before the file write is complete, particularly over NFSv3, which has no close operation to signal that a writer has finished. UCs that process large analytics, imaging, geospatial, or EDA files should combine event-driven triggering with a readiness strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rename-based commit&lt;/strong&gt;: Write to a temporary path, rename to final path on completion. Process only &lt;code&gt;rename&lt;/code&gt; events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marker file&lt;/strong&gt;: Write a &lt;code&gt;.done&lt;/code&gt; or &lt;code&gt;_SUCCESS&lt;/code&gt; marker after the primary file is complete. Trigger on marker creation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size-stability check&lt;/strong&gt;: Poll file size at N-second intervals; start processing when size is stable across two consecutive checks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The existing &lt;code&gt;WRITE_COMPLETE_DELAY_SEC&lt;/code&gt; (default 5s) in the FPolicy server provides a basic delay, but is insufficient for multi-GB files. A fixed delay should be treated as a fallback, not a correctness guarantee. The new UC checklist (&lt;code&gt;docs/guides/new-uc-checklist.md&lt;/code&gt;) includes file readiness as a required design decision for large-file UCs.&lt;/p&gt;
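
&lt;p&gt;The size-stability check listed above could look roughly like the following. It is a sketch under stated assumptions, not the FPolicy server's code: the function name and defaults are mine, and the size lookup and sleep are injectable so the logic can be exercised without a real file system.&lt;/p&gt;

```python
import os
import time


def wait_until_size_stable(path, interval_sec=5.0, max_checks=60,
                           get_size=os.path.getsize, sleep=time.sleep):
    """Return the file size once two consecutive polls match, else None."""
    previous = None
    for _ in range(max_checks):
        current = get_size(path)
        if previous is not None and current == previous:
            return current
        previous = current
        sleep(interval_sec)
    # Still changing after max_checks polls: the caller decides whether to retry.
    return None
```

&lt;p&gt;A stalled writer can look stable for one interval, so this check reduces the risk of reading a partial file but does not eliminate it; rename-based commit or a marker file remains the stronger signal.&lt;/p&gt;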
&lt;h3&gt;
  
  
  Recommended operational alarms
&lt;/h3&gt;

&lt;p&gt;A ready-to-deploy CloudFormation template (&lt;code&gt;shared/cfn/recommended-alarms.yaml&lt;/code&gt;) defines the following alarms. Severity labels are examples and should be mapped to each organization's incident classification model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQS &lt;code&gt;ApproximateAgeOfOldestMessage&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&amp;gt; 300 seconds for 5 minutes&lt;/td&gt;
&lt;td&gt;SEV2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS DLQ &lt;code&gt;ApproximateNumberOfMessagesVisible&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&amp;gt; 0&lt;/td&gt;
&lt;td&gt;SEV2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step Functions &lt;code&gt;ExecutionsFailed&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&amp;gt; 0 for critical production UCs&lt;/td&gt;
&lt;td&gt;SEV2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS &lt;code&gt;RunningTaskCount&lt;/code&gt; &amp;lt; &lt;code&gt;DesiredTaskCount&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;for &amp;gt; 60 seconds&lt;/td&gt;
&lt;td&gt;SEV1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB &lt;code&gt;ThrottledRequests&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&amp;gt; 0&lt;/td&gt;
&lt;td&gt;SEV3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ECS desired-vs-running alarm may require Container Insights, metric math, or a custom service health metric depending on how ECS service metrics are emitted in the target account. For high-volume batch UCs, failure-rate-based alarms may be less noisy than absolute failure-count alarms.&lt;/p&gt;

&lt;p&gt;Deploy as a standalone monitoring stack or integrate into each UC template's &lt;code&gt;EnableCloudWatchAlarms&lt;/code&gt; section.&lt;/p&gt;
&lt;h3&gt;
  
  
  Initial SLO candidates
&lt;/h3&gt;

&lt;p&gt;While formal SLO definition is a Phase 12 deliverable, the following targets serve as initial operational guidance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99% of events delivered to SQS within 60 seconds under normal load&lt;/li&gt;
&lt;li&gt;FPolicy server reconnect within 60 seconds after ECS task replacement&lt;/li&gt;
&lt;li&gt;SQS backlog recovered within 5 minutes after planned maintenance&lt;/li&gt;
&lt;li&gt;Step Functions start latency under 2 minutes for EVENT_DRIVEN UCs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Persistent Store sizing
&lt;/h3&gt;

&lt;p&gt;For environments requiring Persistent Store, size the volume based on expected outage duration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;required_size = event_rate_per_sec × max_outage_duration_sec × avg_event_size_bytes × safety_factor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example: 100 events/sec × 300s outage × 500 bytes × 2.0 safety ≈ 30 MB (15 MB of raw event data, doubled by the safety factor). The 1 GB volume configured in Phase 11 holds roughly 2 million 500-byte event records before any operational safety margin is applied; with a 2.0 safety factor, treat the practical planning capacity as closer to 1 million events. High-frequency environments (1000+ events/sec) should increase the volume size proportionally and validate replay rate after reconnection.&lt;/p&gt;
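
&lt;p&gt;The sizing formula is easy to script when comparing scenarios. The sketch below reproduces the example numbers; the function names are mine, and 1 GB is assumed to mean 10^9 bytes.&lt;/p&gt;

```python
def required_store_bytes(event_rate_per_sec, max_outage_duration_sec,
                         avg_event_size_bytes, safety_factor=2.0):
    """required_size = rate x outage duration x event size x safety factor."""
    return (event_rate_per_sec * max_outage_duration_sec
            * avg_event_size_bytes * safety_factor)


def planning_capacity_events(volume_bytes, avg_event_size_bytes, safety_factor=2.0):
    """Number of events a volume can hold once the safety factor is applied."""
    return int(volume_bytes / (avg_event_size_bytes * safety_factor))


# 100 events/sec, 300 s outage, 500-byte events, 2.0 safety factor
print(required_store_bytes(100, 300, 500) / 1_000_000)  # 30.0 (MB)
# 1 GB volume, 500-byte events, 2.0 safety factor
print(planning_capacity_events(1_000_000_000, 500))     # 1000000
```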

&lt;p&gt;Full sizing table with scenario-based estimates is available in &lt;code&gt;docs/event-driven/fpolicy-persistent-store.md&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Next Phase Outlook
&lt;/h2&gt;

&lt;p&gt;Phase 11 completes the event-driven integration layer. Remaining work for Phase 12:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protocol and replay validation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;protobuf TCP framing&lt;/strong&gt;: Consult NetApp support on protobuf wire format; adapt &lt;code&gt;read_fpolicy_message()&lt;/code&gt; for frameless protobuf&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store replay E2E validation&lt;/strong&gt;: NFS/SMB file creation during Fargate restart → verify that queued events are replayed and delivered to SQS without loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay storm testing&lt;/strong&gt;: Generate events during FPolicy server downtime, reconnect, measure replay duration, SQS ingestion rate, Step Functions concurrency, and whether downstream throttling occurs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scale and operations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High-load testing&lt;/strong&gt;: 1000+ events/sec stress test with Fargate scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO definition&lt;/strong&gt;: Define event ingestion latency, processing success rate, reconnect time, and replay completion time targets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account OAM Link&lt;/strong&gt;: Deploy &lt;code&gt;workload-account-oam-link.yaml&lt;/code&gt; in a second account&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Production rollout&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Production UC deployment&lt;/strong&gt;: Deploy a UC template with &lt;code&gt;TriggerMode=EVENT_DRIVEN&lt;/code&gt; and verify end-to-end file operation → Step Functions execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Already verified in Phase 11&lt;/strong&gt; (no longer Phase 12 candidates):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ EventBridge dispatch routing (put-events → rule matching → CloudWatch Logs)&lt;/li&gt;
&lt;li&gt;✅ Idempotency Store deduplication (conditional write rejection)&lt;/li&gt;
&lt;li&gt;✅ Persistent Store configuration (ONTAP REST API)&lt;/li&gt;
&lt;li&gt;✅ ECS task restart + ONTAP reconnection&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Who should care about Phase 11?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform teams&lt;/strong&gt; can now switch any UC between polling and event-driven with a single parameter change — no template surgery required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations / SRE teams&lt;/strong&gt; get Cross-Account Observability with a pre-built dashboard, recommended alarm thresholds, and a rollout/rollback model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance teams&lt;/strong&gt; get Persistent Store support to close the tested Fargate restart event-loss window, with full replay validation planned for Phase 12&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt; get documented guardrails for the ONTAP API automation path, including allowlist, audit recommendations, and event payload sensitivity guidance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture teams&lt;/strong&gt; get a documented FR-2 migration path — if/when native S3AP notifications become available, the transition is a parameter change for compatible UCs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data engineering teams&lt;/strong&gt; get file-readiness guidance for large-file analytics pipelines where event arrival precedes write completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MSPs and partners&lt;/strong&gt; get cross-account templates, tenant-aware naming guidance, and a standardized TriggerMode control for multi-customer deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance engineers&lt;/strong&gt; get protobuf evaluation data (34.6% size reduction) and a clear path to enabling it once TCP framing is resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps teams&lt;/strong&gt; get CI-integrated template validation (cfn_yaml + validate-template) catching issues before deployment&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phase 11 transforms the FPolicy event-driven pipeline from a single-UC reference implementation into a production-ready, 17-UC integrated system. &lt;code&gt;TriggerMode&lt;/code&gt; is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption, enabling platform teams to move individual UCs through POLLING → HYBRID → EVENT_DRIVEN at their own pace.&lt;/p&gt;

&lt;p&gt;UC-specific EventBridge rules handle routing complexity through path-prefix ownership boundaries, while the Idempotency Store prevents duplicate processing in HYBRID mode. Persistent Store closes the known Fargate restart event-loss window at the ONTAP configuration layer, while Phase 12 will validate replay completeness with real NFS/SMB file operations.&lt;/p&gt;

&lt;p&gt;The protobuf evaluation yielded a valuable real-world finding: ONTAP uses different TCP framing for protobuf messages than for XML. The field-level parser is validated against test fixtures, but the transport layer needs adaptation — a focused Phase 12 task requiring NetApp consultation rather than a blocker.&lt;/p&gt;

&lt;p&gt;Phase 11 ships with 435 passing tests, 17 validated templates, 5 deployed CloudFormation stacks, and comprehensive documentation. Together with the production adoption guidance (rollout model, governance, security guardrails, event payload sensitivity, file readiness, alarm thresholds, Persistent Store sizing), it delivers the operational maturity needed for enterprise-grade event-driven file workflows on FSx for ONTAP.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Previous phases&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;Phase 1&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/public-sector-use-cases-unified-output-destination-and-a-localization-batch-fsx-for-ontap-s3-2hmo"&gt;Phase 7&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/operational-hardening-ci-grade-validation-and-pattern-c-b-hybrid-fsx-for-ontap-s3-access-587h"&gt;Phase 8&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/production-rollout-vpc-endpoint-auto-detection-and-the-cdk-no-go-fsx-for-ontap-s3-access-3lni"&gt;Phase 9&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>amazonfsxfornetappontap</category>
      <category>devops</category>
    </item>
    <item>
      <title>Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Fri, 15 May 2026 03:51:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/smart-routing-transfer-family-ingestion-and-voice-chat-permission-aware-rag-v42-3iml</link>
      <guid>https://dev.to/aws-builders/smart-routing-transfer-family-ingestion-and-voice-chat-permission-aware-rag-v42-3iml</guid>
      <description>&lt;h2&gt;
  
  
  What This Post Covers
&lt;/h2&gt;

&lt;p&gt;This is a companion article to the &lt;a href="https://dev.to/yoshikifujiwara/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;FSx for ONTAP S3 Access Points Serverless Patterns&lt;/a&gt; series. While that series focuses on serverless patterns for FSx for ONTAP S3 Access Points across industries, this post covers the &lt;strong&gt;v4.2 release&lt;/strong&gt; of the &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG" rel="noopener noreferrer"&gt;Agentic Access-Aware RAG&lt;/a&gt; system, a permission-aware RAG application built on FSx for ONTAP and Amazon Bedrock. The system is production-grade in the sense of CI coverage, permission filtering, guardrails, and deployment parameterization, although some v4.2 features still have follow-up E2E items listed in What's Next.&lt;/p&gt;

&lt;p&gt;The v4.2 release adds five features that address real-world enterprise needs: intelligent model routing for cost optimization, SFTP-based document ingestion for partners who can't use web UIs, automatic KB synchronization, operational guardrails for FSx ONTAP automation, and voice-based interaction via WebRTC.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Smart Routing Model Expansion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Enterprise RAG workloads have wildly different complexity levels. A simple "What's the office address?" query doesn't need the same model as "Analyze the Q4 financial report across all subsidiaries and identify cost reduction opportunities." Routing everything through a single model either wastes money or delivers poor quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: 3-Tier Automatic Routing
&lt;/h3&gt;

&lt;p&gt;The default routing tiers are configured for the model set currently enabled in this deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt; (greetings, factual lookups) → Claude Haiku 4.5 (&lt;code&gt;anthropic.claude-haiku-4-5-20251001-v1:0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex&lt;/strong&gt; (analysis, comparison, summarization) → Claude 3.5 Sonnet v2 (&lt;code&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-context&lt;/strong&gt; (multi-document reasoning, financial analysis) → Claude Opus 4 (&lt;code&gt;anthropic.claude-opus-4-0-20250514-v1:0&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact model IDs are deployment parameters (&lt;code&gt;lightweightModelId&lt;/code&gt;, &lt;code&gt;powerfulModelId&lt;/code&gt;, &lt;code&gt;heavyModelId&lt;/code&gt;), so teams can update to newer Sonnet/Opus releases without changing the routing logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                  User Query                          │
└──────────────────────┬──────────────────────────────┘
                       │
              ┌────────▼────────┐
              │  Complexity     │
              │  Classifier     │
              └───┬────┬────┬───┘
                  │    │    │
         Simple   │    │    │  Full-context
                  ▼    ▼    ▼
        ┌──────┐ ┌──────┐ ┌──────┐
        │Haiku │ │Sonnet│ │ Opus │
        │ 4.5  │ │3.5 v2│ │  4   │
        └──────┘ └──────┘ └──────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost labels below are illustrative per-query estimates for typical RAG prompts (~1K input tokens, ~500 output tokens) in this deployment, not fixed model prices. Actual cost depends on input/output tokens, prompt caching, region, and inference configuration.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Illustrative per-query cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;~$0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 3.5 v2&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4&lt;/td&gt;
&lt;td&gt;~$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Additionally, GPT-5.5 can be exposed as a manual selection option when OpenAI models on Amazon Bedrock are enabled for the account. In this deployment, the manual route is parameterized as &lt;code&gt;openai.gpt-5-5&lt;/code&gt;, but teams should verify the exact model ID, Region availability, inference profile, and preview access status in their own AWS account.&lt;/p&gt;

&lt;p&gt;If the selected model is unavailable or throttled, the router falls back to the next configured tier and emits a &lt;code&gt;RoutingFallback&lt;/code&gt; metric.&lt;/p&gt;
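
&lt;p&gt;The fallback behavior can be sketched as follows. This is not the project's router: I am assuming that "next configured tier" means the next more capable tier, and the &lt;code&gt;invoke&lt;/code&gt; and &lt;code&gt;emit_metric&lt;/code&gt; callables as well as the &lt;code&gt;ModelUnavailable&lt;/code&gt; exception are placeholders for whatever the real implementation uses.&lt;/p&gt;

```python
# Assumed ordering for this sketch: fall back toward the more capable tier.
TIER_ORDER = ["simple", "complex", "full-context"]


class ModelUnavailable(Exception):
    """Raised by the invoke callable when a model is unavailable or throttled."""


def invoke_with_fallback(tier, model_ids, invoke, emit_metric):
    """Try the selected tier, then each following tier; return (tier_used, result)."""
    start = TIER_ORDER.index(tier)
    last_error = None
    for position, candidate in enumerate(TIER_ORDER[start:]):
        if position:
            # Any attempt after the first one is a fallback.
            emit_metric("RoutingFallback", 1)
        try:
            return candidate, invoke(model_ids[candidate])
        except ModelUnavailable as error:
            last_error = error
    # Every remaining tier failed: surface the last error to the caller.
    raise last_error
```

&lt;p&gt;Returning the tier that actually served the request keeps the per-tier routing metrics honest when a fallback occurs.&lt;/p&gt;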

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;The classifier analyzes query characteristics — keyword count, presence of analytical terms, document references, context size — and routes to the appropriate tier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// complexity-classifier.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;classifyQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;contextSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ClassificationResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractFeatures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isGreeting&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wordCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hasAnalyticalTerms&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;contextSize&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;full-context&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CloudWatch EMF metrics track routing decisions, enabling cost analysis and route distribution monitoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SmartRouting&lt;/span&gt;
&lt;span class="py"&gt;Metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoutingCount&lt;/span&gt;
&lt;span class="py"&gt;Dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoutingTier (simple | complex | full-context | manual)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
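
&lt;p&gt;For readers unfamiliar with the Embedded Metric Format, a log line carrying the metric above might be built like this. The namespace, metric, and dimension names come from the listing above; the function name and the choice of the &lt;code&gt;Count&lt;/code&gt; unit are my assumptions.&lt;/p&gt;

```python
import json
import time


def routing_emf_record(tier, count=1):
    """Build one EMF log record for a routing decision."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": "SmartRouting",
                    "Dimensions": [["RoutingTier"]],
                    "Metrics": [{"Name": "RoutingCount", "Unit": "Count"}],
                }
            ],
        },
        # Dimension and metric values live at the top level of the record.
        "RoutingTier": tier,
        "RoutingCount": count,
    }


print(json.dumps(routing_emf_record("simple")))
```

&lt;p&gt;Printing the JSON line to stdout is enough in Lambda: CloudWatch Logs extracts the metric from the log event, so no &lt;code&gt;PutMetricData&lt;/code&gt; call is needed.&lt;/p&gt;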






&lt;h2&gt;
  
  
  2. Transfer Family FSx ONTAP Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Many enterprise partners — law firms, auditors, regulatory bodies — exchange documents via SFTP. They won't adopt a web UI. But their documents still need to flow into the RAG knowledge base with proper permission metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites and Limits
&lt;/h3&gt;

&lt;p&gt;This pattern assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FSx for ONTAP is running &lt;strong&gt;ONTAP 9.17.1 or later&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The FSx file system and S3 Access Point are in the &lt;strong&gt;same AWS Region&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;same AWS account&lt;/strong&gt; owns the file system and access point&lt;/li&gt;
&lt;li&gt;Transfer Family file operations follow the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;FSx S3 Access Point compatibility limits&lt;/a&gt;, including the &lt;strong&gt;5 GB upload limit&lt;/strong&gt; and unsupported rename/append operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Solution: SFTP → S3 Access Point → Bedrock KB
&lt;/h3&gt;

&lt;p&gt;This feature bridges AWS Transfer Family with the existing permission-aware RAG pipeline. The architecture aligns with the approach described in the &lt;a href="https://aws.amazon.com/blogs/storage/secure-sftp-file-sharing-with-aws-transfer-family-amazon-fsx-for-netapp-ontap-and-s3-access-points/" rel="noopener noreferrer"&gt;AWS Storage Blog&lt;/a&gt; — internal users access data via SMB/NFS, while external partners use SFTP, all reading/writing to the same FSx for ONTAP file system through S3 Access Points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Partner │     │ Transfer Family │     │ FSx ONTAP        │
│  (SFTP)  │────▶│ SFTP Server     │────▶│ S3 Access Point  │
└──────────┘     └─────────────────┘     └────────┬─────────┘
                                                   │
                                    ┌──────────────▼──────────────┐
                                    │  EventBridge Scheduler      │
                                    │  (5-min polling)            │
                                    └──────────────┬──────────────┘
                                                   │
                              ┌─────────────────────▼─────────────────────┐
                              │         Ingestion Trigger Lambda           │
                              │  • ListObjectsV2 → detect changes         │
                              │  • Invoke Metadata Generator (async)       │
                              │  • StartIngestionJob (deduplicated)        │
                              └─────────────────────┬─────────────────────┘
                                                    │
                    ┌───────────────────────────────┬┘
                    ▼                               ▼
        ┌───────────────────┐          ┌────────────────────┐
        │ Metadata Generator│          │ Bedrock KB         │
        │ (.metadata.json)  │          │ StartIngestionJob  │
        └───────────────────┘          └────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This remains a polling-based sync path; an event-based CloudTrail/EventBridge mode is listed in What's Next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Design Decisions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. HomeDirectoryMappings uses S3 AP Alias, not ARN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/transfer/latest/userguide/fsx-s3-access-points.html" rel="noopener noreferrer"&gt;Transfer Family documentation&lt;/a&gt; explains that FSx-backed Transfer Family access uses S3 Access Point aliases, but the failure mode is not obvious: using the full ARN in &lt;code&gt;HomeDirectoryMappings.Target&lt;/code&gt; produced cryptic access-denied errors in my deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Correct: use alias (e.g., "my-ap-ext-s3alias")&lt;/span&gt;
&lt;span class="nx"&gt;homeDirectoryMappings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;s3AccessPointAlias&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/uploads/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Deduplication via IN_PROGRESS check&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before triggering &lt;code&gt;StartIngestionJob&lt;/code&gt;, the Lambda checks if a job is already running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_trigger_ingestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_job_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;has_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_job_status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IN_PROGRESS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
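
&lt;p&gt;One way the current job status could be obtained is sketched below. It assumes the boto3 &lt;code&gt;bedrock-agent&lt;/code&gt; client and its &lt;code&gt;list_ingestion_jobs&lt;/code&gt; call; the function name is mine, and the parameter and response shapes should be verified against the SDK version in use.&lt;/p&gt;

```python
def latest_ingestion_job_status(client, knowledge_base_id, data_source_id):
    """Return the status of the most recently started ingestion job, or None."""
    response = client.list_ingestion_jobs(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        sortBy={"attribute": "STARTED_AT", "order": "DESCENDING"},
        maxResults=1,
    )
    jobs = response.get("ingestionJobSummaries", [])
    return jobs[0]["status"] if jobs else None


# Usage (requires boto3 and AWS credentials), shown as a comment only:
#   client = boto3.client("bedrock-agent")
#   status = latest_ingestion_job_status(client, kb_id, data_source_id)
```

&lt;p&gt;Bedrock also reports a &lt;code&gt;STARTING&lt;/code&gt; state for jobs that have been accepted but are not yet running; whether to treat it like &lt;code&gt;IN_PROGRESS&lt;/code&gt; is a design choice worth making explicitly.&lt;/p&gt;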



&lt;p&gt;&lt;strong&gt;3. Permission metadata auto-generation and trust boundary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new file is detected without a corresponding &lt;code&gt;.metadata.json&lt;/code&gt;, the Metadata Generator Lambda creates one based on the SFTP user's permission mapping in DynamoDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_sids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"S-1-5-21-xxx-1001"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_uids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"1001"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_gids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"1001"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer-family"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uploaded_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"partner-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uploaded_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-14T10:30:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SFTP user does not supply permission metadata directly. The Metadata Generator derives it from an &lt;strong&gt;administrator-managed DynamoDB mapping&lt;/strong&gt; and writes &lt;code&gt;.metadata.json&lt;/code&gt; using a service role. Partner upload roles are scoped to their home directory (&lt;code&gt;/uploads/{userName}/*&lt;/code&gt;).&lt;/p&gt;
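&lt;p&gt;The derivation step can be sketched roughly as follows. This is a minimal illustration, not the repository's Lambda: &lt;code&gt;PERMISSION_MAPPING&lt;/code&gt;, &lt;code&gt;build_metadata&lt;/code&gt;, and &lt;code&gt;metadata_key&lt;/code&gt; are hypothetical names, the in-memory dict stands in for the DynamoDB table, and the sidecar naming (object key plus &lt;code&gt;.metadata.json&lt;/code&gt;) is an assumption.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Hypothetical stand-in for the administrator-managed DynamoDB mapping,
# keyed by SFTP user name. Real code would read this item from DynamoDB.
PERMISSION_MAPPING = {
    "partner-a": {
        "allowed_sids": ["S-1-5-21-xxx-1001"],
        "allowed_uids": ["1001"],
        "allowed_gids": ["1001"],
    },
}


def build_metadata(user_name, mapping, now=None):
    """Return the .metadata.json body for a file uploaded by user_name.

    Raises KeyError when the user has no mapping, so no metadata is written
    and the document stays excluded by the fail-closed retrieval filter.
    """
    entry = mapping[user_name]  # KeyError on unknown users is intentional
    moment = now or datetime.now(timezone.utc)
    return {
        "allowed_sids": list(entry["allowed_sids"]),
        "allowed_uids": list(entry["allowed_uids"]),
        "allowed_gids": list(entry["allowed_gids"]),
        "source": "transfer-family",
        "uploaded_by": user_name,
        "uploaded_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def metadata_key(object_key):
    """Assumed sidecar convention: the object key plus '.metadata.json'."""
    return object_key + ".metadata.json"


if __name__ == "__main__":
    fixed = datetime(2026, 5, 14, 10, 30, tzinfo=timezone.utc)
    print(metadata_key("uploads/partner-a/contract.pdf"))
    print(json.dumps(build_metadata("partner-a", PERMISSION_MAPPING, fixed), indent=2))
```

&lt;p&gt;Nothing in the output comes from the uploaded file or the SFTP session beyond the user name, which is the point of the trust boundary: permissions are looked up, never supplied by the partner.&lt;/p&gt;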

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Security note&lt;/strong&gt;: The SFTP user's IAM role includes an explicit &lt;code&gt;Deny&lt;/code&gt; statement for &lt;code&gt;s3:PutObject&lt;/code&gt; and &lt;code&gt;s3:DeleteObject&lt;/code&gt; on &lt;code&gt;*.metadata.json&lt;/code&gt; keys within their home directory. This prevents partners from overwriting permission metadata generated by the service role.&lt;/p&gt;
&lt;/blockquote&gt;
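&lt;p&gt;As a rough sketch of what such a statement might look like, the helper below builds the &lt;code&gt;Deny&lt;/code&gt; entry as a Python dict. The function name, the &lt;code&gt;Sid&lt;/code&gt;, and the exact resource pattern are assumptions for illustration; the policy in the repository may be shaped differently.&lt;/p&gt;

```python
import json


def metadata_deny_statement(access_point_arn, user_name):
    """Build a Deny statement protecting *.metadata.json in a user's home directory.

    Objects behind an S3 Access Point are addressed in IAM as
    ACCESS_POINT_ARN/object/KEY, so the resource is built from that form.
    """
    home_prefix = "uploads/" + user_name + "/"
    return {
        "Sid": "DenyMetadataWrites",  # hypothetical statement ID
        "Effect": "Deny",
        "Action": ["s3:PutObject", "s3:DeleteObject"],
        "Resource": access_point_arn + "/object/" + home_prefix + "*.metadata.json",
    }


if __name__ == "__main__":
    arn = "arn:aws:s3:ap-northeast-1:ACCOUNT:accesspoint/my-ap"
    print(json.dumps(metadata_deny_statement(arn, "partner-a"), indent=2))
```

&lt;p&gt;Because an explicit &lt;code&gt;Deny&lt;/code&gt; takes precedence over any &lt;code&gt;Allow&lt;/code&gt; in IAM policy evaluation, the broader home-directory write permission does not need to carve out the metadata keys itself.&lt;/p&gt;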

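&lt;p&gt;The generated file uses the permission fields (&lt;code&gt;allowed_sids&lt;/code&gt;, &lt;code&gt;allowed_uids&lt;/code&gt;, &lt;code&gt;allowed_gids&lt;/code&gt;) that the existing permission-filtering RAG pipeline already reads, so SFTP uploads are filtered the same way as other documents.&lt;/p&gt;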

&lt;h3&gt;
  
  
  CDK Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx cdk deploy &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;enableTransferFamily&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;s3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:ACCOUNT:accesspoint/my-ap"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;transferFamilyS3ApAlias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-ap-ext-s3alias"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. KB Auto-Sync
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Documents on FSx for ONTAP change continuously — new files added, existing files updated. Without automatic synchronization, the Bedrock Knowledge Base becomes stale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;A lightweight Lambda (Python 3.12) polls the S3 Access Point every 5 minutes, compares the listing against a DynamoDB inventory, and calls &lt;code&gt;StartIngestionJob&lt;/code&gt; only when changes are detected. The inventory is updated once &lt;code&gt;StartIngestionJob&lt;/code&gt; is accepted (i.e., a &lt;code&gt;job_id&lt;/code&gt; is returned), so a job that fails after starting can hide its changes from the next scan. A planned pending/commit model will close that gap by committing the inventory only after the job reaches &lt;code&gt;SUCCEEDED&lt;/code&gt;. The current flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Scan → Diff → Start job → Update inventory (on job accepted)
&lt;/span&gt;&lt;span class="n"&gt;current_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scan_s3_access_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_ap_arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;previous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;has_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trigger_ingestion_if_needed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kb_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ds_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Inventory updated after StartIngestionJob is accepted.
&lt;/span&gt;        &lt;span class="c1"&gt;# Future: move to pending/commit model keyed on job SUCCEEDED.
&lt;/span&gt;        &lt;span class="nf"&gt;update_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
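&lt;p&gt;The diff step is where new, changed, and unchanged files are told apart. A minimal sketch of one way to write it is shown here; the &lt;code&gt;Diff&lt;/code&gt; dataclass and the fingerprint format (an ETag and size pair per key) are assumptions, and the repository's &lt;code&gt;compute_diff&lt;/code&gt; may differ.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Diff:
    new: list = field(default_factory=list)
    changed: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)

    @property
    def has_changes(self):
        # Only additions and updates count as changes in this sketch.
        return bool(self.new or self.changed)


def compute_diff(current_files, previous):
    """Classify each scanned key as new, changed, or unchanged.

    Both arguments map object key to a fingerprint, assumed here to be an
    (etag, size) tuple. Keys present only in previous (deletions) are
    ignored, matching the additions-and-updates scope of the auto-sync path.
    """
    diff = Diff()
    for key in sorted(current_files):
        if key not in previous:
            diff.new.append(key)
        elif previous[key] != current_files[key]:
            diff.changed.append(key)
        else:
            diff.unchanged.append(key)
    return diff


if __name__ == "__main__":
    previous = {"a.pdf": ("e1", 10), "b.pdf": ("e2", 20)}
    current = {"a.pdf": ("e1", 10), "b.pdf": ("e3", 21), "c.pdf": ("e4", 5)}
    print(compute_diff(current, previous))
```

&lt;p&gt;Keeping the classification a pure function of two dictionaries makes it straightforward to cover with property-based tests, independent of S3 and DynamoDB.&lt;/p&gt;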



&lt;p&gt;Enable with a single context parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx cdk deploy &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;enableKbAutoSync&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Capacity Guardrails
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;The FSx ONTAP operations automation (volume resize, snapshot management) can be dangerous if triggered too frequently — especially during incidents where monitoring alerts cascade.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;A guardrails module that enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-action rate limit&lt;/strong&gt;: Max N executions per action per time window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily cap&lt;/strong&gt;: Maximum total operations per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooldown&lt;/strong&gt;: Minimum interval between consecutive executions of the same action
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@with_guardrails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume_resize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_per_hour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daily_cap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cooldown_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resize_volume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_size_gb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Only executes if guardrails pass
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State is tracked in DynamoDB with TTL-based cleanup. The &lt;code&gt;update_item&lt;/code&gt; call uses a &lt;code&gt;ConditionExpression&lt;/code&gt; (&lt;code&gt;attribute_not_exists(action_count) OR action_count &amp;lt; :max_actions&lt;/code&gt;) so that concurrent requests cannot bypass the daily cap. Concurrent resize requests still succeed while the action count is below the configured cap, but the conditional update rejects any request that would push the count past it. CloudWatch metrics expose guardrail rejections for operational visibility.&lt;/p&gt;
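&lt;p&gt;To see why a conditional increment holds the cap under concurrency, the sketch below models a single DynamoDB item in memory. It is an illustration only: &lt;code&gt;DailyCounter&lt;/code&gt;, &lt;code&gt;try_reserve&lt;/code&gt;, and &lt;code&gt;ConditionalCheckFailed&lt;/code&gt; are hypothetical names, and a lock stands in for DynamoDB's atomic per-item conditional write.&lt;/p&gt;

```python
import threading


class ConditionalCheckFailed(Exception):
    """Stands in for DynamoDB's ConditionalCheckFailedException."""


class DailyCounter:
    """In-memory stand-in for one DynamoDB item holding action_count."""

    def __init__(self):
        self._lock = threading.Lock()  # models the atomic per-item write
        self.action_count = None       # None models attribute_not_exists

    def increment_if_under(self, max_actions):
        """Check the condition and increment in one atomic step."""
        with self._lock:
            if self.action_count is not None and self.action_count >= max_actions:
                raise ConditionalCheckFailed()
            self.action_count = (self.action_count or 0) + 1
            return self.action_count


def try_reserve(counter, max_actions):
    """Return True when the action may run, False when the cap rejects it."""
    try:
        counter.increment_if_under(max_actions)
        return True
    except ConditionalCheckFailed:
        return False


if __name__ == "__main__":
    counter = DailyCounter()
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(try_reserve(counter, 10)))
        for _ in range(25)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results.count(True), "accepted;", results.count(False), "rejected")
```

&lt;p&gt;The key design choice is that the check and the increment happen in one step. A read followed by a separate write would let two requests both observe a count just under the cap and both proceed.&lt;/p&gt;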




&lt;h2&gt;
  
  
  5. Voice Chat WebRTC (Phase 2)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Knowledge workers often want to ask questions hands-free — during meetings, while reviewing physical documents, or when multitasking.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;A Strategy pattern implementation supporting both REST-based (Phase 1) and WebRTC-based (Phase 2) voice interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;VoiceSessionStrategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;disconnect&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;sendAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;ArrayBuffer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;onTranscript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 2 uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Kinesis Video Streams&lt;/strong&gt; Signaling Channel for WebRTC negotiation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipecat Voice Agent&lt;/strong&gt; on Bedrock AgentCore Runtime for speech-to-text-to-RAG-to-speech&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fallback&lt;/strong&gt;: If the WebRTC connection fails, the session falls back to REST-based voice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 2 implements the client/server strategy and fallback behavior; full AgentCore Runtime deployment automation remains in What's Next.&lt;/p&gt;

&lt;p&gt;The WebRTC path is implemented behind the existing voice strategy interface, but production deployments should add authentication, rate limiting, CORS tightening, sanitized logging, and input validation around the signaling and session launch APIs — as noted in the &lt;a href="https://github.com/pipecat-ai/pipecat-examples/tree/main/deployment/aws-agentcore-webrtc-kvs" rel="noopener noreferrer"&gt;Pipecat AgentCore WebRTC KVS example&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing Strategy
&lt;/h2&gt;

&lt;p&gt;Each feature is covered by automated tests. Counts by category:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CDK Assertion&lt;/td&gt;
&lt;td&gt;Jest + aws-cdk-lib/assertions&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python Lambda Unit&lt;/td&gt;
&lt;td&gt;pytest + moto&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property-Based&lt;/td&gt;
&lt;td&gt;Hypothesis (Python)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property-Based&lt;/td&gt;
&lt;td&gt;fast-check (TypeScript)&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice WebRTC&lt;/td&gt;
&lt;td&gt;Jest&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart Routing&lt;/td&gt;
&lt;td&gt;Jest + fast-check&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Hypothesis property-based tests verify invariants like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change detection correctly classifies new/changed/unchanged files for any input combination&lt;/li&gt;
&lt;li&gt;Ingestion deduplication logic is correct for all (changes × job_status) combinations&lt;/li&gt;
&lt;li&gt;Metadata JSON always conforms to the required schema regardless of input permissions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security &amp;amp; Portability
&lt;/h2&gt;

&lt;p&gt;Before publishing, we ensured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No hardcoded AWS account IDs&lt;/strong&gt; in any public source file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameterized ECR repository name&lt;/strong&gt; (&lt;code&gt;ecrRepositoryName&lt;/code&gt; CDK prop)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameterized REGION&lt;/strong&gt; in all shell scripts (&lt;code&gt;${AWS_REGION:-ap-northeast-1}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Masked screenshots&lt;/strong&gt; — AWS account IDs in console screenshots are covered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.gitignore&lt;/code&gt; coverage&lt;/strong&gt; — &lt;code&gt;cdk.context.json&lt;/code&gt;, &lt;code&gt;cdk.out/&lt;/code&gt;, &lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;.hypothesis/&lt;/code&gt; all excluded&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Runtime deployment&lt;/strong&gt; for the Pipecat Voice Agent (currently requires CLI — CloudFormation support pending)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail/EventBridge mode&lt;/strong&gt; for Transfer Family ingestion (near-real-time event-based detection instead of 5-minute polling)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end SFTP upload test&lt;/strong&gt; with actual SSH keys and partner simulation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  End-to-End Architecture Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     ┌─────────────────┐     ┌──────────────────────────┐
│ External     │     │ Transfer Family │     │ FSx for ONTAP            │
│ Partner      │────▶│ SFTP Server     │────▶│ S3 Access Point          │
│ (SFTP)       │     └─────────────────┘     │ (data stays on FSxN)     │
└──────────────┘                              └────────────┬─────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Metadata Generator Lambda   │
                                            │ (admin-managed permissions) │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ KB Auto-Sync / Ingestion    │
                                            │ Trigger Lambda              │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Amazon Bedrock              │
                                            │ Knowledge Base              │
                                            └──────────────┬──────────────┘
                                                           │
┌──────────────┐     ┌─────────────────┐     ┌────────────▼─────────────┐
│ End User     │────▶│ Smart Routing   │────▶│ Permission-Aware RAG     │
│ (Chat/Voice) │     │ (Haiku/Sonnet/  │     │ (fail-closed: missing    │
└──────────────┘     │  Opus)          │     │  metadata = excluded)    │
                     └─────────────────┘     └──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RAG retrieval path is designed to fail closed: if a document's permission metadata is missing, malformed, or unverifiable, the document is excluded from retrieval results rather than exposed broadly. This is the core safety boundary of the permission-aware RAG design: a document without trusted metadata is treated as not retrievable.&lt;/p&gt;
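&lt;p&gt;A fail-closed filter can be sketched in a few lines. The version below is illustrative and makes two assumptions that are not taken from the repository: a user matches when any of their SIDs, UIDs, or GIDs appears in the corresponding list, and each retrieved result carries its metadata under a &lt;code&gt;metadata&lt;/code&gt; key. Verifying that the metadata is trusted (for example, written by the service role) is out of scope here.&lt;/p&gt;

```python
REQUIRED_LIST_FIELDS = ("allowed_sids", "allowed_uids", "allowed_gids")


def is_valid_metadata(metadata):
    """True only when every permission field is a list of strings."""
    if not isinstance(metadata, dict):
        return False
    for name in REQUIRED_LIST_FIELDS:
        values = metadata.get(name)
        if not isinstance(values, list):
            return False
        if not all(isinstance(v, str) for v in values):
            return False
    return True


def is_retrievable(metadata, user_sids, user_uids, user_gids):
    """Fail closed: anything other than valid, matching metadata is denied."""
    if not is_valid_metadata(metadata):
        return False
    return bool(
        set(metadata["allowed_sids"]).intersection(user_sids)
        or set(metadata["allowed_uids"]).intersection(user_uids)
        or set(metadata["allowed_gids"]).intersection(user_gids)
    )


def filter_results(results, user_sids=(), user_uids=(), user_gids=()):
    """Keep only the documents the caller is allowed to see."""
    return [
        doc for doc in results
        if is_retrievable(doc.get("metadata"), user_sids, user_uids, user_gids)
    ]


if __name__ == "__main__":
    docs = [
        {"id": "ok", "metadata": {"allowed_sids": ["S-1"], "allowed_uids": [], "allowed_gids": []}},
        {"id": "missing"},
        {"id": "malformed", "metadata": {"allowed_sids": "S-1"}},
    ]
    print([d["id"] for d in filter_results(docs, user_sids=["S-1"])])
```

&lt;p&gt;Every early return in the sketch is a denial, so a new failure mode added later defaults to exclusion unless it is explicitly allowed.&lt;/p&gt;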




&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;p&gt;v4.2 is production-oriented, but a few items remain follow-up work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KB Auto-Sync&lt;/strong&gt; currently updates inventory when &lt;code&gt;StartIngestionJob&lt;/code&gt; is accepted rather than when the job reaches &lt;code&gt;SUCCEEDED&lt;/code&gt;. Failed ingestion jobs may mask unprocessed changes until the pending/commit model is implemented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer Family ingestion&lt;/strong&gt; is implemented and unit-tested; full partner-style E2E validation with SSH keys is still planned. The current auto-sync path focuses on detecting additions and updates — delete reconciliation is follow-up work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Runtime&lt;/strong&gt; deployment automation is not yet CloudFormation-based; the Pipecat Voice Agent requires CLI/SDK deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice sessions&lt;/strong&gt; require production policies for authentication, rate limiting, transcript retention, and sanitized logging before production rollout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Routing&lt;/strong&gt; emits routing metrics, but monthly cost dashboards, budget enforcement, and savings-vs-baseline reporting are follow-up work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail-closed enforcement&lt;/strong&gt; happens in the retrieval filtering layer: documents without valid, trusted permission metadata are excluded before the model receives context. Audit events for retrieval decisions (&lt;code&gt;DocumentSuppressedByPermission&lt;/code&gt;) are candidates for the next release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual selection of a high-cost or preview model (such as GPT-5.5) should be governed by application-level authorization and audited separately from automatic routing. The networking model (public Transfer Family endpoint vs VPC-hosted endpoint, partner IP allowlists, and private DNS requirements) should be chosen per customer environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Should Care About v4.2?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI platform teams&lt;/strong&gt; get model routing that balances quality and cost without manual intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt; get administrator-derived permission metadata and explicit IAM protection against metadata overwrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data teams&lt;/strong&gt; get automatic KB synchronization from FSx for ONTAP through S3 Access Points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partners and SIs&lt;/strong&gt; get an SFTP-to-RAG ingestion path for customers who exchange documents with external organizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations teams&lt;/strong&gt; get guardrails for FSx ONTAP automation actions with conditional write protection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application teams&lt;/strong&gt; get a WebRTC voice strategy with REST fallback.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;v4.2 moves the permission-aware RAG system from a secure document Q&amp;amp;A application toward an enterprise ingestion and interaction platform.&lt;/p&gt;

&lt;p&gt;Smart Routing reduces model cost without removing access to stronger models. Transfer Family ingestion lets partners keep using SFTP while documents land directly on FSx for ONTAP through S3 Access Points. KB Auto-Sync keeps Bedrock Knowledge Bases fresh, Capacity Guardrails make ONTAP automation safer, and WebRTC Voice Chat opens a lower-friction interaction path.&lt;/p&gt;

&lt;p&gt;The common theme is the same as the FSx for ONTAP S3 Access Points pattern series: keep enterprise file data on FSx for ONTAP, expose it safely through S3-compatible access paths, and automate around it with serverless and managed AWS services.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG" rel="noopener noreferrer"&gt;FSx-for-ONTAP-Agentic-Access-Aware-RAG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG/releases/tag/v4.2.0" rel="noopener noreferrer"&gt;v4.2.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related series&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;FSx for ONTAP S3 Access Points Serverless Patterns&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Blog&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/blogs/storage/secure-sftp-file-sharing-with-aws-transfer-family-amazon-fsx-for-netapp-ontap-and-s3-access-points/" rel="noopener noreferrer"&gt;Secure SFTP file sharing with AWS Transfer Family, Amazon FSx for NetApp ONTAP, and S3 Access Points&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Docs&lt;/strong&gt;: &lt;a href="https://docs.aws.amazon.com/transfer/latest/userguide/fsx-s3-access-points.html" rel="noopener noreferrer"&gt;Access your FSx for NetApp ONTAP file systems with Transfer Family&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>amazonfsxfornetappontap</category>
      <category>serverless</category>
      <category>amazonbedrock</category>
    </item>
    <item>
      <title>Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Thu, 14 May 2026 12:23:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/smart-routing-transfer-family-ingestion-and-voice-chat-permission-aware-rag-v42-3495</link>
      <guid>https://dev.to/aws-builders/smart-routing-transfer-family-ingestion-and-voice-chat-permission-aware-rag-v42-3495</guid>
      <description>&lt;h2&gt;
  
  
  What This Post Covers
&lt;/h2&gt;

&lt;p&gt;This is a companion article to the &lt;a href="https://dev.to/yoshikifujiwara/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;FSx for ONTAP S3 Access Points Serverless Patterns&lt;/a&gt; series. While that series focuses on serverless patterns for FSx for ONTAP S3 Access Points across industries, this post covers the &lt;strong&gt;v4.2 release&lt;/strong&gt; of the &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG" rel="noopener noreferrer"&gt;Agentic Access-Aware RAG&lt;/a&gt; system, a permission-aware RAG application built on FSx for ONTAP + Amazon Bedrock. The system is production-grade in the sense of CI coverage, permission filtering, guardrails, and deployment parameterization, although some v4.2 features still have follow-up E2E items listed in What's Next.&lt;/p&gt;

&lt;p&gt;The v4.2 release adds five features that address real-world enterprise needs: intelligent model routing for cost optimization, SFTP-based document ingestion for partners who can't use web UIs, automatic KB synchronization, operational guardrails for FSx ONTAP automation, and voice-based interaction via WebRTC.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Smart Routing Model Expansion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Enterprise RAG workloads have wildly different complexity levels. A simple "What's the office address?" query doesn't need the same model as "Analyze the Q4 financial report across all subsidiaries and identify cost reduction opportunities." Routing everything through a single model either wastes money or delivers poor quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: 3-Tier Automatic Routing
&lt;/h3&gt;

&lt;p&gt;The default routing tiers are configured for the model set currently enabled in this deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt; (greetings, factual lookups) → Claude Haiku 4.5 (&lt;code&gt;anthropic.claude-haiku-4-5-20251001-v1:0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex&lt;/strong&gt; (analysis, comparison, summarization) → Claude 3.5 Sonnet v2 (&lt;code&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-context&lt;/strong&gt; (multi-document reasoning, financial analysis) → Claude Opus 4 (&lt;code&gt;anthropic.claude-opus-4-20250514-v1:0&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact model IDs are deployment parameters (&lt;code&gt;lightweightModelId&lt;/code&gt;, &lt;code&gt;powerfulModelId&lt;/code&gt;, &lt;code&gt;heavyModelId&lt;/code&gt;), so teams can update to newer Sonnet/Opus releases without changing the routing logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                  User Query                          │
└──────────────────────┬──────────────────────────────┘
                       │
              ┌────────▼────────┐
              │  Complexity     │
              │  Classifier     │
              └───┬────┬────┬───┘
                  │    │    │
         Simple   │    │    │  Full-context
                  ▼    ▼    ▼
        ┌──────┐ ┌──────┐ ┌──────┐
        │Haiku │ │Sonnet│ │ Opus │
        │ 4.5  │ │3.5 v2│ │  4   │
        └──────┘ └──────┘ └──────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost labels below are illustrative per-query estimates for typical RAG prompts (~1K input tokens, ~500 output tokens) in this deployment, not fixed model prices. Actual cost depends on input/output tokens, prompt caching, region, and inference configuration.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Illustrative per-query cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;~$0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 3.5 v2&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4&lt;/td&gt;
&lt;td&gt;~$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Additionally, GPT-5.5 can be exposed as a manual selection option when OpenAI models on Amazon Bedrock are enabled for the account. In this deployment, the manual route is parameterized as &lt;code&gt;openai.gpt-5-5&lt;/code&gt;, but teams should verify the exact model ID, Region availability, inference profile, and preview access status in their own AWS account.&lt;/p&gt;

&lt;p&gt;If the selected model is unavailable or throttled, the router falls back to the next configured tier and emits a &lt;code&gt;RoutingFallback&lt;/code&gt; metric.&lt;/p&gt;
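&lt;p&gt;One way to read that fallback rule is sketched below in Python. The router in the repository is TypeScript, so this is an illustration of the control flow rather than its code. The fallback direction (toward the next more capable tier), the &lt;code&gt;ModelUnavailable&lt;/code&gt; exception, and the &lt;code&gt;FromTier&lt;/code&gt;/&lt;code&gt;ToTier&lt;/code&gt; dimensions are all assumptions.&lt;/p&gt;

```python
TIER_ORDER = ["simple", "complex", "full-context"]  # assumed fallback order


class ModelUnavailable(Exception):
    """Stands in for throttling or availability errors from the model call."""


def invoke_with_fallback(tier, model_ids, invoke, emit_metric):
    """Try the selected tier, then each later tier in TIER_ORDER.

    model_ids maps tier name to model ID, invoke(model_id) returns the answer
    or raises ModelUnavailable, and emit_metric(name, dimensions) records a
    metric. Returns a (tier_used, answer) tuple.
    """
    candidates = TIER_ORDER[TIER_ORDER.index(tier):]
    for position, candidate in enumerate(candidates):
        try:
            return candidate, invoke(model_ids[candidate])
        except ModelUnavailable:
            if position == len(candidates) - 1:
                raise  # no tier left to fall back to
            emit_metric(
                "RoutingFallback",
                {"FromTier": candidate, "ToTier": candidates[position + 1]},
            )


if __name__ == "__main__":
    def fake_invoke(model_id):
        if model_id == "lightweight":
            raise ModelUnavailable()
        return "answer from " + model_id

    ids = {"simple": "lightweight", "complex": "powerful", "full-context": "heavy"}
    print(invoke_with_fallback("simple", ids, fake_invoke, lambda n, d: print(n, d)))
```

&lt;p&gt;Passing &lt;code&gt;invoke&lt;/code&gt; and &lt;code&gt;emit_metric&lt;/code&gt; as callables keeps the fallback logic testable without calling Bedrock or CloudWatch.&lt;/p&gt;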

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;The classifier analyzes query characteristics — keyword count, presence of analytical terms, document references, context size — and routes to the appropriate tier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// complexity-classifier.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;classifyQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;contextSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ClassificationResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractFeatures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isGreeting&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wordCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hasAnalyticalTerms&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;contextSize&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;full-context&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CloudWatch Embedded Metric Format (EMF) metrics record each routing decision, which enables cost analysis and monitoring of the route distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SmartRouting&lt;/span&gt;
&lt;span class="py"&gt;Metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoutingCount&lt;/span&gt;
&lt;span class="py"&gt;Dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoutingTier (simple | complex | full-context | manual)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
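
&lt;p&gt;As a rough illustration of how such a metric could be emitted from a Python Lambda, the sketch below builds one EMF log line for the namespace, metric, and dimension listed above. The function name &lt;code&gt;build_routing_metric&lt;/code&gt; and the tier validation are assumptions for this example, not code from the repository.&lt;/p&gt;

```python
import json
import time
from typing import Optional

# Tier names taken from the RoutingTier dimension listed above.
VALID_TIERS = {"simple", "complex", "full-context", "manual"}


def build_routing_metric(tier: str, now_ms: Optional[int] = None) -> str:
    """Return one EMF log line that counts a single routing decision.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown routing tier: {tier}")
    record = {
        "_aws": {
            # EMF expects the timestamp in milliseconds since the epoch.
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "SmartRouting",
                    "Dimensions": [["RoutingTier"]],
                    "Metrics": [{"Name": "RoutingCount", "Unit": "Count"}],
                }
            ],
        },
        # Dimension value and metric value live at the top level of the record.
        "RoutingTier": tier,
        "RoutingCount": 1,
    }
    return json.dumps(record)


if __name__ == "__main__":
    # In Lambda, writing the line to stdout is enough for CloudWatch to pick it up.
    print(build_routing_metric("simple"))
```

&lt;p&gt;Writing the metric as a structured log line keeps the routing path free of synchronous CloudWatch API calls.&lt;/p&gt;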






&lt;h2&gt;
  
  
  2. Transfer Family FSx ONTAP Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Many enterprise partners — law firms, auditors, regulatory bodies — exchange documents via SFTP. They won't adopt a web UI. But their documents still need to flow into the RAG knowledge base with proper permission metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites and Limits
&lt;/h3&gt;

&lt;p&gt;This pattern assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FSx for ONTAP is running &lt;strong&gt;ONTAP 9.17.1 or later&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The FSx file system and S3 Access Point are in the &lt;strong&gt;same AWS Region&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;same AWS account&lt;/strong&gt; owns the file system and access point&lt;/li&gt;
&lt;li&gt;Transfer Family file operations follow the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;FSx S3 Access Point compatibility limits&lt;/a&gt;, including the &lt;strong&gt;5 GB upload limit&lt;/strong&gt; and unsupported rename/append operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Solution: SFTP → S3 Access Point → Bedrock KB
&lt;/h3&gt;

&lt;p&gt;This feature bridges AWS Transfer Family with the existing permission-aware RAG pipeline. The architecture aligns with the approach described in the &lt;a href="https://aws.amazon.com/blogs/storage/secure-sftp-file-sharing-with-aws-transfer-family-amazon-fsx-for-netapp-ontap-and-s3-access-points/" rel="noopener noreferrer"&gt;AWS Storage Blog&lt;/a&gt; — internal users access data via SMB/NFS, while external partners use SFTP, all reading/writing to the same FSx for ONTAP file system through S3 Access Points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Partner │     │ Transfer Family │     │ FSx ONTAP        │
│  (SFTP)  │────▶│ SFTP Server     │────▶│ S3 Access Point  │
└──────────┘     └─────────────────┘     └────────┬─────────┘
                                                   │
                                    ┌──────────────▼──────────────┐
                                    │  EventBridge Scheduler      │
                                    │  (5-min polling)            │
                                    └──────────────┬──────────────┘
                                                   │
                              ┌─────────────────────▼─────────────────────┐
                              │         Ingestion Trigger Lambda           │
                              │  • ListObjectsV2 → detect changes         │
                              │  • Invoke Metadata Generator (async)       │
                              │  • StartIngestionJob (deduplicated)        │
                              └─────────────────────┬─────────────────────┘
                                                    │
                    ┌───────────────────────────────┬┘
                    ▼                               ▼
        ┌───────────────────┐          ┌────────────────────┐
        │ Metadata Generator│          │ Bedrock KB         │
        │ (.metadata.json)  │          │ StartIngestionJob  │
        └───────────────────┘          └────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This remains a polling-based sync path; an event-based CloudTrail/EventBridge mode is listed in What's Next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Design Decisions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. HomeDirectoryMappings uses S3 AP Alias, not ARN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/transfer/latest/userguide/fsx-s3-access-points.html" rel="noopener noreferrer"&gt;Transfer Family documentation&lt;/a&gt; explains that FSx-backed Transfer Family access uses S3 Access Point aliases, but the failure mode is not obvious: using the full ARN in &lt;code&gt;HomeDirectoryMappings.Target&lt;/code&gt; produced cryptic access-denied errors in my deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Correct: use alias (e.g., "my-ap-ext-s3alias")&lt;/span&gt;
&lt;span class="nx"&gt;homeDirectoryMappings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;s3AccessPointAlias&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/uploads/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Deduplication via IN_PROGRESS check&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before triggering &lt;code&gt;StartIngestionJob&lt;/code&gt;, the Lambda checks if a job is already running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_trigger_ingestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_job_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;has_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_job_status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IN_PROGRESS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
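
&lt;p&gt;The function above takes the current job status as an input. One way that status could be obtained is sketched below, assuming a boto3 &lt;code&gt;bedrock-agent&lt;/code&gt; client is passed in and that looking at the most recently started job is sufficient. The helper name &lt;code&gt;get_current_job_status&lt;/code&gt; is hypothetical, and the repository may fetch the status differently.&lt;/p&gt;

```python
from typing import Any, Optional


def get_current_job_status(client: Any, kb_id: str, ds_id: str) -> Optional[str]:
    """Return the status of the most recently started ingestion job, or None.

    Hypothetical helper. `client` is expected to behave like
    boto3.client("bedrock-agent"); it is injected so the logic can be
    exercised with a stub.
    """
    response = client.list_ingestion_jobs(
        knowledgeBaseId=kb_id,
        dataSourceId=ds_id,
        sortBy={"attribute": "STARTED_AT", "order": "DESCENDING"},
        maxResults=1,
    )
    summaries = response.get("ingestionJobSummaries", [])
    return summaries[0]["status"] if summaries else None


def should_trigger_ingestion(has_changes: bool, current_job_status: Optional[str]) -> bool:
    """Same decision logic as the function shown in the article."""
    if not has_changes:
        return False
    if current_job_status == "IN_PROGRESS":
        return False
    return True
```

&lt;p&gt;Keeping the status lookup separate from the decision function leaves the decision pure, which is what makes it easy to cover with property-based tests.&lt;/p&gt;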



&lt;p&gt;&lt;strong&gt;3. Permission metadata auto-generation and trust boundary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new file is detected without a corresponding &lt;code&gt;.metadata.json&lt;/code&gt;, the Metadata Generator Lambda creates one based on the SFTP user's permission mapping in DynamoDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_sids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"S-1-5-21-xxx-1001"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_uids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"1001"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_gids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"1001"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer-family"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uploaded_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"partner-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uploaded_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-14T10:30:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SFTP user does not supply permission metadata directly. The Metadata Generator derives it from an &lt;strong&gt;administrator-managed DynamoDB mapping&lt;/strong&gt; and writes &lt;code&gt;.metadata.json&lt;/code&gt; using a service role. Partner upload roles are scoped to their home directory (&lt;code&gt;/uploads/{userName}/*&lt;/code&gt;).&lt;/p&gt;
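
&lt;p&gt;A minimal sketch of that derivation step follows. It assumes the DynamoDB mapping has already been loaded into a dictionary keyed by SFTP user name, and that the sidecar file is named by appending &lt;code&gt;.metadata.json&lt;/code&gt; to the object key. Both the mapping shape and the helper names are assumptions for illustration.&lt;/p&gt;

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_permission_metadata(
    mapping: Dict[str, Dict[str, Any]],
    user_name: str,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build the metadata document for a file uploaded by `user_name`.

    Hypothetical helper. `mapping` stands in for the administrator-managed
    DynamoDB table. An unknown user raises KeyError, so no metadata is
    written and the document stays excluded from retrieval.
    """
    entry = mapping[user_name]
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "allowed_sids": list(entry["allowed_sids"]),
        "allowed_uids": list(entry["allowed_uids"]),
        "allowed_gids": list(entry["allowed_gids"]),
        "source": "transfer-family",
        "uploaded_by": user_name,
        "uploaded_at": stamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def metadata_key(object_key: str) -> str:
    """Return the sidecar key for an uploaded object (assumed naming scheme)."""
    return f"{object_key}.metadata.json"
```

&lt;p&gt;Nothing in the output comes from the uploaded file itself, which keeps the trust boundary on the administrator-managed side.&lt;/p&gt;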

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Security note&lt;/strong&gt;: The SFTP user's IAM role includes an explicit &lt;code&gt;Deny&lt;/code&gt; statement for &lt;code&gt;s3:PutObject&lt;/code&gt; and &lt;code&gt;s3:DeleteObject&lt;/code&gt; on &lt;code&gt;*.metadata.json&lt;/code&gt; keys within their home directory. This prevents partners from overwriting permission metadata generated by the service role.&lt;/p&gt;
&lt;/blockquote&gt;
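
&lt;p&gt;The statement described in the security note could look roughly like the one generated below. This is a sketch, not the policy from the repository: the &lt;code&gt;Sid&lt;/code&gt;, the helper name, and the use of the access point object ARN form (&lt;code&gt;.../accesspoint/NAME/object/KEY&lt;/code&gt;) as the resource are assumptions.&lt;/p&gt;

```python
import json
from typing import Any, Dict


def metadata_deny_statement(access_point_arn: str, user_name: str) -> Dict[str, Any]:
    """Return an IAM statement that blocks writes and deletes of metadata files.

    Hypothetical helper. The resource pattern assumes objects are addressed
    through the access point ARN followed by /object/ and the key.
    """
    return {
        "Sid": "DenyPartnerMetadataWrites",
        "Effect": "Deny",
        "Action": ["s3:PutObject", "s3:DeleteObject"],
        "Resource": (
            f"{access_point_arn}/object/uploads/{user_name}/*.metadata.json"
        ),
    }


if __name__ == "__main__":
    arn = "arn:aws:s3:ap-northeast-1:ACCOUNT:accesspoint/my-ap"
    print(json.dumps(metadata_deny_statement(arn, "partner-a"), indent=2))
```

&lt;p&gt;Because an explicit &lt;code&gt;Deny&lt;/code&gt; overrides any &lt;code&gt;Allow&lt;/code&gt; in the same role, the home-directory write permission can stay broad while metadata keys remain protected.&lt;/p&gt;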

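&lt;p&gt;Because the generated &lt;code&gt;.metadata.json&lt;/code&gt; carries the permission fields that the retrieval filter reads, SFTP uploads flow through the existing permission-filtering RAG pipeline without changes to it.&lt;/p&gt;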

&lt;h3&gt;
  
  
  CDK Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx cdk deploy &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;enableTransferFamily&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;s3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:ACCOUNT:accesspoint/my-ap"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;transferFamilyS3ApAlias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-ap-ext-s3alias"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. KB Auto-Sync
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Documents on FSx for ONTAP change continuously — new files added, existing files updated. Without automatic synchronization, the Bedrock Knowledge Base becomes stale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;A lightweight Lambda (Python 3.12) polls the S3 Access Point every 5 minutes, compares the listing against a DynamoDB inventory, and triggers &lt;code&gt;StartIngestionJob&lt;/code&gt; only when changes are detected. The inventory is updated once &lt;code&gt;StartIngestionJob&lt;/code&gt; is accepted (i.e., a &lt;code&gt;job_id&lt;/code&gt; is returned). A future enhancement will move this to a pending/commit model so that ingestion jobs that fail after starting do not hide changes from the next scan. The current flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Scan → Diff → Start job → Update inventory (on job accepted)
&lt;/span&gt;&lt;span class="n"&gt;current_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scan_s3_access_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_ap_arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;previous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;has_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trigger_ingestion_if_needed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kb_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ds_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Inventory updated after StartIngestionJob is accepted.
&lt;/span&gt;        &lt;span class="c1"&gt;# Future: move to pending/commit model keyed on job SUCCEEDED.
&lt;/span&gt;        &lt;span class="nf"&gt;update_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
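
&lt;p&gt;The diff step can be sketched as follows, assuming both the scan result and the inventory are dictionaries that map an object key to a fingerprint such as an ETag. The &lt;code&gt;Diff&lt;/code&gt; dataclass and this &lt;code&gt;compute_diff&lt;/code&gt; body are illustrative assumptions; only the function name and the &lt;code&gt;has_changes&lt;/code&gt; attribute come from the snippet above.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Diff:
    """Classification of scanned keys (illustrative structure)."""

    new: List[str] = field(default_factory=list)
    changed: List[str] = field(default_factory=list)
    unchanged: List[str] = field(default_factory=list)

    @property
    def has_changes(self) -> bool:
        # Only additions and updates count; deletions are not reconciled here.
        return bool(self.new or self.changed)


def compute_diff(current_files: Dict[str, str], previous: Dict[str, str]) -> Diff:
    """Compare key -> fingerprint maps from the scan and the inventory."""
    diff = Diff()
    for key in sorted(current_files):
        if key not in previous:
            diff.new.append(key)
        elif previous[key] != current_files[key]:
            diff.changed.append(key)
        else:
            diff.unchanged.append(key)
    return diff
```

&lt;p&gt;Keys that exist only in the inventory are ignored in this sketch, which mirrors the article's statement that delete reconciliation is follow-up work.&lt;/p&gt;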



&lt;p&gt;Enable with a single context parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx cdk deploy &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;enableKbAutoSync&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Capacity Guardrails
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;The FSx ONTAP operations automation (volume resize, snapshot management) can be dangerous if triggered too frequently — especially during incidents where monitoring alerts cascade.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;A guardrails module enforces three limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-action rate limit&lt;/strong&gt;: Max N executions per action per time window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily cap&lt;/strong&gt;: Maximum total operations per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooldown&lt;/strong&gt;: Minimum interval between consecutive executions of the same action
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@with_guardrails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume_resize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_per_hour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daily_cap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cooldown_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resize_volume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_size_gb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Only executes if guardrails pass
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State is tracked in DynamoDB with TTL-based cleanup. The &lt;code&gt;update_item&lt;/code&gt; call uses a &lt;code&gt;ConditionExpression&lt;/code&gt; (&lt;code&gt;attribute_not_exists(action_count) OR action_count &amp;lt; :max_actions&lt;/code&gt;) to prevent concurrent requests from bypassing the daily cap. Concurrent requests can still succeed while the day's action count remains under the configured cap, but the conditional update prevents them from collectively exceeding it. CloudWatch metrics expose guardrail rejections for operational visibility.&lt;/p&gt;
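
&lt;p&gt;A minimal sketch of the conditional counter update is shown below. It assumes a boto3 DynamoDB &lt;code&gt;Table&lt;/code&gt; resource is passed in; the key layout (&lt;code&gt;pk&lt;/code&gt;), the TTL attribute name (&lt;code&gt;expires_at&lt;/code&gt;), and the helper name are hypothetical. The condition is the one quoted above with its operands swapped, which is equivalent.&lt;/p&gt;

```python
from typing import Any


def try_reserve_daily_slot(
    table: Any, action_name: str, day: str, max_actions: int, expires_at: int
) -> bool:
    """Atomically count one execution; return False once the daily cap is hit.

    Hypothetical helper. `table` is expected to behave like a boto3
    DynamoDB Table resource; key and attribute names are illustrative.
    """
    try:
        table.update_item(
            Key={"pk": f"{action_name}#{day}"},
            UpdateExpression="SET expires_at = :ttl ADD action_count :one",
            # Same check as in the article, written as ":max > count".
            ConditionExpression=(
                "attribute_not_exists(action_count) OR :max_actions > action_count"
            ),
            ExpressionAttributeValues={
                ":one": 1,
                ":ttl": expires_at,
                ":max_actions": max_actions,
            },
        )
        return True
    except Exception as exc:
        # botocore's ClientError carries the error code in exc.response.
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False
        raise
```

&lt;p&gt;Because the increment and the check happen in one request, two concurrent callers cannot both take the last remaining slot.&lt;/p&gt;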




&lt;h2&gt;
  
  
  5. Voice Chat WebRTC (Phase 2)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Knowledge workers often want to ask questions hands-free — during meetings, while reviewing physical documents, or when multitasking.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;A Strategy pattern implementation supports both REST-based (Phase 1) and WebRTC-based (Phase 2) voice interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;VoiceSessionStrategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;disconnect&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;sendAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;ArrayBuffer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;onTranscript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 2 uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Kinesis Video Streams&lt;/strong&gt; Signaling Channel for WebRTC negotiation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipecat Voice Agent&lt;/strong&gt; on Bedrock AgentCore Runtime for speech-to-text-to-RAG-to-speech&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fallback&lt;/strong&gt;: If the WebRTC connection fails, the client falls back to REST-based voice&lt;/li&gt;
&lt;/ul&gt;
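
&lt;p&gt;The fallback behavior can be illustrated with a language-neutral sketch, written here in Python rather than the client's TypeScript. It assumes strategies are tried in order of preference with a connection timeout; the function &lt;code&gt;connect_with_fallback&lt;/code&gt; and the timeout value are hypothetical and not taken from the repository.&lt;/p&gt;

```python
import asyncio
from typing import Optional, Protocol, Sequence


class VoiceSessionStrategy(Protocol):
    """Subset of the strategy interface needed for connection fallback."""

    name: str

    async def connect(self) -> None: ...


async def connect_with_fallback(
    strategies: Sequence[VoiceSessionStrategy], timeout_seconds: float = 5.0
) -> VoiceSessionStrategy:
    """Try each strategy in order and return the first one that connects."""
    last_error: Optional[BaseException] = None
    for strategy in strategies:
        try:
            await asyncio.wait_for(strategy.connect(), timeout=timeout_seconds)
            return strategy
        except Exception as exc:  # includes timeouts; cancellation still propagates
            last_error = exc
    raise ConnectionError("no voice strategy could connect") from last_error
```

&lt;p&gt;Passing &lt;code&gt;[webrtc, rest]&lt;/code&gt; in that order gives the WebRTC path priority while keeping REST available when signaling or media setup fails.&lt;/p&gt;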

&lt;p&gt;Phase 2 implements the client/server strategy and fallback behavior; full AgentCore Runtime deployment automation remains in What's Next.&lt;/p&gt;

&lt;p&gt;The WebRTC path is implemented behind the existing voice strategy interface, but production deployments should add authentication, rate limiting, CORS tightening, sanitized logging, and input validation around the signaling and session launch APIs — as noted in the &lt;a href="https://github.com/pipecat-ai/pipecat-examples/tree/main/deployment/aws-agentcore-webrtc-kvs" rel="noopener noreferrer"&gt;Pipecat AgentCore WebRTC KVS example&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing Strategy
&lt;/h2&gt;

&lt;p&gt;All features are backed by automated tests, from CDK assertions to property-based checks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CDK Assertion&lt;/td&gt;
&lt;td&gt;Jest + aws-cdk-lib/assertions&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python Lambda Unit&lt;/td&gt;
&lt;td&gt;pytest + moto&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property-Based&lt;/td&gt;
&lt;td&gt;Hypothesis (Python)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property-Based&lt;/td&gt;
&lt;td&gt;fast-check (TypeScript)&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice WebRTC&lt;/td&gt;
&lt;td&gt;Jest&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart Routing&lt;/td&gt;
&lt;td&gt;Jest + fast-check&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Hypothesis property-based tests verify invariants like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change detection correctly classifies new/changed/unchanged files for any input combination&lt;/li&gt;
&lt;li&gt;Ingestion deduplication logic is correct for all (changes × job_status) combinations&lt;/li&gt;
&lt;li&gt;Metadata JSON always conforms to the required schema regardless of input permissions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security &amp;amp; Portability
&lt;/h2&gt;

&lt;p&gt;Before publishing, we ensured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No hardcoded AWS account IDs&lt;/strong&gt; in any public source file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameterized ECR repository name&lt;/strong&gt; (&lt;code&gt;ecrRepositoryName&lt;/code&gt; CDK prop)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameterized REGION&lt;/strong&gt; in all shell scripts (&lt;code&gt;${AWS_REGION:-ap-northeast-1}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Masked screenshots&lt;/strong&gt; — AWS account IDs in console screenshots are covered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.gitignore&lt;/code&gt; coverage&lt;/strong&gt; — &lt;code&gt;cdk.context.json&lt;/code&gt;, &lt;code&gt;cdk.out/&lt;/code&gt;, &lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;.hypothesis/&lt;/code&gt; all excluded&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Runtime deployment&lt;/strong&gt; for the Pipecat Voice Agent (currently requires CLI — CloudFormation support pending)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail/EventBridge mode&lt;/strong&gt; for Transfer Family ingestion (near-real-time event-based detection instead of 5-minute polling)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end SFTP upload test&lt;/strong&gt; with actual SSH keys and partner simulation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  End-to-End Architecture Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     ┌─────────────────┐     ┌──────────────────────────┐
│ External     │     │ Transfer Family │     │ FSx for ONTAP            │
│ Partner      │────▶│ SFTP Server     │────▶│ S3 Access Point          │
│ (SFTP)       │     └─────────────────┘     │ (data stays on FSxN)     │
└──────────────┘                              └────────────┬─────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Metadata Generator Lambda   │
                                            │ (admin-managed permissions) │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ KB Auto-Sync / Ingestion    │
                                            │ Trigger Lambda              │
                                            └──────────────┬──────────────┘
                                                           │
                                            ┌──────────────▼──────────────┐
                                            │ Amazon Bedrock              │
                                            │ Knowledge Base              │
                                            └──────────────┬──────────────┘
                                                           │
┌──────────────┐     ┌─────────────────┐     ┌────────────▼─────────────┐
│ End User     │────▶│ Smart Routing   │────▶│ Permission-Aware RAG     │
│ (Chat/Voice) │     │ (Haiku/Sonnet/  │     │ (fail-closed: missing    │
└──────────────┘     │  Opus)          │     │  metadata = excluded)    │
                     └─────────────────┘     └──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RAG retrieval path is designed to fail closed: if permission metadata is missing, malformed, or unverifiable for a document, that document is excluded from retrieval results rather than exposed broadly. This is the core safety boundary of the permission-aware RAG design.&lt;/p&gt;
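
&lt;p&gt;A fail-closed check along these lines might look like the sketch below. It assumes the metadata fields shown earlier (&lt;code&gt;allowed_sids&lt;/code&gt;, &lt;code&gt;allowed_uids&lt;/code&gt;, &lt;code&gt;allowed_gids&lt;/code&gt;) and assumes that a match on any one of them grants access; the function name and that matching rule are illustrative, and verification of where the metadata came from is out of scope here.&lt;/p&gt;

```python
from typing import Any, Dict, Optional, Set

PERMISSION_FIELDS = ("allowed_sids", "allowed_uids", "allowed_gids")


def is_retrievable(
    metadata: Optional[Dict[str, Any]], principals: Dict[str, Set[str]]
) -> bool:
    """Return True only when valid metadata explicitly allows the caller.

    `principals` maps each permission field to the caller's identifiers.
    Missing or malformed metadata always yields False (fail closed).
    """
    if not isinstance(metadata, dict):
        return False
    allowed: Dict[str, Set[str]] = {}
    for name in PERMISSION_FIELDS:
        value = metadata.get(name)
        if not isinstance(value, list):
            return False
        if not all(isinstance(item, str) for item in value):
            return False
        allowed[name] = set(value)
    # Assumed rule: one matching SID, UID, or GID is sufficient.
    return any(
        not allowed[name].isdisjoint(principals.get(name, set()))
        for name in PERMISSION_FIELDS
    )
```

&lt;p&gt;Every early return is a denial, so a new failure mode in metadata parsing defaults to hiding the document rather than exposing it.&lt;/p&gt;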




&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;p&gt;v4.2 is production-oriented, but a few items remain follow-up work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KB Auto-Sync&lt;/strong&gt; currently updates inventory when &lt;code&gt;StartIngestionJob&lt;/code&gt; is accepted rather than when the job reaches &lt;code&gt;SUCCEEDED&lt;/code&gt;. Failed ingestion jobs may mask unprocessed changes until the pending/commit model is implemented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer Family ingestion&lt;/strong&gt; is implemented and unit-tested; full partner-style E2E validation with SSH keys is still planned. The current auto-sync path focuses on detecting additions and updates — delete reconciliation is follow-up work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Runtime&lt;/strong&gt; deployment automation is not yet CloudFormation-based; the Pipecat Voice Agent requires CLI/SDK deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice sessions&lt;/strong&gt; require production policies for authentication, rate limiting, transcript retention, and sanitized logging before production rollout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Routing&lt;/strong&gt; emits routing metrics, but monthly cost dashboards, budget enforcement, and savings-vs-baseline reporting are follow-up work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail-closed enforcement&lt;/strong&gt; happens in the retrieval filtering layer: documents without valid, trusted permission metadata are excluded before the model receives context. Audit events for retrieval decisions (&lt;code&gt;DocumentSuppressedByPermission&lt;/code&gt;) are candidates for the next release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual high-cost or preview model selection (GPT-5.5) should be governed by application-level authorization and audited separately from automatic routing. The networking model — public Transfer Family endpoint vs VPC-hosted endpoint, partner IP allowlists, and private DNS requirements — should be selected per customer environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Should Care About v4.2?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI platform teams&lt;/strong&gt; get model routing that balances quality and cost without manual intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt; get administrator-derived permission metadata and explicit IAM protection against metadata overwrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data teams&lt;/strong&gt; get automatic KB synchronization from FSx for ONTAP through S3 Access Points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partners and SIs&lt;/strong&gt; get an SFTP-to-RAG ingestion path for customers who exchange documents with external organizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations teams&lt;/strong&gt; get guardrails for FSx ONTAP automation actions with conditional write protection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application teams&lt;/strong&gt; get a WebRTC voice strategy with REST fallback.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;v4.2 moves the permission-aware RAG system from a secure document Q&amp;amp;A application toward an enterprise ingestion and interaction platform.&lt;/p&gt;

&lt;p&gt;Smart Routing reduces model cost without removing access to stronger models. Transfer Family ingestion lets partners keep using SFTP while documents land directly on FSx for ONTAP through S3 Access Points. KB Auto-Sync keeps Bedrock Knowledge Bases fresh, Capacity Guardrails make ONTAP automation safer, and WebRTC Voice Chat opens a lower-friction interaction path.&lt;/p&gt;

&lt;p&gt;The common theme is the same as the FSx for ONTAP S3 Access Points pattern series: keep enterprise file data on FSx for ONTAP, expose it safely through S3-compatible access paths, and automate around it with serverless and managed AWS services.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG" rel="noopener noreferrer"&gt;FSx-for-ONTAP-Agentic-Access-Aware-RAG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG/releases/tag/v4.2.0" rel="noopener noreferrer"&gt;v4.2.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related series&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;FSx for ONTAP S3 Access Points Serverless Patterns&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Blog&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/blogs/storage/secure-sftp-file-sharing-with-aws-transfer-family-amazon-fsx-for-netapp-ontap-and-s3-access-points/" rel="noopener noreferrer"&gt;Secure SFTP file sharing with AWS Transfer Family, Amazon FSx for NetApp ONTAP, and S3 Access Points&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Docs&lt;/strong&gt;: &lt;a href="https://docs.aws.amazon.com/transfer/latest/userguide/fsx-s3-access-points.html" rel="noopener noreferrer"&gt;Access your FSx for NetApp ONTAP file systems with Transfer Family&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>amazonfsxfornetappontap</category>
      <category>serverless</category>
      <category>amazonbedrock</category>
    </item>
    <item>
      <title>FPolicy Event-Driven Pipeline, Multi-Account StackSets, and Cost Optimization — FSx for ONTAP S3 Access Points, Phase 10</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Thu, 14 May 2026 04:44:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6</link>
      <guid>https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;Phase 10&lt;/strong&gt; of the FSx for ONTAP S3 Access Points serverless pattern library. Building on &lt;a href="https://dev.to/yoshikifujiwara/production-rollout-vpc-endpoint-auto-detection-and-the-cdk-no-go-fsx-for-ontap-s3-access-3lni"&gt;Phase 9&lt;/a&gt;, Phase 10 delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy event-driven integration&lt;/strong&gt;: ONTAP FPolicy → ECS Fargate TCP server → SQS → EventBridge custom bus. The shared event-ingestion pipeline is verified end-to-end; UC-specific dispatch follows in Phase 11.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account StackSets&lt;/strong&gt;: All 17 UC templates validated for StackSets compatibility (0 errors) + admin/execution role templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UC-specific alarm profiles&lt;/strong&gt;: BATCH / REALTIME / HIGH_VOLUME — three profiles with workload-appropriate thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt;: Dynamic MaxConcurrency controller + business-hours scheduling (&lt;code&gt;rate(1 hour)&lt;/code&gt; vs &lt;code&gt;rate(6 hours)&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E2E verification&lt;/strong&gt;: NFSv3 ✅, NFSv4.0 ✅, NFSv4.1 ✅, SMB ✅, NFSv4.2 ❌ (unsupported by ONTAP FPolicy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In short&lt;/strong&gt;: Phase 9 completed the operational baseline. Phase 10 builds and verifies the shared event-ingestion pipeline that the pattern library has needed since Phase 1 — without waiting for AWS to ship native S3AP notifications. UC-specific dispatch wiring follows in Phase 11.&lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;Repository stats&lt;/strong&gt;: 17 industry use cases + event-driven FPolicy + 6 FlexCache/FlexClone patterns | 1,499+ tests | 126 test files | Python 3.12 + CloudFormation (SAM Transform)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1a. Trigger Mode Decision Guide
&lt;/h2&gt;

&lt;p&gt;Before diving into the FPolicy implementation, here is the decision framework for choosing between the three trigger modes this library supports:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Choose when&lt;/th&gt;
&lt;th&gt;Avoid when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;POLLING&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hourly or batch processing is acceptable; simplest operating model&lt;/td&gt;
&lt;td&gt;Sub-minute detection is required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EVENT_DRIVEN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Near-real-time ingestion is required and event loss during reconnect is acceptable&lt;/td&gt;
&lt;td&gt;Compliance requires durable event capture and Persistent Store is not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HYBRID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You need faster detection plus periodic reconciliation to fill gaps&lt;/td&gt;
&lt;td&gt;You want the simplest operating model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;POLLING&lt;/th&gt;
&lt;th&gt;EVENT_DRIVEN&lt;/th&gt;
&lt;th&gt;HYBRID&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Detection latency&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Seconds + periodic catch-up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost (infra only)&lt;/td&gt;
&lt;td&gt;~$6-21&lt;/td&gt;
&lt;td&gt;~$32-60&lt;/td&gt;
&lt;td&gt;~$42-86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational complexity&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event durability&lt;/td&gt;
&lt;td&gt;High (full scan each time)&lt;/td&gt;
&lt;td&gt;Medium (gap during restart)&lt;/td&gt;
&lt;td&gt;High (reconciliation fills gaps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP dependency&lt;/td&gt;
&lt;td&gt;None (S3 AP only)&lt;/td&gt;
&lt;td&gt;High (FPolicy config)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Decision flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Real-time detection not required → &lt;strong&gt;POLLING&lt;/strong&gt; (start here for most workloads)&lt;/li&gt;
&lt;li&gt;Real-time required + Persistent Store available (ONTAP 9.14.1+) → &lt;strong&gt;EVENT_DRIVEN&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Real-time required + no Persistent Store → &lt;strong&gt;HYBRID&lt;/strong&gt; (polling fills gaps)&lt;/li&gt;
&lt;/ol&gt;
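
&lt;p&gt;The decision flow above can be sketched as a small function. This is an illustrative sketch only: the &lt;code&gt;choose_trigger_mode&lt;/code&gt; name and its two boolean inputs are assumptions for this example, not part of the repository.&lt;/p&gt;

```python
from enum import Enum


class TriggerMode(str, Enum):
    POLLING = "POLLING"
    EVENT_DRIVEN = "EVENT_DRIVEN"
    HYBRID = "HYBRID"


def choose_trigger_mode(needs_realtime, persistent_store_available):
    """Mirror the three-step decision flow (hypothetical helper)."""
    # Step 1: no real-time requirement, so start with polling.
    if not needs_realtime:
        return TriggerMode.POLLING
    # Step 2: real-time plus Persistent Store (ONTAP 9.14.1+).
    if persistent_store_available:
        return TriggerMode.EVENT_DRIVEN
    # Step 3: real-time without Persistent Store; polling fills gaps.
    return TriggerMode.HYBRID
```

&lt;p&gt;For example, &lt;code&gt;choose_trigger_mode(True, False)&lt;/code&gt; returns &lt;code&gt;TriggerMode.HYBRID&lt;/code&gt;.&lt;/p&gt;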

&lt;p&gt;Full guide: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/trigger-mode-decision-guide.md" rel="noopener noreferrer"&gt;Trigger Mode Decision Guide&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. FPolicy Event-Driven Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Background: why FPolicy
&lt;/h3&gt;

&lt;p&gt;Every UC in this pattern library runs on a polling model: EventBridge Scheduler → Discovery Lambda → ListObjectsV2. This works, but it means latency is bounded by the polling interval (typically 1 hour). AWS still does not support &lt;code&gt;GetBucketNotificationConfiguration&lt;/code&gt; for S3 Access Points attached to FSx for ONTAP volumes (FR-2 remains open).&lt;/p&gt;

&lt;p&gt;ONTAP FPolicy is a file-operation notification framework built into every ONTAP system. In external server mode, it sends TCP notifications for create/write/delete/rename events to a registered server. By connecting this to AWS services, we get near-real-time event-driven processing without waiting for FR-2.&lt;/p&gt;

&lt;p&gt;This implementation builds on &lt;a href="https://github.com/YhunerFSY/ontap-fpolicy-aws-integration" rel="noopener noreferrer"&gt;Shengyu Fang's reference implementation&lt;/a&gt;, adapted for the 17-UC pattern library architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 API Does Not Remove File-System Semantics
&lt;/h3&gt;

&lt;p&gt;S3 Access Points for FSx for ONTAP expose file data through S3 APIs, but authorization is a two-layer model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWS-side authorization&lt;/strong&gt;: IAM identity-based policy, S3 Access Point resource policy, VPC endpoint policy, SCP — all relevant policies are evaluated and all must permit the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File-system-side authorization&lt;/strong&gt;: The file system identity (UNIX UID or Windows domain\user) associated with the access point determines what file operations are authorized based on that user's permissions on the underlying volume&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means that least-privilege design must cover &lt;strong&gt;both&lt;/strong&gt; AWS IAM and ONTAP file permissions. A common mistake is securing only the IAM layer while using a root-equivalent file system identity (UID 0), which grants full access to all files regardless of IAM restrictions.&lt;/p&gt;

&lt;p&gt;Key behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the file system user has read-only access, write requests through the access point are blocked — even if IAM permits &lt;code&gt;s3:PutObject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Attaching an S3 access point does &lt;strong&gt;not&lt;/strong&gt; change the volume's behavior when accessed via NFS or SMB&lt;/li&gt;
&lt;li&gt;Block Public Access is always enabled and cannot be changed for FSx for ONTAP access points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the full authorization model documentation, see &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/s3ap-authorization-model.md" rel="noopener noreferrer"&gt;S3AP Authorization Model&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FSx ONTAP SVM (file operations: create/write/delete/rename)
│
│ TCP (port 9898, async mode)
▼
FPolicy External Server (ECS Fargate, ARM64 Python 3.12)
│
├─ [Near-real-time] → SQS Ingestion Queue
│                        │
│                        │ Event Source Mapping
│                        ▼
│                     Bridge Lambda → EventBridge Custom Bus
│                                          │
│                                   UC1 reference rule (Phase 10)
│                                          │
│                                   UC1 Step Functions
│
│                                   ── Phase 11 ──
│                                   UC-specific dispatch rules
│                                   → Step Functions / Lambda (per-UC)
│
└─ [Batch] → JSON Lines log (FSxN S3AP) → Log Query Lambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ONTAP initiates the TCP connection to the FPolicy server — not the other way around. This means the server simply listens on a port. Because ONTAP maintains a persistent TCP control channel with keep-alive, Lambda is not viable (15-minute timeout). ECS Fargate provides the long-running TCP listener without OS management overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not NLB?
&lt;/h3&gt;

&lt;p&gt;The initial design placed an NLB in front of Fargate for IP stability. In our AWS verification, the NLB path established a TCP connection, but the FPolicy handshake did not complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional verification (2026-05-14)&lt;/strong&gt;: We tested both &lt;code&gt;preserve_client_ip.enabled=true&lt;/code&gt; and &lt;code&gt;false&lt;/code&gt; on the NLB target group. In both configurations, ONTAP did not establish an FPolicy session through the NLB. The only connections observed from the NLB IP were health checks (TCP connect → immediate close at 10-second intervals). No FPolicy NEGO_REQ was received via the NLB path.&lt;/p&gt;

&lt;p&gt;One plausible explanation is that ONTAP FPolicy's external-engine expects a direct TCP connection to the primary-servers IP. When the NLB forwards the connection to a Fargate task with a different IP, the FPolicy session establishment conditions are not met — possibly because ONTAP validates the connection endpoint or because the NLB's connection lifecycle (idle timeout, deregistration delay) interferes with the persistent control channel that FPolicy requires.&lt;/p&gt;

&lt;p&gt;This remains documented as an observed deployment limitation in our environment (FSxN ONTAP 9.17.1P6, internal NLB with IP targets), not a universal NLB claim. If your environment differs, testing the NLB path is straightforward — set the NLB IP as the external-engine primary-servers and check &lt;code&gt;vserver fpolicy show-engine&lt;/code&gt; for connection state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Fargate task direct IP connection. IP stability is handled by an EventBridge-triggered Lambda that updates the ONTAP external-engine configuration when the Fargate task IP changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ECS Task State Change (RUNNING) → EventBridge Rule → IP Updater Lambda
→ ONTAP REST API: disable policy → update engine primary_servers → enable policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
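
&lt;p&gt;A minimal sketch of the two pure steps in that flow follows: pulling the task IP out of the ECS event, and building the ordered ONTAP REST calls. The function names are hypothetical. The &lt;code&gt;privateIPv4Address&lt;/code&gt; detail name, the REST paths, and the &lt;code&gt;priority&lt;/code&gt; field on re-enable are assumptions to verify against the ECS event reference and the ONTAP REST API reference for your version. No HTTP call is made here.&lt;/p&gt;

```python
def task_private_ip(event):
    """Return the task ENI private IPv4 from an ECS Task State Change event, or None."""
    for attachment in event.get("detail", {}).get("attachments", []):
        for item in attachment.get("details", []):
            # Assumption: the ENI attachment lists its IP under this name.
            if item.get("name") == "privateIPv4Address":
                return item.get("value")
    return None


def engine_update_plan(svm_uuid, policy_name, engine_name, new_ip):
    """Ordered (method, path, body) tuples: disable policy, repoint engine, enable policy."""
    # Assumption: FPolicy REST resources live under this base path.
    base = f"/api/protocols/fpolicy/{svm_uuid}"
    return [
        ("PATCH", f"{base}/policies/{policy_name}", {"enabled": False}),
        ("PATCH", f"{base}/engines/{engine_name}", {"primary_servers": [new_ip]}),
        # Assumption: a priority accompanies enabled=True when re-enabling.
        ("PATCH", f"{base}/policies/{policy_name}", {"enabled": True, "priority": 1}),
    ]
```

&lt;p&gt;Keeping the plan as data makes the ordering easy to unit test before wiring in an HTTP client and Secrets Manager credentials.&lt;/p&gt;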



&lt;p&gt;The direct-IP model assumes a single active Fargate task (&lt;code&gt;DesiredCount: 1&lt;/code&gt;) and requires network reachability from the FSxN SVM data LIFs to the task ENI on the FPolicy TCP port. This design prioritizes connection stability over horizontal scalability; multi-task active-active configurations are not supported due to FPolicy session constraints. Security groups must allow ONTAP-initiated inbound connections on port 9898. During Fargate task restarts, event handling depends on the FPolicy policy's &lt;code&gt;is-mandatory&lt;/code&gt; setting: with &lt;code&gt;is-mandatory=false&lt;/code&gt; (our configuration), file operations continue unblocked but notifications are dropped until the new task connects. See the Event durability note below for Persistent Store guidance.&lt;/p&gt;

&lt;h3&gt;
  
  
  TriggerMode parameter
&lt;/h3&gt;

&lt;p&gt;Phase 10 introduces the &lt;code&gt;TriggerMode&lt;/code&gt; parameter scaffolding and verifies the shared FPolicy → SQS → EventBridge pipeline end-to-end. A reference implementation is deployed in the legal-compliance (UC1) template. UC-specific Step Functions dispatch rules are intentionally deferred to Phase 11.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Phase 10 behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;POLLING&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Existing EventBridge Scheduler + Discovery Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EVENT_DRIVEN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shared FPolicy event pipeline enabled; UC-specific dispatch wiring is Phase 11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HYBRID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Polling remains active; event-driven deduplication path prepared for Phase 11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Default &lt;code&gt;POLLING&lt;/code&gt; ensures zero impact on existing deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  NFSv3 write-complete delay
&lt;/h3&gt;

&lt;p&gt;When FPolicy fires a notification, the file write may not be complete — particularly with NFSv3 which lacks close semantics. The server inserts a configurable delay (&lt;code&gt;WRITE_COMPLETE_DELAY_SEC&lt;/code&gt;, default 5s) after receiving NOTI_REQ, and Step Functions include retry logic for incomplete files.&lt;/p&gt;
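
&lt;p&gt;The delay can be sketched as follows, assuming an asyncio-based server. Only &lt;code&gt;WRITE_COMPLETE_DELAY_SEC&lt;/code&gt; and its 5-second default come from the text above; the function names and the &lt;code&gt;forward&lt;/code&gt; callback are hypothetical.&lt;/p&gt;

```python
import asyncio
import os


def write_complete_delay_sec(env=None):
    """Read WRITE_COMPLETE_DELAY_SEC, falling back to the 5-second default."""
    env = os.environ if env is None else env
    # Clamp negative values to zero so asyncio.sleep never gets a negative delay.
    return max(0.0, float(env.get("WRITE_COMPLETE_DELAY_SEC", "5")))


async def forward_after_delay(event, forward, env=None):
    """Wait for the configured delay, then hand the event to `forward`."""
    await asyncio.sleep(write_complete_delay_sec(env))
    return forward(event)
```

&lt;p&gt;A fixed delay is a heuristic, not a guarantee, which is why the downstream retry logic for incomplete files still matters.&lt;/p&gt;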

&lt;h3&gt;
  
  
  Event durability note
&lt;/h3&gt;

&lt;p&gt;This Phase 10 implementation is designed for near-real-time processing, not end-to-end durable event capture during Fargate task restarts. With &lt;code&gt;is-mandatory=false&lt;/code&gt;, ONTAP drops notifications when no FPolicy server is connected — file operations continue unblocked but events are lost. Environments that cannot tolerate event loss should evaluate ONTAP FPolicy Persistent Store (ONTAP 9.14.1+), available for asynchronous non-mandatory external FPolicy policies. Persistent Store queues events on the SVM during server disconnection and can replay them when the external server reconnects. Note that queue sizing, replay handling, and deduplication require application-level design. This is a Phase 11+ candidate (design-dependent).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note (Phase 12 update)&lt;/strong&gt;: This Phase 10 article documents the initial event-durability boundary. Persistent Store replay validation is covered in &lt;a href="https://dev.to/yoshikifujiwara"&gt;Phase 12&lt;/a&gt;, where replay behavior was tested for 5-event and 20-event disconnect scenarios with zero event loss confirmed. Use the &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/deployment-profiles.md" rel="noopener noreferrer"&gt;Deployment Profiles&lt;/a&gt; guide to choose the appropriate durability level for your workload.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Deployment Profiles — From PoC to Compliance
&lt;/h3&gt;

&lt;p&gt;The event-driven FPolicy pattern supports three deployment profiles, each with clear boundaries for event loss tolerance and operational complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;PoC/Demo&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;th&gt;Compliance-sensitive&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy Server&lt;/td&gt;
&lt;td&gt;Fargate (direct IP)&lt;/td&gt;
&lt;td&gt;EC2 static IP or NLB&lt;/td&gt;
&lt;td&gt;EC2 static IP + NLB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is-mandatory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;true&lt;/code&gt; (ONTAP 9.15.1+)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;true&lt;/code&gt; (ONTAP 9.15.1+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store&lt;/td&gt;
&lt;td&gt;Not required&lt;/td&gt;
&lt;td&gt;Recommended&lt;/td&gt;
&lt;td&gt;Required (ONTAP 9.14.1+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry / Dedup&lt;/td&gt;
&lt;td&gt;Best-effort&lt;/td&gt;
&lt;td&gt;DynamoDB idempotency&lt;/td&gt;
&lt;td&gt;DynamoDB + S3 Object Lock lineage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alarm Profile&lt;/td&gt;
&lt;td&gt;Minimal (error only)&lt;/td&gt;
&lt;td&gt;Full (latency + error + backlog)&lt;/td&gt;
&lt;td&gt;Full + audit trail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event Loss Tolerance&lt;/td&gt;
&lt;td&gt;Acceptable (30-60s gap)&lt;/td&gt;
&lt;td&gt;Near-zero (retry compensates)&lt;/td&gt;
&lt;td&gt;Zero (Persistent Store + audit)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key design decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is-mandatory=true&lt;/code&gt;&lt;/strong&gt; (ONTAP 9.15.1+): Blocks file operations when the FPolicy server is unavailable — prevents event loss but impacts availability. Use only with redundant server deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store&lt;/strong&gt; (ONTAP 9.14.1+): Buffers events in a dedicated SVM volume during server disconnection. Events are replayed in order upon reconnection. Sizing: 1 GB ≈ 2M events at ~500 bytes each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay recovery time&lt;/strong&gt;: 100K buffered events at 100 events/sec = ~17 minutes to catch up.&lt;/li&gt;
&lt;/ul&gt;
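
&lt;p&gt;The sizing and replay figures above are plain arithmetic, reproduced here so you can plug in your own numbers. The helper names are hypothetical, and 1 GB is taken as 10^9 bytes.&lt;/p&gt;

```python
def events_per_gb(avg_event_bytes=500):
    """Approximate events that fit in 1 GB (10**9 bytes) of Persistent Store."""
    return 10**9 // avg_event_bytes


def replay_minutes(buffered_events, replay_rate_per_sec):
    """Minutes needed to drain the buffer at a steady replay rate."""
    return buffered_events / replay_rate_per_sec / 60
```

&lt;p&gt;With the defaults, &lt;code&gt;events_per_gb()&lt;/code&gt; gives 2,000,000, and &lt;code&gt;replay_minutes(100_000, 100)&lt;/code&gt; is 1,000 seconds, or about 17 minutes.&lt;/p&gt;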

&lt;p&gt;The progression path is incremental: PoC → Production → Compliance-sensitive, adding capabilities at each stage without redesigning the core architecture.&lt;/p&gt;

&lt;p&gt;Full profile documentation: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/deployment-profiles.md" rel="noopener noreferrer"&gt;Deployment Profiles&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. E2E Verification Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Protocol support matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Mount Option&lt;/th&gt;
&lt;th&gt;FPolicy NOTI_REQ&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NFSv3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vers=3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Immediate&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4.0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vers=4.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Immediate&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4.1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vers=4.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Immediate&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4.2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vers=4.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ Not sent&lt;/td&gt;
&lt;td&gt;Unsupported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4 (auto)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vers=4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ Not sent&lt;/td&gt;
&lt;td&gt;Negotiates to 4.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMB/CIFS&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding&lt;/strong&gt;: &lt;code&gt;mount -o vers=4&lt;/code&gt; on modern Linux negotiates to NFSv4.2, which ONTAP FPolicy does not support. Always pin a supported version explicitly, for example &lt;code&gt;vers=4.1&lt;/code&gt;. The NFSv4.2 limitation is documented in &lt;a href="https://kb.netapp.com/onprem/ontap/da/NAS/FAQ:_FPolicy:_Auditing" rel="noopener noreferrer"&gt;NetApp's FPolicy Auditing FAQ&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ONTAP version note&lt;/strong&gt;: NFSv4.1 FPolicy monitoring support was introduced in ONTAP 9.15.1. Earlier versions support SMB, NFSv3, and NFSv4.0 only. Our test environment runs ONTAP 9.17.1P6, which includes NFSv4.1 support. See &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/plan-fpolicy-event-config-concept.html" rel="noopener noreferrer"&gt;NetApp FPolicy event configuration documentation&lt;/a&gt; for the full protocol support matrix by ONTAP version.&lt;/p&gt;
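
&lt;p&gt;One way to catch this during client onboarding is to check the mount options before relying on FPolicy events. The sketch below is a hypothetical helper: it only parses an option string such as the fourth field of a &lt;code&gt;/proc/mounts&lt;/code&gt; entry, and it rejects a bare &lt;code&gt;vers=4&lt;/code&gt; because that may negotiate to 4.2.&lt;/p&gt;

```python
# Versions the matrix above lists as working; 4.1 additionally needs ONTAP 9.15.1+.
FPOLICY_NFS_VERSIONS = {"3", "4.0", "4.1"}


def nfs_version(mount_options):
    """Extract the vers= or nfsvers= value from a comma-separated option string."""
    for option in mount_options.split(","):
        key, _, value = option.strip().partition("=")
        if key in ("vers", "nfsvers"):
            return value
    return None


def fpolicy_will_notify(mount_options):
    """True only when the mount pins a version the matrix lists as working."""
    return nfs_version(mount_options) in FPOLICY_NFS_VERSIONS
```

&lt;p&gt;A mount with no explicit version returns &lt;code&gt;False&lt;/code&gt; here, which errs on the side of flagging the client for review.&lt;/p&gt;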

&lt;h3&gt;
  
  
  Path extraction bug fix
&lt;/h3&gt;

&lt;p&gt;ONTAP sends file paths in XML format within NOTI_REQ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;PathNameType&amp;gt;&lt;/span&gt;WIN_NAME&lt;span class="nt"&gt;&amp;lt;/PathNameType&amp;gt;&amp;lt;PathName&amp;gt;&lt;/span&gt;\file.txt&lt;span class="nt"&gt;&amp;lt;/PathName&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initial regex extraction left residual XML tags in the &lt;code&gt;file_path&lt;/code&gt; field. Fixed by adding an &lt;code&gt;_extract_xml_value()&lt;/code&gt; helper with multi-tag fallback and residual tag stripping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before fix&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;PathNameType&amp;gt;WIN_NAME&amp;lt;/PathNameType&amp;gt;&amp;lt;PathName&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;file.txt&amp;lt;/PathName&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After fix&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file.txt"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  volume_name / svm_name resolution
&lt;/h3&gt;

&lt;p&gt;ONTAP's NOTI_REQ body does not always include volume and SVM names in a parseable location. Resolution strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract from NEGO_REQ session context (SVM name available at handshake)&lt;/li&gt;
&lt;li&gt;Fall back to environment variables (&lt;code&gt;SVM_NAME&lt;/code&gt;, &lt;code&gt;VOLUME_NAME&lt;/code&gt;) set in the ECS task definition&lt;/li&gt;
&lt;/ol&gt;
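
&lt;p&gt;The two-step fallback can be sketched like this. The &lt;code&gt;resolve_names&lt;/code&gt; function and the shape of &lt;code&gt;session_context&lt;/code&gt; (a dict filled in at handshake time) are assumptions for illustration; only the &lt;code&gt;SVM_NAME&lt;/code&gt; and &lt;code&gt;VOLUME_NAME&lt;/code&gt; variable names come from the text above.&lt;/p&gt;

```python
import os


def resolve_names(session_context, env=None):
    """Prefer names captured at NEGO_REQ time, then fall back to environment variables."""
    env = os.environ if env is None else env
    # An empty or missing session value triggers the environment fallback.
    svm = session_context.get("svm_name") or env.get("SVM_NAME", "")
    volume = session_context.get("volume_name") or env.get("VOLUME_NAME", "")
    return {"svm_name": svm, "volume_name": volume}
```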

&lt;h3&gt;
  
  
  Complete E2E flow (verified)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NFSv3 file create (tee /mnt/fsxn/file.txt)
→ ONTAP FPolicy NOTI_REQ
→ Fargate FPolicy Server receives event
→ SQS SendMessage
→ Bridge Lambda → EventBridge Custom Bus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual EventBridge event&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FPolicy File Operation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fsxn.fpolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2175e878-1e0c-48ef-a8b3-53664d5d5b06"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operation_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test-eb-e2e-1778707951.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"volume_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vol1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"svm_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FSxN_OnPre"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-13T21:32:37.680626+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"client_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.0.10.67"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
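
&lt;p&gt;For orientation, here is a minimal sketch of how a bridge function might turn SQS records into EventBridge &lt;code&gt;PutEvents&lt;/code&gt; entries with the source and detail-type shown above. The function names are hypothetical, the sketch assumes each SQS message body is the JSON event detail, and it stops short of calling AWS.&lt;/p&gt;

```python
import json


def to_put_events_entries(sqs_event, event_bus_name):
    """Map each SQS record body (a JSON FPolicy event) to a PutEvents entry."""
    entries = []
    for record in sqs_event.get("Records", []):
        detail = json.loads(record["body"])  # raises on malformed JSON
        entries.append({
            "Source": "fsxn.fpolicy",
            "DetailType": "FPolicy File Operation",
            "Detail": json.dumps(detail),
            "EventBusName": event_bus_name,
        })
    return entries


def batches_of_ten(entries):
    """PutEvents accepts at most 10 entries per call, so split accordingly."""
    return [entries[i:i + 10] for i in range(0, len(entries), 10)]
```

&lt;p&gt;Each batch would then be passed as &lt;code&gt;Entries&lt;/code&gt; to an EventBridge client, with failed entries checked in the response.&lt;/p&gt;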






&lt;h2&gt;
  
  
  3. Unified UC Directory Structure
&lt;/h2&gt;

&lt;p&gt;Phase 10 introduces &lt;code&gt;event-driven-fpolicy/&lt;/code&gt; as a first-class shared pattern directory, using the same structure as the UC directories. It is not counted as one of the 17 industry UCs — it is a shared event-ingestion reference implementation that any UC can consume via EventBridge rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;event-driven-fpolicy/&lt;/span&gt;
&lt;span class="s"&gt;├── docs/&lt;/span&gt;                    &lt;span class="c1"&gt;# 8 languages (ja, en, ko, zh-CN, zh-TW, fr, de, es)&lt;/span&gt;
&lt;span class="s"&gt;│   ├── architecture.md&lt;/span&gt;      &lt;span class="c1"&gt;# + .en.md, .ko.md, etc.&lt;/span&gt;
&lt;span class="s"&gt;│   └── demo-guide.md&lt;/span&gt;
&lt;span class="s"&gt;├── functions/&lt;/span&gt;
&lt;span class="s"&gt;│   ├── ip_updater/&lt;/span&gt;          &lt;span class="c1"&gt;# Fargate IP → ONTAP REST API&lt;/span&gt;
&lt;span class="s"&gt;│   └── sqs_to_eventbridge/&lt;/span&gt;  &lt;span class="c1"&gt;# Bridge Lambda&lt;/span&gt;
&lt;span class="s"&gt;├── schemas/&lt;/span&gt;
&lt;span class="s"&gt;│   └── fpolicy-event-schema.json&lt;/span&gt;
&lt;span class="s"&gt;├── server/&lt;/span&gt;
&lt;span class="s"&gt;│   ├── Dockerfile&lt;/span&gt;           &lt;span class="c1"&gt;# ARM64 Python 3.12&lt;/span&gt;
&lt;span class="s"&gt;│   ├── fpolicy_server.py&lt;/span&gt;    &lt;span class="c1"&gt;# TCP listener + SQS sender&lt;/span&gt;
&lt;span class="s"&gt;│   └── requirements.txt&lt;/span&gt;
&lt;span class="s"&gt;├── tests/&lt;/span&gt;
&lt;span class="s"&gt;├── README.md&lt;/span&gt;                &lt;span class="c1"&gt;# + 7 language variants&lt;/span&gt;
&lt;span class="s"&gt;├── template.yaml&lt;/span&gt;            &lt;span class="c1"&gt;# Fargate deployment (ComputeType=fargate)&lt;/span&gt;
&lt;span class="s"&gt;└── template-ec2.yaml&lt;/span&gt;        &lt;span class="c1"&gt;# EC2 deployment (ComputeType=ec2)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single &lt;code&gt;template.yaml&lt;/code&gt; with a &lt;code&gt;ComputeType&lt;/code&gt; parameter (fargate/ec2) uses CloudFormation Conditions to select the appropriate resource set. The EC2 variant uses a t4g.micro with a static private IP, so no IP update Lambda is needed, at roughly $4/month. The Fargate variant avoids EC2 management but requires task-IP tracking and has a higher baseline cost (about $10/month for Fargate compute alone, plus VPC Endpoints). Actual cost varies by region, runtime hours, and VPC Endpoint configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Multi-Account StackSets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  StackSets compatibility validator
&lt;/h3&gt;

&lt;p&gt;New validator &lt;code&gt;scripts/check_stacksets_compatibility.py&lt;/code&gt; checks all 17 UC templates for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoded Account IDs&lt;/strong&gt; — 12-digit numeric strings that would break in other accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource name uniqueness&lt;/strong&gt; — names must include &lt;code&gt;!Sub&lt;/code&gt; with AccountId or StackName&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export name collisions&lt;/strong&gt; — exports that would conflict across accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC/Subnet/SecurityGroup parameterization&lt;/strong&gt; — must not be hardcoded&lt;/li&gt;
&lt;/ol&gt;
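
&lt;p&gt;To illustrate the first check, here is a rough sketch of a hardcoded account ID scan. It is not the repository's &lt;code&gt;check_stacksets_compatibility.py&lt;/code&gt;; the function name and the regular expression are assumptions, and a real validator would also need an allow-list for legitimate 12-digit values.&lt;/p&gt;

```python
import re

# Standalone 12-digit runs; longer digit runs are not matched.
ACCOUNT_ID_PATTERN = re.compile(r"\b\d{12}\b")


def find_hardcoded_account_ids(template_text):
    """Return (line_number, match) pairs for standalone 12-digit numbers."""
    findings = []
    for line_number, line in enumerate(template_text.splitlines(), start=1):
        for match in ACCOUNT_ID_PATTERN.findall(line):
            findings.append((line_number, match))
    return findings
```

&lt;p&gt;A template that uses &lt;code&gt;${AWS::AccountId}&lt;/code&gt; inside &lt;code&gt;!Sub&lt;/code&gt; produces no findings, while a literal ARN with an embedded account number is reported with its line number.&lt;/p&gt;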

&lt;p&gt;&lt;strong&gt;Result: 17/17 templates, 0 errors, 0 warnings.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  StackSets role templates
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Template&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;shared/cfn/stacksets-admin.yaml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Admin account role for StackSet management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;shared/cfn/stacksets-execution.yaml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Target account execution role (least-privilege)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The execution role uses an Organization ID condition in its trust policy — accounts outside the Organization cannot assume it. Permissions are scoped to Lambda, Step Functions, DynamoDB, S3, CloudWatch, EventBridge, SNS, and Secrets Manager only.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic deployment
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;AutoDeployment: Enabled&lt;/code&gt; on the StackSet, new accounts joining the Organization automatically receive the UC templates. No manual intervention required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope note&lt;/strong&gt;: Phase 10 validates that templates can be distributed safely across accounts via StackSets (deployment compatibility). It does not yet validate cross-account FSxN S3AP data access, shared VPC ownership, or centralized operations across accounts. Those runtime cross-account patterns are Phase 11+ work.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Alarm Profiles and Cost Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  UC-specific alarm profiles
&lt;/h3&gt;

&lt;p&gt;Not all UCs have the same tolerance for failures. A batch genomics pipeline (UC3) tolerates higher failure rates than a real-time compliance monitor (UC12). Phase 10 introduces three profiles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Failure Rate Threshold&lt;/th&gt;
&lt;th&gt;Error Threshold&lt;/th&gt;
&lt;th&gt;Target Workloads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BATCH&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;3/hour&lt;/td&gt;
&lt;td&gt;Periodic batch processing (UC1-5, UC9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REALTIME&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;1/hour&lt;/td&gt;
&lt;td&gt;Real-time processing (UC10-14)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIGH_VOLUME&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;5/hour&lt;/td&gt;
&lt;td&gt;High-volume file processing (UC6-8, UC15-17)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each UC template now has an &lt;code&gt;AlarmProfile&lt;/code&gt; parameter (BATCH / REALTIME / HIGH_VOLUME / CUSTOM). The &lt;code&gt;CUSTOM&lt;/code&gt; option exposes &lt;code&gt;CustomFailureThreshold&lt;/code&gt; and &lt;code&gt;CustomErrorThreshold&lt;/code&gt; for fine-grained control.&lt;/p&gt;
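
&lt;p&gt;The profile selection can be sketched in Python as below. In the templates this mapping is presumably expressed with CloudFormation constructs rather than code; the dictionary keys and the &lt;code&gt;resolve_thresholds&lt;/code&gt; helper are hypothetical names, and only the threshold values come from the table above.&lt;/p&gt;

```python
# Threshold values copied from the profile table; key names are assumptions.
ALARM_PROFILES = {
    "BATCH": {"failure_rate_pct": 10.0, "errors_per_hour": 3},
    "REALTIME": {"failure_rate_pct": 5.0, "errors_per_hour": 1},
    "HIGH_VOLUME": {"failure_rate_pct": 15.0, "errors_per_hour": 5},
}


def resolve_thresholds(profile, custom_failure=None, custom_error=None):
    """Return the alarm thresholds for a profile name.

    CUSTOM takes its values from the two custom arguments, mirroring
    CustomFailureThreshold and CustomErrorThreshold.
    """
    if profile == "CUSTOM":
        if custom_failure is None or custom_error is None:
            raise ValueError("CUSTOM requires both custom thresholds")
        return {
            "failure_rate_pct": float(custom_failure),
            "errors_per_hour": int(custom_error),
        }
    if profile not in ALARM_PROFILES:
        raise ValueError(f"Unknown alarm profile: {profile}")
    return dict(ALARM_PROFILES[profile])


print(resolve_thresholds("REALTIME"))
print(resolve_thresholds("CUSTOM", custom_failure=7.5, custom_error=2))
```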

&lt;h3&gt;
  
  
  Dynamic MaxConcurrency controller
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;shared/max_concurrency_controller.py&lt;/code&gt; calculates optimal Map state parallelism based on actual file volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_max_concurrency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;detected_file_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ontap_rate_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_calls_per_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;upper_bound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;optimal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;detected_file_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ontap_rate_limit&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;api_calls_per_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;upper_bound&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This replaces the static &lt;code&gt;MaxConcurrency: 10&lt;/code&gt; from Phase 8. For 500 files with default settings, it calculates &lt;code&gt;min(500, 33, 40) = 33&lt;/code&gt;, which is 3.3x the parallelism of the static setting while staying within the default ONTAP rate limit (33 files × 3 API calls = 99, under the limit of 100).&lt;/p&gt;
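
&lt;p&gt;Which of the three terms wins depends on the input. The sketch below restates the same formula so that it runs on its own and evaluates four cases; the non-default &lt;code&gt;ontap_rate_limit=300&lt;/code&gt; is an arbitrary value chosen only to make the upper bound the limiting term.&lt;/p&gt;

```python
def calculate_max_concurrency(detected_file_count, ontap_rate_limit=100,
                              api_calls_per_file=3, upper_bound=40):
    # Same formula as the controller function shown above.
    optimal = min(
        detected_file_count,
        ontap_rate_limit // api_calls_per_file,
        upper_bound,
    )
    return max(optimal, 1)


cases = {
    "no files": calculate_max_concurrency(0),        # floor of 1 applies
    "few files": calculate_max_concurrency(5),       # file count is the limit
    "rate limited": calculate_max_concurrency(500),  # 100 // 3 = 33 is the limit
    "upper bound": calculate_max_concurrency(500, ontap_rate_limit=300),
}
print(cases)
```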

&lt;h3&gt;
  
  
  Business-hours cost scheduling
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;EnableCostScheduling=true&lt;/code&gt;, two EventBridge Scheduler schedules switch the polling frequency by time of day:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Period&lt;/th&gt;
&lt;th&gt;Schedule&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Business hours (weekday 09:00-18:00 JST)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rate(1 hour)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-hours (weekday 18:00-09:00 + weekends)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rate(6 hours)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;BusinessHoursStart&lt;/code&gt; and &lt;code&gt;BusinessHoursEnd&lt;/code&gt; parameters allow customization. The Cost Scheduler emits an &lt;code&gt;EstimatedMonthlySavings&lt;/code&gt; CloudWatch metric for visibility.&lt;/p&gt;
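
&lt;p&gt;A back-of-the-envelope estimate shows the scale of the reduction. The sketch assumes the baseline is hourly polling around the clock and treats the off-hours rate as an average, so the weekly counts are approximate. It is not the formula behind &lt;code&gt;EstimatedMonthlySavings&lt;/code&gt;, and the function name is hypothetical.&lt;/p&gt;

```python
HOURS_PER_WEEK = 7 * 24


def weekly_invocations(business_start=9, business_end=18, business_days=5,
                       business_interval_h=1, off_interval_h=6):
    """Approximate polling invocations per week under the two schedules."""
    business_hours = (business_end - business_start) * business_days
    off_hours = HOURS_PER_WEEK - business_hours
    return business_hours / business_interval_h + off_hours / off_interval_h


baseline = HOURS_PER_WEEK / 1      # assumed baseline: rate(1 hour), 24/7
scheduled = weekly_invocations()   # 45 business hours + 123 off-hours / 6
reduction = 1 - scheduled / baseline

print(baseline, scheduled, round(reduction, 2))
```

&lt;p&gt;With the default window this works out to about 65.5 invocations per week instead of 168, roughly 61% fewer. The cost impact depends on what each invocation costs in a given deployment.&lt;/p&gt;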

&lt;h3&gt;
  
  
  S3 Access Points Performance Considerations
&lt;/h3&gt;

&lt;p&gt;Key performance characteristics (from &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/accessing-data-via-s3-access-points.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Tens of milliseconds (consistent with S3 bucket access)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: Depends on the FSx file system's provisioned throughput capacity — S3 AP, NFS, and SMB all share the same throughput pool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object size limit&lt;/strong&gt;: 5 GB for uploads (PutObject); downloads (GetObject) can be larger&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage class&lt;/strong&gt;: &lt;code&gt;FSX_ONTAP&lt;/code&gt; only; SSE-FSX encryption only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Design implications for serverless pipelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lambda memory → network bandwidth&lt;/strong&gt;: Higher Lambda memory allocates more network bandwidth. For 10 MB file processing, 1,769 MB (1 vCPU) provides ~600 Mbps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step Functions Map concurrency&lt;/strong&gt;: Limit &lt;code&gt;MaxConcurrency&lt;/code&gt; based on FSx provisioned throughput. Formula: &lt;code&gt;fsxn_throughput / per_lambda_throughput&lt;/code&gt;. Example: 512 MBps ÷ 50 MBps per Lambda ≈ 10 concurrent executions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ListObjectsV2 pagination&lt;/strong&gt;: At most 1,000 keys are returned per page (MaxKeys=1000). Listing 10,000 files takes 10 pages; at ~50 ms per page that is ~500 ms minimum. Use Prefix filtering to reduce scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared throughput&lt;/strong&gt;: S3 AP, NFS, and SMB all share the same FSx throughput capacity. Account for existing NFS/SMB workloads when sizing Map concurrency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry strategy&lt;/strong&gt;: Use &lt;code&gt;botocore.config.Config(retries={"mode": "adaptive"})&lt;/code&gt; for automatic backoff on SlowDown (503) responses.&lt;/li&gt;
&lt;/ol&gt;
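
&lt;p&gt;The sizing arithmetic in items 2 to 4 can be sketched as follows. The function names and the &lt;code&gt;reserved_mbps&lt;/code&gt; parameter (headroom kept for existing NFS/SMB traffic) are assumptions for illustration; the 512 MBps, 50 MBps, and 50 ms figures are the example values from the list above.&lt;/p&gt;

```python
import math


def map_concurrency_from_throughput(fsx_mbps, per_lambda_mbps, reserved_mbps=0):
    """Concurrency that fits in the throughput left after NFS/SMB headroom."""
    usable = max(fsx_mbps - reserved_mbps, 0)
    return max(usable // per_lambda_mbps, 1)


def min_listing_time_ms(object_count, page_size=1000, page_latency_ms=50):
    """Lower bound for a sequential ListObjectsV2 walk over object_count keys."""
    pages = math.ceil(object_count / page_size)
    return pages * page_latency_ms


print(map_concurrency_from_throughput(512, 50))                     # 10
print(map_concurrency_from_throughput(512, 50, reserved_mbps=200))  # 6
print(min_listing_time_ms(10_000))                                  # 500
```

&lt;p&gt;Reserving 200 MBps for file-protocol clients drops the example from 10 to 6 concurrent executions, which is why existing NFS/SMB load belongs in the calculation.&lt;/p&gt;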

&lt;p&gt;Full analysis: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/s3ap-performance-considerations.md" rel="noopener noreferrer"&gt;S3AP Performance Considerations&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Test Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phase 10 new tests&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;All PASS ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property-based tests (Hypothesis)&lt;/td&gt;
&lt;td&gt;7 properties × 100-200 iterations&lt;/td&gt;
&lt;td&gt;All PASS ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existing tests (Phase 1-9)&lt;/td&gt;
&lt;td&gt;982&lt;/td&gt;
&lt;td&gt;No regressions ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1044+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;All PASS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Property-based tests
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;What it verifies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy event round-trip&lt;/td&gt;
&lt;td&gt;Serialize → deserialize produces equivalent object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MaxConcurrency bounds&lt;/td&gt;
&lt;td&gt;Result always ≥ 1 and ≤ upper_bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MaxConcurrency correctness&lt;/td&gt;
&lt;td&gt;Result matches the min() formula&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero files → 1&lt;/td&gt;
&lt;td&gt;Empty input never produces 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StackSets Account ID detection&lt;/td&gt;
&lt;td&gt;Known violations are always caught&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost savings non-negativity&lt;/td&gt;
&lt;td&gt;Estimated savings ≥ 0 for all inputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same rate → ~0 savings&lt;/td&gt;
&lt;td&gt;Equal business/off-hours rates produce near-zero savings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Validator results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Validator&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;check_s3ap_iam_patterns.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;17/17 clean ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;check_handler_names.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;87 handlers, 0 issues ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;check_conditional_refs.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;17 templates, 0 issues ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;check_stacksets_compatibility.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;17 templates, 0 errors ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;_check_sensitive_leaks.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;160 images, 0 leaks ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cfn-guard IAM security&lt;/td&gt;
&lt;td&gt;Advisory, 0 new violations ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  7. Deployment Learnings
&lt;/h2&gt;

&lt;p&gt;Several issues surfaced during AWS verification that are worth documenting:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NLB path: FPolicy handshake fails&lt;/td&gt;
&lt;td&gt;ONTAP FPolicy expects direct TCP to primary-servers IP; NLB target routing does not satisfy session establishment (tested with preserve_client_ip true and false)&lt;/td&gt;
&lt;td&gt;Direct Fargate IP + EventBridge IP auto-update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jsonschema 4.18+ fails on ARM64 Lambda&lt;/td&gt;
&lt;td&gt;rpds-py native dependency&lt;/td&gt;
&lt;td&gt;Pin to 4.17.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SCHEMA_PATH differs between Lambda and local&lt;/td&gt;
&lt;td&gt;Different working directories&lt;/td&gt;
&lt;td&gt;Fallback path resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guard Hook rejects Condition-based &lt;code&gt;Resource: "*"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Overly strict rule&lt;/td&gt;
&lt;td&gt;Updated rule to allow &lt;code&gt;Condition exists&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR pull fails in private subnet&lt;/td&gt;
&lt;td&gt;Missing VPC Endpoints&lt;/td&gt;
&lt;td&gt;Added ECR, STS, S3, Logs, SQS endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KEEP_ALIVE timeout race&lt;/td&gt;
&lt;td&gt;Server timeout = keep_alive_interval&lt;/td&gt;
&lt;td&gt;Increased to 300s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4 events not firing&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vers=4&lt;/code&gt; negotiates to unsupported 4.2&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;vers=4.1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  7a. Beyond AI/ML — Enterprise Workload Examples
&lt;/h2&gt;

&lt;p&gt;This pattern is not limited to AI/ML demos. The S3 Access Points architecture applies to any enterprise file data on FSx for ONTAP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SAP peripheral files and exported business documents&lt;/strong&gt; — IDoc exports, ABAP report outputs, BW data extracts. Process without changing SAP file interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EDI / HULFT landing zones&lt;/strong&gt; — Automatic validation and format conversion of received files. No changes to existing EDI/HULFT infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit evidence and compliance reports&lt;/strong&gt; — Periodic integrity checks, retention management, with NTFS permissions preserved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch output from EC2-based business applications&lt;/strong&gt; — Add serverless post-processing pipelines without changing application output paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scanned documents and regulated records&lt;/strong&gt; — OCR, classification, PII detection on documents stored for long-term retention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The design principle&lt;/strong&gt;: File data stays on FSx for ONTAP. S3 Access Points provide the bridge to AWS-native automation, AI/ML, and analytics services — without data movement, without changing existing NFS/SMB access patterns. Existing backup (SnapMirror), DR, and access controls remain unchanged.&lt;/p&gt;

&lt;p&gt;This positioning matters for partner and SI proposals: the value is not "replace your file server" but "connect your existing file data to AWS services without migration."&lt;/p&gt;

&lt;p&gt;Full examples with architecture diagrams: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/enterprise-workload-examples.md" rel="noopener noreferrer"&gt;Enterprise Workload Examples&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Next Phase Outlook
&lt;/h2&gt;

&lt;p&gt;Phase 10 established the shared event-ingestion pipeline (FPolicy → SQS → EventBridge). Phase 11 will wire those events into UC-specific processing. Candidates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TriggerMode rollout to all 17 UCs&lt;/strong&gt;: Expand the reference implementation from UC1 to all templates, with UC-specific EventBridge dispatch rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy → UC-specific Step Functions dispatch&lt;/strong&gt;: EventBridge rules matching file path prefixes/extensions to UC targets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;protobuf format evaluation&lt;/strong&gt;: ONTAP 9.15.1+ supports protobuf for higher-performance notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Account Observability live verification&lt;/strong&gt;: Deploy the shared-services-observability template and validate metric aggregation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store evaluation&lt;/strong&gt;: Phase 11+ design-dependent work for compliance-sensitive environments that cannot tolerate event loss during Fargate task restarts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FR-2 migration path&lt;/strong&gt;: When AWS ships native S3AP notifications, the TriggerMode parameter provides a clean migration — switch from &lt;code&gt;EVENT_DRIVEN&lt;/code&gt; to native events without changing UC logic&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Native S3AP Notifications Still Matter
&lt;/h3&gt;

&lt;p&gt;This FPolicy-based pipeline proves that customers need event-driven processing for FSx for ONTAP S3 Access Points. However, it also quantifies where a native AWS-managed notification feature would eliminate undifferentiated heavy lifting:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operational burden&lt;/th&gt;
&lt;th&gt;Current (FPolicy)&lt;/th&gt;
&lt;th&gt;With native notifications&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Long-running TCP listener&lt;/td&gt;
&lt;td&gt;Fargate 24/7 (~$30-50/month)&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate task IP tracking&lt;/td&gt;
&lt;td&gt;IP Updater Lambda + ONTAP REST API&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP external-engine reconfiguration&lt;/td&gt;
&lt;td&gt;On every deployment&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy protocol dependency&lt;/td&gt;
&lt;td&gt;NFSv4.2 not supported&lt;/td&gt;
&lt;td&gt;Protocol-independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event durability semantics&lt;/td&gt;
&lt;td&gt;Requires Persistent Store (ONTAP 9.14.1+)&lt;/td&gt;
&lt;td&gt;S3-equivalent at-least-once&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account event routing&lt;/td&gt;
&lt;td&gt;SQS → Bridge Lambda → EventBridge → cross-account&lt;/td&gt;
&lt;td&gt;Standard EventBridge rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Implementation complexity&lt;/strong&gt;: 15-20 CloudFormation resources and 2 Lambda functions (IP Updater + Bridge) for FPolicy, vs an estimated 3-5 resources for native EventBridge integration.&lt;/p&gt;

&lt;p&gt;The FPolicy implementation is not a replacement for native S3AP notifications — it is &lt;strong&gt;evidence of customer demand&lt;/strong&gt; and an interim event-driven pattern. The operational complexity documented here directly maps to the value a native feature would deliver.&lt;/p&gt;

&lt;p&gt;Full analysis: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/aws-feature-requests/native-s3ap-notifications-evidence.md" rel="noopener noreferrer"&gt;Native S3AP Notifications Evidence&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Partner/SI Delivery Checklist
&lt;/h3&gt;

&lt;p&gt;For partners and SIs proposing this pattern to enterprise customers, a structured delivery checklist is available covering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Customer workload classification&lt;/strong&gt; — SAP-adjacent / file server / regulated records / AI analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger mode selection&lt;/strong&gt; — POLLING / EVENT_DRIVEN / HYBRID based on latency and durability requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment profile&lt;/strong&gt; — PoC / Production / Compliance-sensitive with clear boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access model design&lt;/strong&gt; — IAM + S3 AP policy + ONTAP file permissions (dual-layer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network model&lt;/strong&gt; — Private VPC / VPC Origin AP / Cross-Account / Shared Services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating model&lt;/strong&gt; — Customer-operated / partner-operated / managed service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success criteria&lt;/strong&gt; — Latency, throughput, cost, auditability, recovery behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The checklist also includes a 4-phase PoC implementation guide (environment prep → POLLING verification → EVENT_DRIVEN verification → evaluation) and an FAQ for common partner questions.&lt;/p&gt;

&lt;p&gt;Full checklist: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/partner-si-delivery-checklist.md" rel="noopener noreferrer"&gt;Partner/SI Delivery Checklist&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Who should care about Phase 10?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform teams&lt;/strong&gt; get an event-driven alternative to polling — near-real-time latency instead of hourly polling intervals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt; get StackSets compatibility validation ensuring no hardcoded account IDs leak across environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations teams&lt;/strong&gt; get workload-appropriate alarm thresholds that reduce alert fatigue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance teams&lt;/strong&gt; get fewer off-hours polling invocations through business-hours scheduling, with savings surfaced as a CloudWatch metric&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage teams&lt;/strong&gt; get a documented FPolicy integration pattern with protocol-level verification results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account teams&lt;/strong&gt; get ready-to-deploy StackSets admin/execution roles with Organization-scoped trust&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partners and SIs&lt;/strong&gt; get a PoC-ready event-driven alternative for customers who cannot wait for native S3AP notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulated workload owners&lt;/strong&gt; get a clear event-durability boundary: near-real-time by default, Persistent Store required when event loss is unacceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAP / ERP teams&lt;/strong&gt; get a pattern for connecting peripheral files (IDoc, HULFT, batch output) to AWS AI/analytics without changing existing file interfaces&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phase 10 solves the problem that has been deferred since Phase 1: how do you get event-driven processing from FSx for ONTAP when S3AP native notifications don't exist?&lt;/p&gt;

&lt;p&gt;The answer is ONTAP FPolicy — a mature notification framework that predates S3 Access Points by over a decade. By connecting it to ECS Fargate → SQS → EventBridge, Phase 10 established the shared event-ingestion pipeline and the &lt;code&gt;TriggerMode&lt;/code&gt; parameter foundation needed to support polling, event-driven, and hybrid modes. UC-specific dispatch remains the main Phase 11 focus. The default remains &lt;code&gt;POLLING&lt;/code&gt;, so existing deployments are unaffected.&lt;/p&gt;

&lt;p&gt;The E2E verification confirmed that NFSv3, NFSv4.0, NFSv4.1 (ONTAP 9.15.1+), and SMB all work. NFSv4.2 does not — and the most common failure mode is &lt;code&gt;mount -o vers=4&lt;/code&gt; silently negotiating to 4.2. This is now documented and the setup guide recommends explicit version pinning.&lt;/p&gt;

&lt;p&gt;Beyond FPolicy, Phase 10 matures the operational model: StackSets deployment compatibility for multi-account distribution, alarm profiles for workload-appropriate monitoring, and cost scheduling for environments that don't need 24/7 polling. Combined with the 6-validator CI pipeline and 1044+ passing tests, the pattern library is ready for production-style multi-account template distribution, while runtime cross-account data-path validation remains Phase 11+ work.&lt;/p&gt;




&lt;h3&gt;
  
  
  Design Guides
&lt;/h3&gt;

&lt;p&gt;The following design guides have been added to the repository:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/s3ap-authorization-model.md" rel="noopener noreferrer"&gt;S3AP Authorization Model&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Dual-layer authorization (IAM + file system)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/deployment-profiles.md" rel="noopener noreferrer"&gt;Deployment Profiles&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;PoC / Production / Compliance-sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/trigger-mode-decision-guide.md" rel="noopener noreferrer"&gt;Trigger Mode Decision Guide&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;POLLING / EVENT_DRIVEN / HYBRID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/enterprise-workload-examples.md" rel="noopener noreferrer"&gt;Enterprise Workload Examples&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;SAP, EDI, audit, batch output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/s3ap-performance-considerations.md" rel="noopener noreferrer"&gt;S3AP Performance&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Throughput, Lambda sizing, concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/aws-feature-requests/native-s3ap-notifications-evidence.md" rel="noopener noreferrer"&gt;Native Notifications Evidence&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Feature request evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/partner-si-delivery-checklist.md" rel="noopener noreferrer"&gt;Partner/SI Delivery Checklist&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Partner/SI proposal and delivery guide&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Update Note
&lt;/h3&gt;

&lt;p&gt;This article describes the Phase 10 baseline — the first verified shared FPolicy ingestion pipeline. The event-driven pipeline is expanded across all 17 UCs in Phase 11 and operationally hardened in Phase 12 with Persistent Store replay validation, SLO observability, capacity guardrails, and secrets rotation. Phase 13 adds FlexClone/FlexCache serverless automation.&lt;/p&gt;

&lt;p&gt;Use the FPolicy event-driven mode for PoC and near-real-time ingestion. For regulated or compliance-sensitive workloads, evaluate Persistent Store, replay handling, deduplication, and operational runbooks before treating the pipeline as durable. See &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns/blob/main/docs/deployment-profiles.md" rel="noopener noreferrer"&gt;Deployment Profiles&lt;/a&gt; for guidance.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Previous phases&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;Phase 1&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/public-sector-use-cases-unified-output-destination-and-a-localization-batch-fsx-for-ontap-s3-2hmo"&gt;Phase 7&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/operational-hardening-ci-grade-validation-and-pattern-c-b-hybrid-fsx-for-ontap-s3-access-587h"&gt;Phase 8&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/production-rollout-vpc-endpoint-auto-detection-and-the-cdk-no-go-fsx-for-ontap-s3-access-3lni"&gt;Phase 9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next phases&lt;/strong&gt;: Phase 11 (UC-specific dispatch) · Phase 12 (Persistent Store replay + SLO hardening) · Phase 13 (FlexClone/FlexCache automation)&lt;/p&gt;

</description>
      <category>aws</category>
      <category>amazonfsxfornetappontap</category>
      <category>serverless</category>
      <category>fpolicy</category>
    </item>
  </channel>
</rss>
