TL;DR
FSx for ONTAP file access audit logs are usually consumed through EC2-based patterns — mounted audit volumes and agent-based forwarders such as Splunk Universal Forwarder. This series explores an EC2-free alternative: configure ONTAP to write audit logs to an audit volume, expose that volume through an FSx for ONTAP S3 Access Point, use EventBridge Scheduler to invoke Lambda, and ship normalized events to observability platforms such as Datadog, Splunk, New Relic, Grafana Cloud, Elastic, and OpenTelemetry-compatible backends.
What This Post Covers
This post introduces the architecture and the open-source pattern library. It does not yet cover:
- Full Datadog deployment walkthrough (Part 2)
- Vendor-specific field mappings
- Cost/performance benchmarking
- ARP + EMS webhook + Datadog alerting (Part 3)
- FPolicy binary protocol internals (future post)
The Problem Nobody Talks About
You're running Amazon FSx for NetApp ONTAP. You've enabled file access auditing because compliance requires it — or because you genuinely want to know who's accessing what on your file shares.
But where do those audit logs go?
If you followed the official AWS blog post, you likely ended up with EC2-based collectors: syslog-ng for cluster/admin audit forwarding, and a mounted audit volume plus Splunk Universal Forwarder for file access audit logs. It works. But now you have:
- EC2 instances to patch and maintain
- NFS mounts to the audit volume
- syslog-ng configuration for admin audit forwarding
- Splunk Universal Forwarder configuration for file access logs
- A single point of failure unless you build your own HA pattern
- Vendor lock-in to Splunk's agent-based model
What if you could replace that EC2-based collector pattern with managed services — Lambda reads audit logs via S3 APIs, no NFS mount required — and ship to any observability platform?
That's the goal of this project.
Important Distinction: Two Types of ONTAP Audit
Before diving in, a clarification. FSx for ONTAP has two distinct audit mechanisms:
Cluster/admin activity audit logs — Administrative operations (CLI/API commands). These are forwarded via syslog to a log destination, as described in the AWS blog.
File access audit logs — SMB/NFS file operations (open, read, write, delete, permission changes). These are recorded according to ONTAP audit policies and SACLs/NFSv4 ACLs and stored on an audit volume inside the SVM, in EVTX or XML format depending on your ONTAP audit configuration.
In this series, "audit logs" refers to file access audit logs (type 2). The cluster admin audit forwarding via syslog is a separate concern.
The EC2-Free Alternative
I'm building an open-source pattern library that targets 9 observability vendors using Lambda, EventBridge Scheduler, and ECS Fargate — eliminating the need for self-managed EC2 instances.
This is EC2-free, not necessarily Lambda-only:
- The audit-log and EMS paths are Lambda patterns (scheduled and event-driven respectively).
- The FPolicy path uses ECS Fargate because ONTAP FPolicy requires a persistent TCP listener.
How Audit Logs Flow
ONTAP's file access auditing writes rotated audit log files to a configured destination path inside the SVM. In this project, that destination is an audit volume exposed through an FSx for ONTAP S3 Access Point. Lambda does not mount NFS or SMB; it reads the rotated audit log files through S3 APIs.
Because FSx for ONTAP S3 Access Points cannot be relied on for S3 ObjectCreated event notifications, the audit processor is invoked on a schedule and uses checkpointing to process only newly rotated log files.
FSx for ONTAP audit configuration (`vserver audit`)
│
▼ audit logs written to /audit volume
Audit volume exposed via FSx for ONTAP S3 Access Point
│
▼ EventBridge Scheduler (periodic invocation)
Lambda audit processor (Python 3.12)
│
▼ parse EVTX/XML → normalize → vendor API
Datadog / Splunk / New Relic / Grafana / Elastic / ...
The key shift from the EC2 pattern: Lambda does not mount the audit volume over NFS or SMB. It reads rotated ONTAP audit log files through an FSx for ONTAP S3 Access Point using S3 APIs, while the data itself remains on the FSx for ONTAP file system.
A note on FSx for ONTAP S3 Access Points: FSx for ONTAP S3 Access Points let applications use S3 APIs to access data that still resides on FSx for ONTAP volumes. They are excellent as a serverless access boundary, but they are not the same as standard S3 buckets. In particular, you should not rely on S3 ObjectCreated notifications from an FSx for ONTAP S3 Access Point. Instead, this project uses EventBridge Scheduler plus checkpointing to discover and process newly rotated audit log files.
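To make the schedule-plus-checkpoint flow concrete, here is a minimal sketch of the scheduled processor. The resource names, checkpoint storage, and helper functions are illustrative assumptions, not the project's actual code:

```python
# Hypothetical names -- not the project's actual resources or parameters.
ACCESS_POINT_ARN = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/fsxn-audit-ap"

def select_new_keys(all_keys, last_processed):
    """Return keys strictly after the checkpoint, in lexicographic order.
    ONTAP's rotated audit file names embed timestamps, so lexicographic
    order generally tracks rotation order."""
    return sorted(k for k in all_keys if k > last_processed)

def handler(event, context):
    """Scheduled entry point: list rotated audit files, process only new ones."""
    import boto3  # imported lazily so the pure helper above is testable without the SDK
    s3 = boto3.client("s3")
    last_processed = load_checkpoint()  # e.g. from SSM Parameter Store or DynamoDB
    keys = []
    # An S3 Access Point ARN is accepted wherever boto3 expects a bucket name.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=ACCESS_POINT_ARN):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    for key in select_new_keys(keys, last_processed):
        raw = s3.get_object(Bucket=ACCESS_POINT_ARN, Key=key)["Body"].read()
        ship_to_vendor(parse_audit_file(raw))  # parse EVTX/XML -> normalize -> send
        save_checkpoint(key)                   # advance the checkpoint per file

def load_checkpoint():
    return ""  # placeholder: read the last-processed key from durable storage

def save_checkpoint(key):
    ...  # placeholder: persist the last-processed key

def parse_audit_file(raw):
    ...  # placeholder: EVTX/XML parsing (handled by the shared Lambda layer)

def ship_to_vendor(events):
    ...  # placeholder: vendor-specific shipping
```

Advancing the checkpoint per file (rather than once per invocation) keeps a mid-batch failure from silently skipping files on the next run.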
Three Event Sources, One Architecture
FSx for ONTAP generates observability data through three distinct channels:
1. File Access Audit Logs (FSx for ONTAP S3 AP)
Depending on your ONTAP audit configuration and SACL/NFSv4 ACL settings, file operations such as create, delete, read, write, and permission changes can be recorded as ONTAP audit logs in EVTX or XML format.
- Delivery: ONTAP writes rotated audit log files to an audit volume inside the SVM
- Access path: Lambda reads those files through an FSx for ONTAP S3 Access Point
- Trigger: EventBridge Scheduler invokes Lambda periodically; Lambda uses checkpointing to process newly rotated files
- Compute: Lambda (scheduled, pay-per-invocation)
- Latency: Near-real-time rather than sub-second streaming. End-to-end latency depends on your ONTAP audit log rotation interval and the EventBridge Scheduler frequency.
- Use case: Compliance auditing, access pattern analysis, data governance
2. EMS (Event Management System) Webhooks
ONTAP's built-in event system can push critical alerts via HTTP webhooks. This includes:
- Autonomous Ransomware Protection (ARP) alerts — ONTAP detects encryption patterns and fires an event
- Quota threshold violations
- Hardware failures
- Replication issues
- Delivery: ONTAP pushes HTTPS webhooks to API Gateway
- Trigger: API Gateway invocation (event-driven)
- Compute: Lambda (behind API Gateway)
- Use case: Security alerting, operational monitoring
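A minimal sketch of what the receiver behind API Gateway might look like. The EMS payload field names and the severity gate are assumptions for illustration; real EMS payloads vary by event and ONTAP version:

```python
import json

# Hypothetical severity gate: which EMS severities are forwarded as alerts.
ALERT_SEVERITIES = {"emergency", "alert", "error"}

def normalize_ems(payload):
    """Map an EMS webhook payload into the pipeline's common event shape.
    The field names read here are illustrative assumptions."""
    return {
        "source": "fsxn-ems",
        "event_name": payload.get("message_name", "unknown"),
        "severity": str(payload.get("severity", "notice")).lower(),
        "text": payload.get("event", ""),
    }

def handler(event, context):
    """API Gateway (Lambda proxy integration) entry point for EMS webhooks."""
    message = normalize_ems(json.loads(event.get("body") or "{}"))
    if message["severity"] in ALERT_SEVERITIES:
        forward_to_vendor(message)  # placeholder: vendor alert/log API call
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}

def forward_to_vendor(message):
    ...  # placeholder: vendor-specific shipping
```

Returning `202` quickly and doing the vendor call asynchronously (or keeping it short) matters here, since ONTAP treats the webhook as fire-and-forget.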
3. FPolicy (File Policy) Events
FPolicy intercepts file operations at the protocol level (CIFS/NFS) and forwards them in real-time via a proprietary TCP protocol. Unlike the other two sources, FPolicy requires a persistent TCP listener — which is why this path uses ECS Fargate rather than Lambda.
- Delivery: ONTAP connects to Fargate task via TCP:9898
- Trigger: Fargate receives FPolicy events → enqueues to SQS → Lambda processes
- Compute: ECS Fargate (TCP listener) + Lambda (vendor shipping)
- Use case: File activity monitoring, DLP, suspicious behavior detection
Note: The FPolicy path is the one exception to the "pure Lambda" model. ONTAP's FPolicy protocol is a proprietary binary format over TCP — it cannot be received by API Gateway or Lambda directly. Fargate handles the protocol translation, then hands off to Lambda via SQS for the vendor-specific shipping. It's still EC2-free, but not entirely serverless in the strictest sense.
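The Fargate side can be pictured as a plain accept loop that translates and hands off. This sketch deliberately stubs out the proprietary FPolicy framing (`decode_frames` is a placeholder, not an implementation of the protocol):

```python
import json
import socket
import threading

FPOLICY_PORT = 9898  # the port ONTAP connects to in this architecture

def to_sqs_body(event):
    """Serialize one decoded FPolicy event for the SQS handoff to Lambda."""
    return json.dumps({"source": "fsxn-fpolicy", "event": event})

def serve(enqueue, port=FPOLICY_PORT):
    """Persistent accept loop running in the Fargate task. The real FPolicy
    wire format is a proprietary binary protocol; decode_frames() below is a
    stand-in for that translation layer."""
    listener = socket.create_server(("0.0.0.0", port))
    while True:
        conn, _addr = listener.accept()
        threading.Thread(target=handle, args=(conn, enqueue), daemon=True).start()

def handle(conn, enqueue):
    with conn:
        for event in decode_frames(conn):  # placeholder for protocol translation
            enqueue(to_sqs_body(event))    # hand off to SQS for the shipper Lambda

def decode_frames(conn):
    """Placeholder: yield decoded event dicts from the FPolicy TCP stream."""
    return iter(())
```

The SQS hop between Fargate and Lambda is what keeps the listener simple: it only decodes and enqueues, while retries, DLQ handling, and vendor shipping stay in the Lambda layer.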
The Architecture
Each event source feeds into the same delivery pattern:
┌─────────────────────────────────────────────────────────────────┐
│ FSx for ONTAP │
├──────────────┬──────────────────────┬───────────────────────────┤
│ File Access │ EMS Webhook │ FPolicy (TCP:9898) │
│ Audit Logs │ │ │
└──────┬───────┴──────────┬───────────┴───────────┬───────────────┘
│ │ │
▼ ▼ ▼
FSx S3 AP + API Gateway ECS Fargate
Scheduler │ │
│ ▼ ▼
▼ Lambda (EMS) SQS → Lambda
Lambda (parser) │ │
│ │ │
└──────────────────┼───────────────────────┘
▼
Observability Vendor API
(Datadog, Splunk, New Relic, ...)
Each integration packages the parser and vendor shipper together in a single Lambda, but the pattern is the same: normalize ONTAP events, then send them to the vendor API. Swap the integration Lambda, and you switch vendors. Vendor-specific Lambdas are optimized for quick adoption and native API behavior, while the OpenTelemetry integration provides a vendor-neutral path for organizations standardizing on OTLP.
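That swap-the-shipper idea boils down to a small interface: one generic HTTP sender plus one encoder per vendor. The adapters below are simplified sketches; confirm the exact payload envelopes against each vendor's ingest API documentation before relying on them:

```python
import json
import urllib.request

def ship(events, *, url, headers, encode):
    """POST a batch of normalized events to a vendor ingest endpoint.
    `encode` adapts the common event shape to the vendor's payload format,
    which is the only part that changes when you swap vendors."""
    req = urllib.request.Request(url, data=encode(events).encode("utf-8"),
                                 headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Illustrative adapters -- payload shapes are simplified for the sketch.
def encode_datadog(events):
    return json.dumps([{"ddsource": "fsxn", **e} for e in events])

def encode_splunk_hec(events):
    # Splunk HEC accepts concatenated JSON event envelopes.
    return "\n".join(json.dumps({"event": e, "sourcetype": "fsxn:audit"})
                     for e in events)
```

Because only the `encode` callable (plus endpoint and auth headers) differs, the parser and checkpoint logic stay identical across all nine integrations.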
The Gotcha That Cost Me a Day
Here's something that isn't immediately obvious from the documentation:
In my validation, a Lambda function placed in a VPC with only an S3 Gateway Endpoint could not read from the FSx for ONTAP S3 Access Point and timed out. Adding NAT Gateway egress resolved the issue.
This gotcha matters because this project intentionally reads audit logs through FSx for ONTAP S3 Access Points rather than mounting the audit volume over NFS/SMB from an EC2 instance.
Tested with:
- Lambda in private subnets (ap-northeast-1)
- FSx for ONTAP S3 Access Point attached to an FSx volume
- S3 Gateway VPC Endpoint only
- No NAT Gateway
- Failure mode: timeout (no response, not AccessDenied)
Your options:
| Lambda Placement | FSx for ONTAP S3 AP Access | Recommendation |
|---|---|---|
| Outside VPC | ✅ Works | Simplest for read-only access |
| VPC + NAT Gateway | ✅ Works | Production recommended |
| VPC + S3 Gateway EP only | ❌ Timeout | Not recommended based on this validation |
This is based on my validation environment (ap-northeast-1). Always test the network path in your own account and Region, as AWS may update this behavior.
Target Vendors
The project targets 9 observability platforms. Datadog is fully verified end-to-end (the subject of Parts 2 and 3 of this series). The remaining vendors have initial implementations that I'll be verifying and writing about in upcoming posts:
| Vendor | Delivery Method | Status |
|---|---|---|
| Datadog | Logs API v2 | ✅ E2E verified |
| Splunk | HEC (HTTP Event Collector) | 🧪 Implementation ready, verification planned |
| New Relic | Log API v1 | 🧪 Implementation ready, verification planned |
| Grafana Cloud | Loki Push API | 🧪 Implementation ready, verification planned |
| Elastic | Bulk API | 🧪 Implementation ready, verification planned |
| Dynatrace | Log Ingest API v2 | 🧪 Implementation ready, verification planned |
| Sumo Logic | HTTP Source | 🧪 Implementation ready, verification planned |
| Honeycomb | Events Batch API | 🧪 Implementation ready, verification planned |
| OpenTelemetry | OTLP/HTTP (vendor-neutral) | 🧪 Implementation ready, verification planned |
Status definitions:
- ✅ E2E verified — Deployed and validated with real FSx for ONTAP audit logs
- 🧪 Implementation ready — Code and CloudFormation available; E2E validation pending
- 🚧 Planned — Design exists; implementation pending
Each vendor integration is designed as a self-contained CloudFormation stack with its own Lambda, IAM roles, DLQ, and CloudWatch alarms. As I verify each one, I'll publish a dedicated article with the results and any vendor-specific gotchas I encounter.
What's in the Repo
The project is structured for easy adoption:
fsxn-observability-integrations/
├── integrations/
│ ├── datadog/ # ✅ Verified: Lambda + CFn + tests + docs
│ ├── splunk-serverless/ # 🧪 Implementation ready
│ ├── new-relic/ # 🧪 Implementation ready
│ ├── grafana/ # 🧪 Implementation ready
│ ├── elastic/ # 🧪 Implementation ready
│ ├── dynatrace/ # 🧪 Implementation ready
│ ├── sumo-logic/ # 🧪 Implementation ready
│ ├── honeycomb/ # 🧪 Implementation ready
│ └── otel-collector/ # 🧪 Implementation ready
├── shared/
│ ├── lambda-layers/ # Reusable log parser (EVTX/XML) + S3 AP reader
│ ├── templates/ # Prerequisites CFn (EventBridge Scheduler, IAM)
│ └── scripts/ # Deploy + test utilities
└── docs/ # Bilingual (EN/JA) documentation
The shared infrastructure (EventBridge Scheduler, log parser layer, IAM roles) is vendor-agnostic and already proven through the Datadog verification. Each vendor directory follows the same structure, so once you understand one, you understand them all. Each stack is designed to include DLQ, CloudWatch alarms, and operational visibility out of the box; the Datadog stack also includes the verified CloudWatch operational dashboard used during E2E validation.
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
Related Posts
If you've been following my FSx for ONTAP S3 Access Points series, this project builds directly on those foundations:
- FSx for ONTAP S3 Access Points as a Serverless Automation Boundary — Where this journey started: using S3 APs as the bridge between ONTAP and serverless
- Production-Ready FPolicy Event Pipeline Across 17 UCs — Phase 11 — The FPolicy pipeline that feeds into this observability project
- Near-Real-Time Processing, ML Inference, and Observability — Phase 3 — Early architecture patterns that evolved into this multi-vendor approach
This observability integrations project is the natural next step: taking those serverless patterns and applying them specifically to audit log shipping across multiple vendors.
Design Considerations
Based on early feedback, here are key points for different audiences:
Design philosophy: The goal is not just to remove EC2. The goal is to move undifferentiated collector operations into managed services, make failures observable and replayable, and keep the integration layer small enough for customers to operate themselves.
Where this pattern matters: This pattern is especially useful for enterprise file workloads where auditability matters but EC2-based collectors add operational overhead —
departmental file shares, enterprise application interface directories such as SAP, Oracle, or SQL Server adjacent file shares, VDI/EUC home directories, engineering and design repositories, regulated file repositories, and ransomware investigation workflows.
Non-intrusive by design: This pipeline observes audit logs after ONTAP records them; it does not sit in the application data path. NFS/SMB access patterns are unchanged. No application code changes are required.
Telemetry ownership: This pattern treats ONTAP as the authoritative source of file activity telemetry, while AWS managed services provide the event processing and delivery layer.
Compliance note: This pattern helps centralize and analyze audit events, but retention, immutability, and regulatory controls should be designed according to your organization's compliance requirements. This is an audit log delivery pattern, not a compliance certification. For audit evidence, consider separately how long raw EVTX/XML files should be retained on the audit volume or archived outside the observability pipeline.
Audit policy dependency: The quality and volume of events depend heavily on your ONTAP audit policy, SACLs, NFSv4 ACLs, and rotation interval. Enabling read auditing on high-traffic volumes can produce significant log volume — design your audit policy carefully.
Cost variables: The biggest cost factors are audit event volume, log rotation frequency, EventBridge Scheduler frequency, Lambda runtime, NAT Gateway usage (if Lambda is in VPC), and vendor ingest pricing. Compared to the EC2 pattern, you trade always-on instance cost for pay-per-invocation compute and vendor-ingest-driven cost.
Multi-account deployment: This pattern can be deployed per workload account or centralized into a logging/security account, depending on your organization's landing zone design.
Reliability: The stack includes DLQ for failed events, CloudWatch alarms for error/throttle detection, and checkpointing to avoid reprocessing already completed audit log files. Delivery to external vendor APIs should be treated as at-least-once; DLQ messages can be replayed after resolving the root cause.
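Replaying after a fixed root cause can be as simple as draining the DLQ back through the shipper. A generic sketch of that loop (the SQS wiring assumes a queue URL you supply; it is not taken from the project):

```python
def drain(receive, process, delete):
    """Generic at-least-once replay loop: process each message, delete only on
    success. A message whose process() raises stays on the queue and
    reappears after its visibility timeout."""
    replayed = 0
    while True:
        batch = receive()
        if not batch:
            return replayed
        for msg in batch:
            process(msg["Body"])
            delete(msg["ReceiptHandle"])
            replayed += 1

def drain_sqs_dlq(queue_url, process):
    """Wire the loop to a real SQS DLQ (queue_url is a placeholder)."""
    import boto3  # imported lazily so drain() stays testable without the SDK
    sqs = boto3.client("sqs")
    receive = lambda: sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10,
        WaitTimeSeconds=1).get("Messages", [])
    delete = lambda handle: sqs.delete_message(
        QueueUrl=queue_url, ReceiptHandle=handle)
    return drain(receive, process, delete)
```

Deleting each message only after `process()` succeeds is what makes the replay at-least-once: a crash mid-drain re-delivers unacknowledged messages rather than losing them.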
What's Coming Next
This is Part 1 of a series. In the upcoming posts, I'll deep-dive into:
- Part 2: Implementing the Datadog integration end-to-end — from CloudFormation to seeing logs in the Datadog Log Explorer
- Part 3: Event-driven ransomware detection using ONTAP's Autonomous Ransomware Protection (ARP) + EMS webhooks + Datadog alerting
Beyond this Datadog series, I'll be verifying and writing about each vendor integration as I go:
- Replacing the EC2-based Splunk pattern with Lambda + HEC
- OpenTelemetry as the vendor-neutral escape hatch
- Grafana Cloud + Loki for the open-source stack
- And more — each with its own E2E verification and lessons learned
The goal is to build a comprehensive, battle-tested pattern library where you can pick your vendor and deploy with confidence. Follow along as I work through each one.
Try It Yourself
The Datadog integration is fully verified and ready to deploy. You'll need:
- An FSx for ONTAP file system with audit logging enabled
- An FSx for ONTAP S3 Access Point attached to the audit volume
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
cd fsxn-observability-integrations
# Deploy Datadog integration
# (FsxS3AccessPointArn = your FSx for ONTAP S3 Access Point ARN)
aws cloudformation deploy \
--template-file integrations/datadog/template.yaml \
--stack-name fsxn-datadog-integration \
--parameter-overrides \
FsxS3AccessPointArn=<your-fsx-s3-ap-arn> \
DatadogApiKeySecretArn=<your-secret-arn> \
--capabilities CAPABILITY_NAMED_IAM
This stack deploys the scheduled Lambda processor, IAM permissions for reading from the FSx for ONTAP S3 Access Point, checkpoint storage, DLQ, CloudWatch alarms, and the Datadog shipping logic. The processor keeps track of already-processed audit log files so each scheduled invocation only ships newly rotated logs.
After deployment, you should see:
- EventBridge Scheduler invoking the Lambda processor on your configured interval
- Checkpoint storage updated after processing rotated audit logs
- Parsed FSx for ONTAP audit events arriving in Datadog Logs (`source:fsxn`)
- CloudWatch alarms and DLQ ready for operational visibility
Full setup guide in the repo's Prerequisites doc.
Have questions or want to see a specific vendor integration verified next? Drop a comment below — it'll help me prioritize the series.
Next up: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way