TL;DR
FSx for ONTAP file access audit logs are usually consumed through EC2-based patterns — mounted audit volumes and agent-based forwarders such as Splunk Universal Forwarder. This series explores an EC2-free alternative: configure ONTAP to write audit logs to an audit volume, expose that volume through an FSx for ONTAP S3 Access Point, use EventBridge Scheduler to invoke Lambda, and ship normalized events to observability platforms such as Datadog, Splunk, New Relic, Grafana Cloud, Elastic, and OpenTelemetry-compatible backends.
What This Post Covers
This post introduces the architecture and the open-source pattern library. It does not yet cover:
- Full Datadog deployment walkthrough (Part 2)
- Vendor-specific field mappings
- Cost/performance benchmarking
- ARP + EMS webhook + Datadog alerting (Part 3)
- FPolicy binary protocol internals (future post)
The Problem Nobody Talks About
You're running Amazon FSx for NetApp ONTAP. You've enabled file access auditing because compliance requires it — or because you genuinely want to know who's accessing what on your file shares.
But where do those audit logs go?
If you followed the official AWS blog post, you likely ended up with EC2-based collectors: syslog-ng for cluster/admin audit forwarding, and a mounted audit volume plus Splunk Universal Forwarder for file access audit logs. It works. But now you have:
- EC2 instances to patch and maintain
- NFS mounts to the audit volume
- syslog-ng configuration for admin audit forwarding
- Splunk Universal Forwarder configuration for file access logs
- A single point of failure unless you build your own HA pattern
- Vendor lock-in to Splunk's agent-based model
What if you could replace that EC2-based collector pattern with managed services — Lambda reads audit logs via S3 APIs, no NFS mount required — and ship to any observability platform?
That's the goal of this project.
Important Distinction: Two Types of ONTAP Audit
Before diving in, a clarification. FSx for ONTAP has two distinct audit mechanisms:
Cluster/admin activity audit logs — Administrative operations (CLI/API commands). These are forwarded via syslog to a log destination, as described in the AWS blog.
File access audit logs — SMB/NFS file operations (open, read, write, delete, permission changes). These are recorded according to ONTAP audit policies and SACLs/NFSv4 ACLs and stored on an audit volume inside the SVM, in EVTX or XML format depending on your ONTAP audit configuration.
In this series, "audit logs" refers to file access audit logs (type 2). The cluster admin audit forwarding via syslog is a separate concern.
The EC2-Free Alternative
I'm building an open-source pattern library that targets 9 observability vendors using Lambda, EventBridge Scheduler, and ECS Fargate — eliminating the need for self-managed EC2 instances.
This is EC2-free, not necessarily Lambda-only:
- The audit-log and EMS paths are Lambda patterns (scheduled and event-driven respectively).
- The FPolicy path uses ECS Fargate because ONTAP FPolicy requires a persistent TCP listener.
How Audit Logs Flow
ONTAP's file access auditing writes rotated audit log files to a configured destination path inside the SVM. In this project, that destination is an audit volume exposed through an FSx for ONTAP S3 Access Point. Lambda does not mount NFS or SMB; it reads the rotated audit log files through S3 APIs.
Because FSx for ONTAP S3 Access Points cannot be relied on for S3 ObjectCreated event notifications, the audit processor is invoked on a schedule and uses checkpointing to process only newly rotated log files.
FSx for ONTAP audit configuration (`vserver audit`)
│
▼ audit logs written to /audit volume
Audit volume exposed via FSx for ONTAP S3 Access Point
│
▼ EventBridge Scheduler (periodic invocation)
Lambda audit processor (Python 3.12)
│
▼ parse EVTX/XML → normalize → vendor API
Datadog / Splunk / New Relic / Grafana / Elastic / ...
The key shift from the EC2 pattern: Lambda does not mount the audit volume over NFS or SMB. It reads rotated ONTAP audit log files through an FSx for ONTAP S3 Access Point using S3 APIs, while the data itself remains on the FSx for ONTAP file system.
A note on FSx for ONTAP S3 Access Points: FSx for ONTAP S3 Access Points let applications use S3 APIs to access data that still resides on FSx for ONTAP volumes. They are excellent as a serverless access boundary, but they are not the same as standard S3 buckets. In particular, you should not rely on S3 ObjectCreated notifications from an FSx for ONTAP S3 Access Point. Instead, this project uses EventBridge Scheduler plus checkpointing to discover and process newly rotated audit log files.
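To make the schedule-plus-checkpoint flow concrete, here is a minimal sketch of the scheduled processor. The resource names, checkpoint storage, and helper functions are illustrative assumptions, not the project's actual code:

```python
# Hypothetical names -- not the project's actual resources or parameters.
ACCESS_POINT_ARN = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/fsxn-audit-ap"

def select_new_keys(all_keys, last_processed):
    """Return keys strictly after the checkpoint, in lexicographic order.
    ONTAP's rotated audit file names embed timestamps, so lexicographic
    order generally tracks rotation order."""
    return sorted(k for k in all_keys if k > last_processed)

def handler(event, context):
    """Scheduled entry point: list rotated audit files, process only new ones."""
    import boto3  # imported lazily so the pure helper above is testable without the SDK
    s3 = boto3.client("s3")
    last_processed = load_checkpoint()  # e.g. from SSM Parameter Store or DynamoDB
    keys = []
    # An S3 Access Point ARN is accepted wherever boto3 expects a bucket name.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=ACCESS_POINT_ARN):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    for key in select_new_keys(keys, last_processed):
        raw = s3.get_object(Bucket=ACCESS_POINT_ARN, Key=key)["Body"].read()
        ship_to_vendor(parse_audit_file(raw))  # parse EVTX/XML -> normalize -> send
        save_checkpoint(key)                   # advance the checkpoint per file

def load_checkpoint():
    return ""  # placeholder: read the last-processed key from durable storage

def save_checkpoint(key):
    ...  # placeholder: persist the last-processed key

def parse_audit_file(raw):
    ...  # placeholder: EVTX/XML parsing (handled by the shared Lambda layer)

def ship_to_vendor(events):
    ...  # placeholder: vendor-specific shipping
```

Advancing the checkpoint per file (rather than once per invocation) keeps a mid-batch failure from silently skipping files on the next run.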
Three Event Sources, One Architecture
FSx for ONTAP generates observability data through three distinct channels:
1. File Access Audit Logs (FSx for ONTAP S3 AP)
Depending on your ONTAP audit configuration and SACL/NFSv4 ACL settings, file operations such as create, delete, read, write, and permission changes can be recorded as ONTAP audit logs in EVTX or XML format.
- Delivery: ONTAP writes rotated audit log files to an audit volume inside the SVM
- Access path: Lambda reads those files through an FSx for ONTAP S3 Access Point
- Trigger: EventBridge Scheduler invokes Lambda periodically; Lambda uses checkpointing to process newly rotated files
- Compute: Lambda (scheduled, pay-per-invocation)
- Latency: Near-real-time rather than sub-second streaming. End-to-end latency depends on your ONTAP audit log rotation interval and the EventBridge Scheduler frequency.
- Use case: Compliance auditing, access pattern analysis, data governance
2. EMS (Event Management System) Webhooks
ONTAP's built-in event system can push critical alerts via HTTP webhooks. This includes:
- Autonomous Ransomware Protection (ARP) alerts — ONTAP detects encryption patterns and fires an event
- Quota threshold violations
- Hardware failures
- Replication issues
- Delivery: ONTAP pushes HTTPS webhooks to API Gateway
- Trigger: API Gateway invocation (event-driven)
- Compute: Lambda (behind API Gateway)
- Use case: Security alerting, operational monitoring
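A minimal sketch of what the receiver behind API Gateway might look like. The EMS payload field names and the severity gate are assumptions for illustration; real EMS payloads vary by event and ONTAP version:

```python
import json

# Hypothetical severity gate: which EMS severities are forwarded as alerts.
ALERT_SEVERITIES = {"emergency", "alert", "error"}

def normalize_ems(payload):
    """Map an EMS webhook payload into the pipeline's common event shape.
    The field names read here are illustrative assumptions."""
    return {
        "source": "fsxn-ems",
        "event_name": payload.get("message_name", "unknown"),
        "severity": str(payload.get("severity", "notice")).lower(),
        "text": payload.get("event", ""),
    }

def handler(event, context):
    """API Gateway (Lambda proxy integration) entry point for EMS webhooks."""
    message = normalize_ems(json.loads(event.get("body") or "{}"))
    if message["severity"] in ALERT_SEVERITIES:
        forward_to_vendor(message)  # placeholder: vendor alert/log API call
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}

def forward_to_vendor(message):
    ...  # placeholder: vendor-specific shipping
```

Returning `202` quickly and doing the vendor call asynchronously (or keeping it short) matters here, since ONTAP treats the webhook as fire-and-forget.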
3. FPolicy (File Policy) Events
FPolicy intercepts file operations at the protocol level (CIFS/NFS) and forwards them in real-time via a proprietary TCP protocol. Unlike the other two sources, FPolicy requires a persistent TCP listener — which is why this path uses ECS Fargate rather than Lambda.
- Delivery: ONTAP connects to Fargate task via TCP:9898
- Trigger: Fargate receives FPolicy events → enqueues to SQS → Lambda processes
- Compute: ECS Fargate (TCP listener) + Lambda (vendor shipping)
- Use case: File activity monitoring, DLP, suspicious behavior detection
Note: The FPolicy path is the one exception to the "pure Lambda" model. ONTAP's FPolicy protocol is a proprietary binary format over TCP — it cannot be received by API Gateway or Lambda directly. Fargate handles the protocol translation, then hands off to Lambda via SQS for the vendor-specific shipping. It's still EC2-free, but not entirely serverless in the strictest sense.
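The Fargate side can be pictured as a plain accept loop that translates and hands off. This sketch deliberately stubs out the proprietary FPolicy framing (`decode_frames` is a placeholder, not an implementation of the protocol):

```python
import json
import socket
import threading

FPOLICY_PORT = 9898  # the port ONTAP connects to in this architecture

def to_sqs_body(event):
    """Serialize one decoded FPolicy event for the SQS handoff to Lambda."""
    return json.dumps({"source": "fsxn-fpolicy", "event": event})

def serve(enqueue, port=FPOLICY_PORT):
    """Persistent accept loop running in the Fargate task. The real FPolicy
    wire format is a proprietary binary protocol; decode_frames() below is a
    stand-in for that translation layer."""
    listener = socket.create_server(("0.0.0.0", port))
    while True:
        conn, _addr = listener.accept()
        threading.Thread(target=handle, args=(conn, enqueue), daemon=True).start()

def handle(conn, enqueue):
    with conn:
        for event in decode_frames(conn):  # placeholder for protocol translation
            enqueue(to_sqs_body(event))    # hand off to SQS for the shipper Lambda

def decode_frames(conn):
    """Placeholder: yield decoded event dicts from the FPolicy TCP stream."""
    return iter(())
```

The SQS hop between Fargate and Lambda is what keeps the listener simple: it only decodes and enqueues, while retries, DLQ handling, and vendor shipping stay in the Lambda layer.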
The Architecture
Each event source feeds into the same delivery pattern:
┌─────────────────────────────────────────────────────────────────┐
│ FSx for ONTAP │
├──────────────┬──────────────────────┬───────────────────────────┤
│ File Access │ EMS Webhook │ FPolicy (TCP:9898) │
│ Audit Logs │ │ │
└──────┬───────┴──────────┬───────────┴───────────┬───────────────┘
│ │ │
▼ ▼ ▼
FSx S3 AP + API Gateway ECS Fargate
Scheduler │ │
│ ▼ ▼
▼ Lambda (EMS) SQS → Lambda
Lambda (parser) │ │
│ │ │
└──────────────────┼───────────────────────┘
▼
Observability Vendor API
(Datadog, Splunk, New Relic, ...)
Each integration packages the parser and vendor shipper together in a single Lambda, but the pattern is the same: normalize ONTAP events, then send them to the vendor API. Swap the integration Lambda, and you switch vendors. Vendor-specific Lambdas are optimized for quick adoption and native API behavior, while the OpenTelemetry integration provides a vendor-neutral path for organizations standardizing on OTLP.
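That swap-the-shipper idea boils down to a small interface: one generic HTTP sender plus one encoder per vendor. The adapters below are simplified sketches; confirm the exact payload envelopes against each vendor's ingest API documentation before relying on them:

```python
import json
import urllib.request

def ship(events, *, url, headers, encode):
    """POST a batch of normalized events to a vendor ingest endpoint.
    `encode` adapts the common event shape to the vendor's payload format,
    which is the only part that changes when you swap vendors."""
    req = urllib.request.Request(url, data=encode(events).encode("utf-8"),
                                 headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Illustrative adapters -- payload shapes are simplified for the sketch.
def encode_datadog(events):
    return json.dumps([{"ddsource": "fsxn", **e} for e in events])

def encode_splunk_hec(events):
    # Splunk HEC accepts concatenated JSON event envelopes.
    return "\n".join(json.dumps({"event": e, "sourcetype": "fsxn:audit"})
                     for e in events)
```

Because only the `encode` callable (plus endpoint and auth headers) differs, the parser and checkpoint logic stay identical across all nine integrations.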
The Gotcha That Cost Me a Day
Here's something that isn't immediately obvious from the documentation:
In my validation, a Lambda function placed in a VPC with only an S3 Gateway Endpoint could not read from the FSx for ONTAP S3 Access Point and timed out. Adding NAT Gateway egress resolved the issue.
This gotcha matters because this project intentionally reads audit logs through FSx for ONTAP S3 Access Points rather than mounting the audit volume over NFS/SMB from an EC2 instance.
Tested with:
- Lambda in private subnets (ap-northeast-1)
- FSx for ONTAP S3 Access Point attached to an FSx volume
- S3 Gateway VPC Endpoint only
- No NAT Gateway
- Failure mode: timeout (no response, not AccessDenied)
Your options:
| Lambda Placement | FSx for ONTAP S3 AP Access | Recommendation |
|---|---|---|
| Outside VPC | ✅ Works | Simplest for read-only access |
| VPC + NAT Gateway | ✅ Works | Production recommended |
| VPC + S3 Gateway EP only | ❌ Timeout | Not recommended based on this validation |
This is based on my validation environment (ap-northeast-1). Always test the network path in your own account and Region, as AWS may update this behavior.
Target Vendors
The project targets 9 observability platforms. Datadog is fully verified end-to-end (the subject of Parts 2 and 3 of this series). The remaining vendors have initial implementations that I'll be verifying and writing about in upcoming posts:
| Vendor | Delivery Method | Status |
|---|---|---|
| Datadog | Logs API v2 | ✅ E2E verified |
| Splunk | HEC (HTTP Event Collector) | 🧪 Implementation ready, verification planned |
| New Relic | Log API v1 | 🧪 Implementation ready, verification planned |
| Grafana Cloud | Loki Push API | 🧪 Implementation ready, verification planned |
| Elastic | Bulk API | 🧪 Implementation ready, verification planned |
| Dynatrace | Log Ingest API v2 | 🧪 Implementation ready, verification planned |
| Sumo Logic | HTTP Source | 🧪 Implementation ready, verification planned |
| Honeycomb | Events Batch API | 🧪 Implementation ready, verification planned |
| OpenTelemetry | OTLP/HTTP (vendor-neutral) | 🧪 Implementation ready, verification planned |
Status definitions:
- ✅ E2E verified — Deployed and validated with real FSx for ONTAP audit logs
- 🧪 Implementation ready — Code and CloudFormation available; E2E validation pending
- 🚧 Planned — Design exists; implementation pending
Each vendor integration is designed as a self-contained CloudFormation stack with its own Lambda, IAM roles, DLQ, and CloudWatch alarms. As I verify each one, I'll publish a dedicated article with the results and any vendor-specific gotchas I encounter.
What's in the Repo
The project is structured for easy adoption:
fsxn-observability-integrations/
├── integrations/
│ ├── datadog/ # ✅ Verified: Lambda + CFn + tests + docs
│ ├── splunk-serverless/ # 🧪 Implementation ready
│ ├── new-relic/ # 🧪 Implementation ready
│ ├── grafana/ # 🧪 Implementation ready
│ ├── elastic/ # 🧪 Implementation ready
│ ├── dynatrace/ # 🧪 Implementation ready
│ ├── sumo-logic/ # 🧪 Implementation ready
│ ├── honeycomb/ # 🧪 Implementation ready
│ └── otel-collector/ # 🧪 Implementation ready
├── shared/
│ ├── lambda-layers/ # Reusable log parser (EVTX/XML) + S3 AP reader
│ ├── templates/ # Prerequisites CFn (EventBridge Scheduler, IAM)
│ └── scripts/ # Deploy + test utilities
└── docs/ # Bilingual (EN/JA) documentation
The shared infrastructure (EventBridge Scheduler, log parser layer, IAM roles) is vendor-agnostic and already proven through the Datadog verification. Each vendor directory follows the same structure, so once you understand one, you understand them all. Each stack is designed to include DLQ, CloudWatch alarms, and operational visibility out of the box; the Datadog stack also includes the verified CloudWatch operational dashboard used during E2E validation.
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
Related Posts
If you've been following my FSx for ONTAP S3 Access Points series, this project builds directly on those foundations:
- FSx for ONTAP S3 Access Points as a Serverless Automation Boundary — Where this journey started: using S3 APs as the bridge between ONTAP and serverless
- Production-Ready FPolicy Event Pipeline Across 17 UCs — Phase 11 — The FPolicy pipeline that feeds into this observability project
- Near-Real-Time Processing, ML Inference, and Observability — Phase 3 — Early architecture patterns that evolved into this multi-vendor approach
This observability integrations project is the natural next step: taking those serverless patterns and applying them specifically to audit log shipping across multiple vendors.
Design Considerations
Based on early feedback, here are key points for different audiences:
Design philosophy: The goal is not just to remove EC2. The goal is to move undifferentiated collector operations into managed services, make failures observable and replayable, and keep the integration layer small enough for customers to operate themselves.
Where this pattern matters: This pattern is especially useful for enterprise file workloads where auditability matters but EC2-based collectors add operational overhead —
departmental file shares, enterprise application interface directories such as SAP, Oracle, or SQL Server adjacent file shares, VDI/EUC home directories, engineering and design repositories, regulated file repositories, and ransomware investigation workflows.
Non-intrusive by design: This pipeline observes audit logs after ONTAP records them; it does not sit in the application data path. NFS/SMB access patterns are unchanged. No application code changes are required.
Telemetry ownership: This pattern treats ONTAP as the authoritative source of file activity telemetry, while AWS managed services provide the event processing and delivery layer.
Compliance note: This pattern helps centralize and analyze audit events, but retention, immutability, and regulatory controls should be designed according to your organization's compliance requirements. This is an audit log delivery pattern, not a compliance certification. For audit evidence, consider separately how long raw EVTX/XML files should be retained on the audit volume or archived outside the observability pipeline.
Audit policy dependency: The quality and volume of events depend heavily on your ONTAP audit policy, SACLs, NFSv4 ACLs, and rotation interval. Enabling read auditing on high-traffic volumes can produce significant log volume — design your audit policy carefully.
Cost variables: The biggest cost factors are audit event volume, log rotation frequency, EventBridge Scheduler frequency, Lambda runtime, NAT Gateway usage (if Lambda is in VPC), and vendor ingest pricing. Compared to the EC2 pattern, you trade always-on instance cost for pay-per-invocation compute and vendor-ingest-driven cost.
Multi-account deployment: This pattern can be deployed per workload account or centralized into a logging/security account, depending on your organization's landing zone design.
Reliability: The stack includes DLQ for failed events, CloudWatch alarms for error/throttle detection, and checkpointing to avoid reprocessing already completed audit log files. Delivery to external vendor APIs should be treated as at-least-once; DLQ messages can be replayed after resolving the root cause.
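Replaying after a fixed root cause can be as simple as draining the DLQ back through the shipper. A generic sketch of that loop (the SQS wiring assumes a queue URL you supply; it is not taken from the project):

```python
def drain(receive, process, delete):
    """Generic at-least-once replay loop: process each message, delete only on
    success. A message whose process() raises stays on the queue and
    reappears after its visibility timeout."""
    replayed = 0
    while True:
        batch = receive()
        if not batch:
            return replayed
        for msg in batch:
            process(msg["Body"])
            delete(msg["ReceiptHandle"])
            replayed += 1

def drain_sqs_dlq(queue_url, process):
    """Wire the loop to a real SQS DLQ (queue_url is a placeholder)."""
    import boto3  # imported lazily so drain() stays testable without the SDK
    sqs = boto3.client("sqs")
    receive = lambda: sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10,
        WaitTimeSeconds=1).get("Messages", [])
    delete = lambda handle: sqs.delete_message(
        QueueUrl=queue_url, ReceiptHandle=handle)
    return drain(receive, process, delete)
```

Deleting each message only after `process()` succeeds is what makes the replay at-least-once: a crash mid-drain re-delivers unacknowledged messages rather than losing them.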
What's Coming Next
This is Part 1 of a series. In the upcoming posts, I'll deep-dive into:
- Part 2: Implementing the Datadog integration end-to-end — from CloudFormation to seeing logs in the Datadog Log Explorer
- Part 3: Event-driven ransomware detection using ONTAP's Autonomous Ransomware Protection (ARP) + EMS webhooks + Datadog alerting
Beyond this Datadog series, I'll be verifying and writing about each vendor integration as I go:
- Replacing the EC2-based Splunk pattern with Lambda + HEC
- OpenTelemetry as the vendor-neutral escape hatch
- Grafana Cloud + Loki for the open-source stack
- And more — each with its own E2E verification and lessons learned
The goal is to build a comprehensive, battle-tested pattern library where you can pick your vendor and deploy with confidence. Follow along as I work through each one.
Try It Yourself
The Datadog integration is fully verified and ready to deploy. You'll need:
- An FSx for ONTAP file system with audit logging enabled
- An FSx for ONTAP S3 Access Point attached to the audit volume
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
cd fsxn-observability-integrations
# Deploy Datadog integration
# (FsxS3AccessPointArn = your FSx for ONTAP S3 Access Point ARN)
aws cloudformation deploy \
--template-file integrations/datadog/template.yaml \
--stack-name fsxn-datadog-integration \
--parameter-overrides \
FsxS3AccessPointArn=<your-fsx-s3-ap-arn> \
DatadogApiKeySecretArn=<your-secret-arn> \
--capabilities CAPABILITY_NAMED_IAM
This stack deploys the scheduled Lambda processor, IAM permissions for reading from the FSx for ONTAP S3 Access Point, checkpoint storage, DLQ, CloudWatch alarms, and the Datadog shipping logic. The processor keeps track of already-processed audit log files so each scheduled invocation only ships newly rotated logs.
After deployment, you should see:
- EventBridge Scheduler invoking the Lambda processor on your configured interval
- Checkpoint storage updated after processing rotated audit logs
- Parsed FSx for ONTAP audit events arriving in Datadog Logs (`source:fsxn`)
- CloudWatch alarms and DLQ ready for operational visibility
Full setup guide in the repo's Prerequisites doc.
Have questions or want to see a specific vendor integration verified next? Drop a comment below — it'll help me prioritize the series.
Next up: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way