TL;DR
Deploy a CloudFormation stack, configure ONTAP audit logging, and see structured file access events in Datadog Log Explorer within minutes — no EC2, no NFS mounts, no agents. This post walks through the full implementation: CloudFormation template, Lambda handler code, Datadog field mapping, and operational validation.
What We're Building
In Part 1, I introduced the architecture: FSx for ONTAP audit volume → S3 Access Point → EventBridge Scheduler → Lambda → Datadog. Now let's build it.
By the end of this post, you'll have:
- A deployed CloudFormation stack with Lambda, Scheduler, DLQ, and alarms
- ONTAP audit events flowing into Datadog Log Explorer
- Structured attributes (`@attributes.svm`, `@attributes.user`, `@attributes.operation`, `@attributes.path`, `@attributes.client_ip`, `@attributes.result`) ready for search, filtering, and Datadog facet creation
- An operational CloudWatch dashboard monitoring pipeline health
Prerequisites
Before deploying, you need:
- FSx for ONTAP file system with an SVM configured for audit logging
- FSx for ONTAP S3 Access Point attached to the audit volume
- Datadog account (free trial works) with an API Key
- API Key in Secrets Manager:
```bash
aws secretsmanager create-secret \
  --name fsxn-datadog-api-key \
  --secret-string '{"api_key":"<your-dd-api-key>"}' \
  --region ap-northeast-1
```
- ONTAP audit logging enabled:
# Time-based rotation for quick validation
```
# Time-based rotation for quick validation
vserver audit create -vserver <svm-name> -destination /audit_log \
  -events file-ops \
  -format evtx \
  -rotate-schedule-minute 0,5,10,15,20,25,30,35,40,45,50,55
vserver audit enable -vserver <svm-name>
```
For quick validation, use time-based rotation. If you only use `-rotate-size`, low-volume environments may not produce rotated audit files within the expected validation window. Adjust the `-events` list based on what you want to audit.

**Important:** Enabling `vserver audit` is only one part of file access auditing. Make sure the target SMB folders have SACLs configured, or that NFSv4 ACL audit flags are set for NFS workloads. Otherwise, the audit pipeline may be healthy but no file access events will be generated.
For detailed ONTAP-side setup, including audit volume sizing, SACL/NFSv4 ACL examples, and source health checks, see the repository's ONTAP Audit Setup Guide and Operational Guide.
- Verify how audit files appear via the S3 API (to set `AuditLogPrefix` correctly):
```bash
aws s3api list-objects-v2 \
  --bucket <fsx-s3-access-point-arn-or-alias> \
  --max-keys 10 \
  --region ap-northeast-1
```
Set `AuditLogPrefix` to match the key prefix you see. If the access point is attached directly to the audit volume root, this may be empty.

**Note:** `/audit_log` is the ONTAP namespace path. The S3 object key prefix can differ depending on the access point attachment, so always verify with `list-objects-v2`.
The CloudFormation Stack
The Datadog integration deploys as a single self-contained stack:
```bash
aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
    DatadogSite=ap1.datadoghq.com \
    AuditLogPrefix=<prefix-from-list-objects-v2> \
    ScheduleRate="rate(5 minutes)" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1
```
What Gets Created
| Resource | Purpose |
|---|---|
| Lambda Function | Reads audit logs from S3 AP, parses EVTX/XML, ships to Datadog |
| EventBridge Scheduler | Invokes Lambda every 5 minutes |
| Scheduler IAM Role | Allows Scheduler to invoke Lambda |
| Lambda Execution Role | S3 AP read, Secrets Manager read, CloudWatch Logs, DLQ send permissions |
| Dead Letter Queue (SQS) | Captures failed events for replay |
| CloudWatch Alarms (3) | Errors, throttles, DLQ depth |
| CloudWatch Dashboard | Operational health: errors, duration, invocations, DLQ |
| CloudWatch Log Group | Lambda execution logs (30-day retention) |
Key Parameters
| Parameter | Required | Description |
|---|---|---|
| `FsxS3AccessPointArn` | ✅ | FSx for ONTAP S3 Access Point ARN |
| `DatadogApiKeySecretArn` | ✅ | Secrets Manager ARN for the API key |
| `DatadogSite` | ❌ | Datadog site (default: `ap1.datadoghq.com`) |
| `ScheduleRate` | ❌ | Processing frequency (default: `rate(5 minutes)`) |
| `AuditLogPrefix` | ❌ | Object key prefix as seen via S3 API. Leave empty if audit files appear at the access point root. |
| `VpcEnabled` | ❌ | Enable VPC config — requires NAT Gateway |
The Lambda Handler
The handler follows a straightforward flow:
```
Scheduled invocation
  → List objects from FSx for ONTAP S3 AP (via S3 ListObjectsV2)
  → Filter by checkpoint (skip already-processed files)
  → For each new file:
      → Read via S3 GetObject
      → Detect format (EVTX magic bytes or XML declaration)
      → Parse into normalized events
      → Format for Datadog Logs API v2
      → Batch (≤5MB, ≤1000 items per request)
      → Ship with exponential backoff (max 3 attempts)
      → Update checkpoint
```
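The format-detection step above can be sketched as a small helper. This is an illustrative sketch, not the repository's actual code: it relies on the fact that EVTX files begin with the 8-byte magic `ElfFile\x00`, while XML output begins with an XML declaration (possibly preceded by a UTF-8 BOM).

```python
def detect_format(data: bytes) -> str:
    """Classify an audit file by its leading bytes.

    EVTX files start with the magic bytes b"ElfFile\\x00"; XML files
    begin with an XML declaration, optionally after a UTF-8 BOM.
    """
    if data[:8] == b"ElfFile\x00":
        return "evtx"
    # Strip a possible BOM and leading whitespace before checking for XML
    head = data[:64].lstrip(b"\xef\xbb\xbf \t\r\n")
    if head.startswith(b"<?xml"):
        return "xml"
    return "unknown"
```

Files classified as `unknown` should be logged and skipped rather than parsed blindly; that surfaces in Datadog as the `@attributes.event_type:"unknown"` symptom described later.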
Datadog API Limits
The Datadog Logs API v2 enforces the following per-request limits (docs):
- Maximum payload size (uncompressed): 5MB
- Maximum size for a single log: 1MB (larger logs are truncated, not rejected)
- Maximum array size: 1000 entries
The shipper batches conservatively below these limits.
Core Shipping Logic
```python
def _ship_to_datadog(logs: list[dict], api_key: str) -> int:
    """Ship normalized logs to Datadog Logs Intake API v2.

    If any batch fails after retries, raise an exception so the Lambda
    invocation is treated as failed and the checkpoint is not advanced.
    """
    shipped = 0
    failed_batches = 0
    for batch in _create_batches(logs):
        if _send_batch(batch, api_key):
            shipped += len(batch)
        else:
            failed_batches += 1
    if failed_batches:
        raise RuntimeError(f"{failed_batches} batch(es) failed after retries")
    return shipped
```
Checkpoint Semantics
The checkpoint is advanced only after all batches for an audit log file are successfully delivered to Datadog. If any batch fails after retries, the Lambda invocation fails (raises an exception) and the checkpoint is not updated.
This makes the pipeline at-least-once: the same audit file may be retried on the next scheduled invocation, so downstream queries should tolerate duplicate events. For production, consider adding a deterministic event ID derived from the audit file key and event record offset to support deduplication where your observability platform supports it.
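A deterministic event ID of the kind suggested above could be derived like this (a minimal sketch; the function name and truncation length are illustrative, not part of the shipped code):

```python
import hashlib

def event_id(source_file: str, record_offset: int) -> str:
    """Derive a stable, collision-resistant ID from the audit file key
    and the event's record offset within that file. Reprocessing the
    same file yields the same IDs, enabling downstream deduplication.
    """
    raw = f"{source_file}#{record_offset}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]
```

Because the ID depends only on the file key and offset, a retried file produces identical IDs, which an observability platform (or a downstream query) can use to collapse duplicates.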
Because EventBridge Scheduler invokes Lambda asynchronously, a failed invocation (unhandled exception) triggers Lambda's built-in retry behavior (up to 2 retries by default). After all retries are exhausted, the event payload is sent to the configured DLQ.
Retry with Exponential Backoff
```python
def _send_batch(batch: list[dict], api_key: str) -> bool:
    """Send a single batch with retry on 429/5xx, up to MAX_RETRIES attempts."""
    for attempt in range(MAX_RETRIES):
        response = http.request(
            "POST",
            DATADOG_LOGS_URL,
            body=json.dumps(batch).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "DD-API-KEY": api_key,
            },
        )
        if response.status < 300:
            return True
        if response.status == 429 or response.status >= 500:
            time.sleep(2 ** attempt + random.uniform(0, 1))  # jitter
            continue
        # Client error (4xx) — don't retry
        return False
    return False
```
The implementation uses exponential backoff with jitter (`2 ** attempt` plus a random offset) to avoid synchronized retries when multiple Lambda invocations hit vendor-side throttling simultaneously. Note that `MAX_RETRIES` in the code represents the total number of attempts, not the number of retries after an initial attempt.
API Key Caching
The API key is fetched from Secrets Manager once per Lambda execution context (cold start) and cached in a module-level variable. This avoids per-invocation Secrets Manager calls:
```python
_api_key_cache: str | None = None

def get_api_key() -> str:
    global _api_key_cache
    if _api_key_cache:
        return _api_key_cache
    response = secrets_client.get_secret_value(SecretId=API_KEY_SECRET_ARN)
    secret = json.loads(response["SecretString"])
    _api_key_cache = secret.get("api_key", secret.get("dd_api_key", response["SecretString"]))
    return _api_key_cache
```
Datadog Field Mapping
Every audit event arrives in Datadog with structured attributes. The Lambda sends these via the Datadog Logs API v2 payload fields (`ddsource`, `hostname`, `service`, `message`) and custom attributes nested under `attributes`:
| Datadog Log Explorer | Payload Field | ONTAP Source | Example |
|---|---|---|---|
| `source` | `ddsource` | Configured | `fsxn` |
| `service` | `service` | Configured | `fsxn-ontap` |
| `host` | `hostname` | SVM name | `svm-prod-01` |
| `@attributes.svm` | `attributes.svm` | SVMName / Computer | `svm-prod-01` |
| `@attributes.user` | `attributes.user` | UserName / SubjectUserName | `admin@corp.local` |
| `@attributes.client_ip` | `attributes.client_ip` | ClientIP / IpAddress | `10.0.1.50` |
| `@attributes.operation` | `attributes.operation` | Operation / ObjectType | `ReadData` |
| `@attributes.path` | `attributes.path` | ObjectName | `/vol/data/reports/q4.xlsx` |
| `@attributes.result` | `attributes.result` | Result / Keywords | `Success` |
| `@attributes.event_type` | `attributes.event_type` | EventID | `4663` |
| `@attributes._pipeline.processed_at` | `attributes._pipeline.processed_at` | Lambda timestamp | `2026-05-17T01:30:00Z` |
| `@attributes._pipeline.source_file` | `attributes._pipeline.source_file` | S3 object key | `audit_log/audit_svm_20260517.evtx` |
Set `DatadogSite` to your Datadog site, such as `datadoghq.com` (US1), `datadoghq.eu` (EU1), or `ap1.datadoghq.com` (AP1/Tokyo). The site determines the API endpoint.
For the full cross-vendor mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.
Datadog Search Queries
```
# All FSx for ONTAP audit events
source:fsxn

# Failed access attempts
source:fsxn @attributes.result:Failure

# Specific user activity
source:fsxn @attributes.user:"admin@corp.local"

# Delete operations on sensitive paths
source:fsxn @attributes.operation:delete @attributes.path:"/vol/data/confidential/*"

# Pipeline processing metadata
source:fsxn @attributes._pipeline.source_file:*
```
In Part 3, we'll turn these queries into Datadog Monitors for ARP ransomware detection and suspicious file activity alerting.
Investigation Query Starters
When investigating an incident, start with these patterns:
| Question | Search query | Then group by |
|---|---|---|
| What did this user do? | `source:fsxn @attributes.user:"suspect@corp.local"` | `@attributes.operation` or `@attributes.path` |
| Who accessed this file? | `source:fsxn @attributes.path:"/vol/data/secret.pdf"` | `@attributes.user` |
| Which clients generated failures? | `source:fsxn @attributes.result:Failure` | `@attributes.client_ip` |
| Where are deletes concentrated? | `source:fsxn @attributes.operation:delete` | `@attributes.path` or a path prefix |
| What happened on this SVM in the last hour? | `source:fsxn @attributes.svm:svm-prod-01` | `@attributes.operation` |
For high-volume environments, avoid grouping by full file path unless needed. Consider deriving a lower-cardinality field such as a path prefix or data area classification.
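A low-cardinality path field of the kind suggested above can be derived at ship time. This is a sketch under the assumption that the first few path components (volume and top-level data area) are the useful grouping key; the function name and default depth are illustrative:

```python
def path_prefix(path: str, depth: int = 3) -> str:
    """Reduce a full file path to its first `depth` components,
    e.g. /vol/data/reports/q4.xlsx -> /vol/data/reports.

    Grouping by this prefix keeps facet cardinality bounded even when
    the share contains millions of distinct files.
    """
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])
```

The derived value could be attached as, say, a hypothetical `attributes.path_prefix` field alongside the full path, so detailed investigation still has the exact object name available.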
Operational Validation
Quick Validation (5–10 minutes)
With a 5-minute audit rotation and 5-minute Scheduler interval, the first events typically appear within a few minutes, but allow up to 10 minutes depending on timing.
Before waiting for logs, generate a test file operation on the audited SMB/NFS share — such as creating and deleting a small test file — to ensure ONTAP produces an audit event.
```bash
# 0. Get stack outputs (log group name, DLQ URL, etc.)
aws cloudformation describe-stacks \
  --stack-name fsxn-datadog-integration \
  --query 'Stacks[0].Outputs' \
  --region ap-northeast-1

# 1. Confirm Scheduler is invoking Lambda
aws logs filter-log-events \
  --log-group-name <LambdaLogGroupName from outputs> \
  --start-time $(python3 -c "import time; print(int((time.time()-300)*1000))") \
  --region ap-northeast-1

# 2. Confirm DLQ is empty
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names All \
  --query 'Attributes.ApproximateNumberOfMessages'

# 3. Search in Datadog
# source:fsxn
```
CloudWatch Dashboard
The stack includes a pre-built dashboard (`fsxn-datadog-integration-health`) with:
- Lambda Errors & Throttles
- Lambda Duration (avg/max)
- Lambda Invocations
- DLQ Depth
For production, consider publishing custom metrics such as files processed, events shipped, batch failures, and checkpoint lag to gain deeper pipeline observability beyond Lambda-level metrics.
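One lightweight way to publish such custom metrics is CloudWatch Embedded Metric Format (EMF): the Lambda prints a structured JSON record to stdout and CloudWatch extracts metrics from it, with no extra API calls. The namespace, dimension, and metric names below are hypothetical examples, not part of the shipped stack:

```python
import json
import time

def pipeline_metrics_record(files_processed: int, events_shipped: int,
                            batch_failures: int) -> dict:
    """Build a CloudWatch Embedded Metric Format (EMF) record.

    Printing json.dumps(record) from Lambda is enough: CloudWatch Logs
    extracts the declared metrics into the (hypothetical)
    FsxnAuditPipeline namespace automatically.
    """
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "FsxnAuditPipeline",
                "Dimensions": [["Integration"]],
                "Metrics": [
                    {"Name": "FilesProcessed", "Unit": "Count"},
                    {"Name": "EventsShipped", "Unit": "Count"},
                    {"Name": "BatchFailures", "Unit": "Count"},
                ],
            }],
        },
        "Integration": "datadog",
        "FilesProcessed": files_processed,
        "EventsShipped": events_shipped,
        "BatchFailures": batch_failures,
    }

# At the end of the handler:
# print(json.dumps(pipeline_metrics_record(files, shipped, failures)))
```

These metrics can then back alarms on what the pipeline actually did (zero files processed for an hour, rising batch failures) rather than only on Lambda-level errors.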
What to Watch For
| Symptom | Likely Cause | Fix |
|---|---|---|
| No logs in Datadog | Scheduler not running, or no new audit files | Check CloudWatch Logs for Lambda invocations |
| Logs arrive but fields are empty | EVTX/XML parsing issue | Check @attributes.event_type — if "unknown", parser needs tuning |
| DLQ messages appearing | Datadog API rejection | Check API key validity, site configuration, timestamp age |
| Lambda timeout | Lambda in VPC with only an S3 Gateway Endpoint, so S3 AP reads hang | Verify NAT Gateway or deploy Lambda outside VPC |
Troubleshooting
Old Timestamps May Not Appear in Log Explorer
The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. If your audit files are rotated or processed too late, older events may not appear as expected in Log Explorer.
Fix: Use a time-based ONTAP audit rotation schedule and a Scheduler frequency that keeps processing well within the 18-hour window.
Gzip Compression Issue (AP1 Site)
During E2E validation, gzip-compressed payloads were accepted (HTTP 202) but not indexed on the AP1 site. The ENABLE_GZIP parameter defaults to false for this reason.
S3 Access Point Timeout in VPC
If Lambda is in a VPC with only an S3 Gateway Endpoint, reads from FSx for ONTAP S3 Access Points will timeout. Add NAT Gateway or deploy Lambda outside VPC.
Day-2 Operations
DLQ Replay
This stack uses an SQS queue as the Lambda asynchronous invocation DLQ. Because the DLQ is attached to Lambda (not an SQS source queue), sqs start-message-move-task cannot redrive messages automatically.
For replay, inspect the DLQ message, identify the failed invocation payload, and re-invoke Lambda manually:
```bash
# Inspect failed messages
aws sqs receive-message \
  --queue-url <dlq-url> \
  --max-number-of-messages 1 \
  --attribute-names All \
  --message-attribute-names All
```
After fixing the root cause (e.g., expired API key, Datadog site misconfiguration), re-run the scheduled processor:
```bash
aws lambda invoke \
  --function-name <lambda-function-name> \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  --region ap-northeast-1 \
  replay-output.json
```
In this pattern, replay usually means re-running the scheduled processor after fixing the root cause. Because the checkpoint is not advanced on failed delivery, the same audit file remains eligible for processing on the next invocation. This does not re-submit the DLQ message itself — it re-runs the processor so files whose checkpoints were not advanced can be picked up again.
For production, consider adding a dedicated replay Lambda that reads DLQ messages, validates the payload, and re-submits failed processing requests in a controlled way.
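The first step of such a replay Lambda, parsing a DLQ message back into the failed payload, could look like this. For Lambda async DLQs, the SQS message body is the original invocation event and the error details ride in the documented message attributes (`RequestID`, `ErrorCode`, `ErrorMessage`); the function name is illustrative:

```python
import json

def parse_dlq_message(message: dict) -> dict:
    """Extract the failed invocation payload and error metadata from a
    Lambda-async-DLQ SQS message (as returned by sqs receive-message)."""
    attrs = message.get("MessageAttributes", {})
    return {
        "payload": json.loads(message.get("Body", "{}")),
        "request_id": attrs.get("RequestID", {}).get("StringValue"),
        "error_code": attrs.get("ErrorCode", {}).get("StringValue"),
        "error": attrs.get("ErrorMessage", {}).get("StringValue"),
    }
```

The replay Lambda would then validate the payload, apply whatever fix-specific guard is needed, and re-invoke the processor before deleting the DLQ message.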
Checkpoint Reset (Reprocess All Files)
⚠️ Warning: Resetting the checkpoint causes previously processed audit files to be eligible for reprocessing. This can generate duplicate logs in Datadog. Use only for controlled replay or testing.
```bash
aws dynamodb delete-item \
  --table-name fsxn-observability-audit-checkpoint \
  --key '{"svm_name": {"S": "svm-prod-01"}, "file_key": {"S": "LATEST"}}'
```
Teardown
```bash
aws cloudformation delete-stack \
  --stack-name fsxn-datadog-integration \
  --region ap-northeast-1
```
Deleting the stack does not affect ONTAP audit logging or data on the FSx for ONTAP volume.
Cost Estimate
For a typical deployment (1 SVM, 100MB audit logs/day, 5-minute schedule):
| Component | Monthly Cost |
|---|---|
| Lambda (288 invocations/day × 5s avg) | ~$0.50 |
| EventBridge Scheduler | ~$0.01 |
| DynamoDB (checkpoint) | ~$0.01 |
| Secrets Manager | ~$0.40 |
| CloudWatch Logs (30-day) | ~$1.00 |
| NAT Gateway (if VPC) | Region-dependent hourly + per-GB |
| Total (no VPC) | ~$2/month |
| Total (with VPC/NAT) | ~$30–50+/month depending on Region |
Cost numbers are illustrative. Assume a 5-minute schedule, 5-second average runtime, and 100MB/day of audit logs. NAT Gateway pricing is regional and includes hourly charges plus per-GB data processing charges. Check the AWS Pricing Calculator for your target Region.
Important: Datadog ingest and retention costs are not included in this AWS-side estimate and can become the dominant cost driver for high-volume audit policies, especially when read auditing is enabled.
Evidence retention: This pipeline optimizes search and alerting via normalized events in Datadog. If you need audit evidence retention for compliance, design raw EVTX/XML retention separately on the audit volume or in an archive path.
Cost control: For high-volume environments, consider a tiered strategy: send security-relevant operations such as deletes, permission changes, and failed access to indexed logs; reduce, archive, or exclude noisy read events only if your audit and compliance requirements allow it.
Compare this to an always-on EC2 collector instance, plus EBS, patching labor, and agent licensing.
What's Next
In Part 3, we'll add event-driven security alerting:
- ONTAP Autonomous Ransomware Protection (ARP) detection
- EMS webhook → API Gateway → Lambda → Datadog
- Datadog Monitor configuration for instant alerts
- Incident response workflow
Datadog is the first E2E-verified integration in this pattern library; the same structure will be used for the remaining vendor integrations as they are validated.
Questions about the Datadog integration? Drop a comment below.
Previous: Part 1 — Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
Next: Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog