TL;DR
Deploy a CloudFormation stack, configure ONTAP audit logging, and see structured file access events in Datadog Log Explorer within minutes — no EC2, no NFS mounts, no agents. This post walks through the full implementation: CloudFormation template, Lambda handler code, Datadog field mapping, and operational validation.
What We're Building
In Part 1, I introduced the architecture: FSx for ONTAP audit volume → S3 Access Point → EventBridge Scheduler → Lambda → Datadog. Now let's build it.
By the end of this post, you'll have:
- A deployed CloudFormation stack with Lambda, Scheduler, DLQ, and alarms
- ONTAP audit events flowing into Datadog Log Explorer
- Structured attributes (`@attributes.svm`, `@attributes.user`, `@attributes.operation`, `@attributes.path`, `@attributes.client_ip`, `@attributes.result`) ready for search, filtering, and Datadog facet creation
- An operational CloudWatch dashboard monitoring pipeline health
Prerequisites
Before deploying, you need:
- FSx for ONTAP file system with an SVM configured for audit logging
- FSx for ONTAP S3 Access Point attached to the audit volume
- Datadog account (free trial works) with an API Key
- API Key in Secrets Manager:
```bash
aws secretsmanager create-secret \
  --name fsxn-datadog-api-key \
  --secret-string '{"api_key":"<your-dd-api-key>"}' \
  --region ap-northeast-1
```
- ONTAP audit logging enabled:
# Time-based rotation for quick validation
```
# Time-based rotation for quick validation
vserver audit create -vserver <svm-name> -destination /audit_log \
  -events file-ops \
  -format evtx \
  -rotate-schedule-minute 0,5,10,15,20,25,30,35,40,45,50,55
vserver audit enable -vserver <svm-name>
```
For quick validation, use time-based rotation. If you only use `-rotate-size`, low-volume environments may not produce rotated audit files within the expected validation window. Adjust the `-events` list based on what you want to audit.

**Important:** Enabling `vserver audit` is only one part of file access auditing. Make sure the target SMB folders have SACLs configured, or that NFSv4 ACL audit flags are set for NFS workloads. Otherwise, the audit pipeline may be healthy but no file access events will be generated.
For detailed ONTAP-side setup, including audit volume sizing, SACL/NFSv4 ACL examples, and source health checks, see the repository's ONTAP Audit Setup Guide and Operational Guide.
- Verify how audit files appear via the S3 API (to set `AuditLogPrefix` correctly):
```bash
aws s3api list-objects-v2 \
  --bucket <fsx-s3-access-point-arn-or-alias> \
  --max-keys 10 \
  --region ap-northeast-1
```
Set `AuditLogPrefix` to match the key prefix you see. If the access point is attached directly to the audit volume root, this may be empty.

**Note:** `/audit_log` is the ONTAP namespace path. The S3 object key prefix can differ depending on the access point attachment, so always verify with `list-objects-v2`.
The CloudFormation Stack
The Datadog integration deploys as a single self-contained stack:
```bash
aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
    DatadogSite=ap1.datadoghq.com \
    AuditLogPrefix=<prefix-from-list-objects-v2> \
    ScheduleRate="rate(5 minutes)" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1
```
What Gets Created
| Resource | Purpose |
|---|---|
| Lambda Function | Reads audit logs from S3 AP, parses EVTX/XML, ships to Datadog |
| EventBridge Scheduler | Invokes Lambda every 5 minutes |
| Scheduler IAM Role | Allows Scheduler to invoke Lambda |
| Lambda Execution Role | S3 AP read, Secrets Manager read, CloudWatch Logs, DLQ send permissions |
| Dead Letter Queue (SQS) | Captures failed events for replay |
| CloudWatch Alarms (3) | Errors, throttles, DLQ depth |
| CloudWatch Dashboard | Operational health: errors, duration, invocations, DLQ |
| CloudWatch Log Group | Lambda execution logs (30-day retention) |
Key Parameters
| Parameter | Required | Description |
|---|---|---|
| `FsxS3AccessPointArn` | ✅ | FSx for ONTAP S3 Access Point ARN |
| `DatadogApiKeySecretArn` | ✅ | Secrets Manager ARN for the API key |
| `DatadogSite` | ❌ | Datadog site (default: `ap1.datadoghq.com`) |
| `ScheduleRate` | ❌ | Processing frequency (default: `rate(5 minutes)`) |
| `AuditLogPrefix` | ❌ | Object key prefix as seen via S3 API. Leave empty if audit files appear at the access point root. |
| `VpcEnabled` | ❌ | Enable VPC config — requires NAT Gateway |
The Lambda Handler
The handler follows a straightforward flow:
```
Scheduled invocation
  → List objects from FSx for ONTAP S3 AP (via S3 ListObjectsV2)
  → Filter by checkpoint (skip already-processed files)
  → For each new file:
      → Read via S3 GetObject
      → Detect format (EVTX magic bytes or XML declaration)
      → Parse into normalized events
      → Format for Datadog Logs API v2
      → Batch (≤5MB, ≤1000 items per request)
      → Ship with exponential backoff (max 3 attempts)
      → Update checkpoint
```
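The format-detection step above can be sketched as a small helper. This is an illustrative sketch, not the repository's actual code: it relies on the fact that EVTX files begin with the 8-byte magic `ElfFile\x00`, while XML output begins with an XML declaration (possibly preceded by a UTF-8 BOM).

```python
def detect_format(data: bytes) -> str:
    """Classify an audit file by its leading bytes.

    EVTX files start with the magic bytes b"ElfFile\\x00"; XML files
    begin with an XML declaration, optionally after a UTF-8 BOM.
    """
    if data[:8] == b"ElfFile\x00":
        return "evtx"
    # Strip a possible BOM and leading whitespace before checking for XML
    head = data[:64].lstrip(b"\xef\xbb\xbf \t\r\n")
    if head.startswith(b"<?xml"):
        return "xml"
    return "unknown"
```

Files classified as `unknown` should be logged and skipped rather than parsed blindly; that surfaces in Datadog as the `@attributes.event_type:"unknown"` symptom described later.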
Datadog API Limits
The Datadog Logs API v2 enforces the following per-request limits (docs):
- Maximum payload size (uncompressed): 5MB
- Maximum size for a single log: 1MB (larger logs are truncated, not rejected)
- Maximum array size: 1000 entries
The shipper batches conservatively below these limits.
Core Shipping Logic
```python
def _ship_to_datadog(logs: list[dict], api_key: str) -> int:
    """Ship normalized logs to Datadog Logs Intake API v2.

    If any batch fails after retries, raise an exception so the Lambda
    invocation is treated as failed and the checkpoint is not advanced.
    """
    shipped = 0
    failed_batches = 0
    for batch in _create_batches(logs):
        if _send_batch(batch, api_key):
            shipped += len(batch)
        else:
            failed_batches += 1
    if failed_batches:
        raise RuntimeError(f"{failed_batches} batch(es) failed after retries")
    return shipped
```
Checkpoint Semantics
The checkpoint is advanced only after all batches for an audit log file are successfully delivered to Datadog. If any batch fails after retries, the Lambda invocation fails (raises an exception) and the checkpoint is not updated.
This makes the pipeline at-least-once: the same audit file may be retried on the next scheduled invocation, so downstream queries should tolerate duplicate events. For production, consider adding a deterministic event ID derived from the audit file key and event record offset to support deduplication where your observability platform supports it.
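A deterministic event ID of the kind suggested above could be derived like this (a minimal sketch; the function name and truncation length are illustrative, not part of the shipped code):

```python
import hashlib

def event_id(source_file: str, record_offset: int) -> str:
    """Derive a stable, collision-resistant ID from the audit file key
    and the event's record offset within that file. Reprocessing the
    same file yields the same IDs, enabling downstream deduplication.
    """
    raw = f"{source_file}#{record_offset}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]
```

Because the ID depends only on the file key and offset, a retried file produces identical IDs, which an observability platform (or a downstream query) can use to collapse duplicates.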
Because EventBridge Scheduler invokes Lambda asynchronously, a failed invocation (unhandled exception) triggers Lambda's built-in retry behavior (up to 2 retries by default). After all retries are exhausted, the event payload is sent to the configured DLQ.
Retry with Exponential Backoff
```python
def _send_batch(batch: list[dict], api_key: str) -> bool:
    """Send a single batch with retry on 429/5xx, up to MAX_RETRIES attempts."""
    for attempt in range(MAX_RETRIES):
        response = http.request(
            "POST",
            DATADOG_LOGS_URL,
            body=json.dumps(batch).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "DD-API-KEY": api_key,
            },
        )
        if response.status < 300:
            return True
        if response.status == 429 or response.status >= 500:
            time.sleep(2 ** attempt + random.uniform(0, 1))  # jitter
            continue
        # Client error (4xx) — don't retry
        return False
    return False
```
The implementation uses exponential backoff with jitter (`2 ** attempt` plus a random offset) to avoid synchronized retries when multiple Lambda invocations hit vendor-side throttling simultaneously. Note that `MAX_RETRIES` in the code represents the total number of attempts, not the number of retries after an initial attempt.
API Key Caching
The API key is fetched from Secrets Manager once per Lambda execution context (cold start) and cached in a module-level variable. This avoids per-invocation Secrets Manager calls:
```python
_api_key_cache: str | None = None

def get_api_key() -> str:
    global _api_key_cache
    if _api_key_cache:
        return _api_key_cache
    response = secrets_client.get_secret_value(SecretId=API_KEY_SECRET_ARN)
    secret = json.loads(response["SecretString"])
    _api_key_cache = secret.get("api_key", secret.get("dd_api_key", response["SecretString"]))
    return _api_key_cache
```
Datadog Field Mapping
Every audit event arrives in Datadog with structured attributes. The Lambda sends these via the Datadog Logs API v2 payload fields (`ddsource`, `hostname`, `service`, `message`) and custom attributes nested under `attributes`:
| Datadog Log Explorer | Payload Field | ONTAP Source | Example |
|---|---|---|---|
| `source` | `ddsource` | Configured | `fsxn` |
| `service` | `service` | Configured | `fsxn-ontap` |
| `host` | `hostname` | SVM name | `svm-prod-01` |
| `@attributes.svm` | `attributes.svm` | SVMName / Computer | `svm-prod-01` |
| `@attributes.user` | `attributes.user` | UserName / SubjectUserName | `admin@corp.local` |
| `@attributes.client_ip` | `attributes.client_ip` | ClientIP / IpAddress | `10.0.1.50` |
| `@attributes.operation` | `attributes.operation` | Operation / ObjectType | `ReadData` |
| `@attributes.path` | `attributes.path` | ObjectName | `/vol/data/reports/q4.xlsx` |
| `@attributes.result` | `attributes.result` | Result / Keywords | `Success` |
| `@attributes.event_type` | `attributes.event_type` | EventID | `4663` |
| `@attributes._pipeline.processed_at` | `attributes._pipeline.processed_at` | Lambda timestamp | `2026-05-17T01:30:00Z` |
| `@attributes._pipeline.source_file` | `attributes._pipeline.source_file` | S3 object key | `audit_log/audit_svm_20260517.evtx` |
Set `DatadogSite` to your Datadog site, such as `datadoghq.com` (US1), `datadoghq.eu` (EU1), or `ap1.datadoghq.com` (AP1/Tokyo). The site determines the API endpoint.
For the full cross-vendor mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.
Datadog Search Queries
```
# All FSx for ONTAP audit events
source:fsxn

# Failed access attempts
source:fsxn @attributes.result:Failure

# Specific user activity
source:fsxn @attributes.user:"admin@corp.local"

# Delete operations on sensitive paths
source:fsxn @attributes.operation:delete @attributes.path:"/vol/data/confidential/*"

# Pipeline processing metadata
source:fsxn @attributes._pipeline.source_file:*
```
In Part 3, we'll turn these queries into Datadog Monitors for ARP ransomware detection and suspicious file activity alerting.
Investigation Query Starters
When investigating an incident, start with these patterns:
| Question | Search query | Then group by |
|---|---|---|
| What did this user do? | `source:fsxn @attributes.user:"suspect@corp.local"` | `@attributes.operation` or `@attributes.path` |
| Who accessed this file? | `source:fsxn @attributes.path:"/vol/data/secret.pdf"` | `@attributes.user` |
| Which clients generated failures? | `source:fsxn @attributes.result:Failure` | `@attributes.client_ip` |
| Where are deletes concentrated? | `source:fsxn @attributes.operation:delete` | `@attributes.path` or a path prefix |
| What happened on this SVM in the last hour? | `source:fsxn @attributes.svm:svm-prod-01` | `@attributes.operation` |
For high-volume environments, avoid grouping by full file path unless needed. Consider deriving a lower-cardinality field such as a path prefix or data area classification.
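A low-cardinality path field of the kind suggested above can be derived at ship time. This is a sketch under the assumption that the first few path components (volume and top-level data area) are the useful grouping key; the function name and default depth are illustrative:

```python
def path_prefix(path: str, depth: int = 3) -> str:
    """Reduce a full file path to its first `depth` components,
    e.g. /vol/data/reports/q4.xlsx -> /vol/data/reports.

    Grouping by this prefix keeps facet cardinality bounded even when
    the share contains millions of distinct files.
    """
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])
```

The derived value could be attached as, say, a hypothetical `attributes.path_prefix` field alongside the full path, so detailed investigation still has the exact object name available.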
Operational Validation
Quick Validation (5–10 minutes)
With a 5-minute audit rotation and 5-minute Scheduler interval, the first events typically appear within a few minutes, but allow up to 10 minutes depending on timing.
Before waiting for logs, generate a test file operation on the audited SMB/NFS share — such as creating and deleting a small test file — to ensure ONTAP produces an audit event.
```bash
# 0. Get stack outputs (log group name, DLQ URL, etc.)
aws cloudformation describe-stacks \
  --stack-name fsxn-datadog-integration \
  --query 'Stacks[0].Outputs' \
  --region ap-northeast-1

# 1. Confirm Scheduler is invoking Lambda
aws logs filter-log-events \
  --log-group-name <LambdaLogGroupName from outputs> \
  --start-time $(python3 -c "import time; print(int((time.time()-300)*1000))") \
  --region ap-northeast-1

# 2. Confirm DLQ is empty
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names All \
  --query 'Attributes.ApproximateNumberOfMessages'

# 3. Search in Datadog
# source:fsxn
```
CloudWatch Dashboard
The stack includes a pre-built dashboard (`fsxn-datadog-integration-health`) with:
- Lambda Errors & Throttles
- Lambda Duration (avg/max)
- Lambda Invocations
- DLQ Depth
For production, consider publishing custom metrics such as files processed, events shipped, batch failures, and checkpoint lag to gain deeper pipeline observability beyond Lambda-level metrics.
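One lightweight way to publish such custom metrics is CloudWatch Embedded Metric Format (EMF): the Lambda prints a structured JSON record to stdout and CloudWatch extracts metrics from it, with no extra API calls. The namespace, dimension, and metric names below are hypothetical examples, not part of the shipped stack:

```python
import json
import time

def pipeline_metrics_record(files_processed: int, events_shipped: int,
                            batch_failures: int) -> dict:
    """Build a CloudWatch Embedded Metric Format (EMF) record.

    Printing json.dumps(record) from Lambda is enough: CloudWatch Logs
    extracts the declared metrics into the (hypothetical)
    FsxnAuditPipeline namespace automatically.
    """
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "FsxnAuditPipeline",
                "Dimensions": [["Integration"]],
                "Metrics": [
                    {"Name": "FilesProcessed", "Unit": "Count"},
                    {"Name": "EventsShipped", "Unit": "Count"},
                    {"Name": "BatchFailures", "Unit": "Count"},
                ],
            }],
        },
        "Integration": "datadog",
        "FilesProcessed": files_processed,
        "EventsShipped": events_shipped,
        "BatchFailures": batch_failures,
    }

# At the end of the handler:
# print(json.dumps(pipeline_metrics_record(files, shipped, failures)))
```

These metrics can then back alarms on what the pipeline actually did (zero files processed for an hour, rising batch failures) rather than only on Lambda-level errors.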
What to Watch For
| Symptom | Likely Cause | Fix |
|---|---|---|
| No logs in Datadog | Scheduler not running, or no new audit files | Check CloudWatch Logs for Lambda invocations |
| Logs arrive but fields are empty | EVTX/XML parsing issue | Check @attributes.event_type — if "unknown", parser needs tuning |
| DLQ messages appearing | Datadog API rejection | Check API key validity, site configuration, timestamp age |
| Lambda timeout | Lambda in VPC with only an S3 Gateway Endpoint, so S3 AP reads hang | Verify NAT Gateway or deploy Lambda outside VPC |
Troubleshooting
Old Timestamps May Not Appear in Log Explorer
The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. If your audit files are rotated or processed too late, older events may not appear as expected in Log Explorer.
Fix: Use a time-based ONTAP audit rotation schedule and a Scheduler frequency that keeps processing well within the 18-hour window.
Gzip Compression Issue (AP1 Site)
During E2E validation, gzip-compressed payloads were accepted (HTTP 202) but not indexed on the AP1 site. The ENABLE_GZIP parameter defaults to false for this reason.
S3 Access Point Timeout in VPC
If Lambda is in a VPC with only an S3 Gateway Endpoint, reads from FSx for ONTAP S3 Access Points will timeout. Add NAT Gateway or deploy Lambda outside VPC.
Day-2 Operations
DLQ Replay
This stack uses an SQS queue as the Lambda asynchronous invocation DLQ. Because the DLQ is attached to Lambda (not an SQS source queue), sqs start-message-move-task cannot redrive messages automatically.
For replay, inspect the DLQ message, identify the failed invocation payload, and re-invoke Lambda manually:
```bash
# Inspect failed messages
aws sqs receive-message \
  --queue-url <dlq-url> \
  --max-number-of-messages 1 \
  --attribute-names All \
  --message-attribute-names All
```
After fixing the root cause (e.g., expired API key, Datadog site misconfiguration), re-run the scheduled processor:
```bash
aws lambda invoke \
  --function-name <lambda-function-name> \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  --region ap-northeast-1 \
  replay-output.json
```
In this pattern, replay usually means re-running the scheduled processor after fixing the root cause. Because the checkpoint is not advanced on failed delivery, the same audit file remains eligible for processing on the next invocation. This does not re-submit the DLQ message itself — it re-runs the processor so files whose checkpoints were not advanced can be picked up again.
For production, consider adding a dedicated replay Lambda that reads DLQ messages, validates the payload, and re-submits failed processing requests in a controlled way.
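The first step of such a replay Lambda, parsing a DLQ message back into the failed payload, could look like this. For Lambda async DLQs, the SQS message body is the original invocation event and the error details ride in the documented message attributes (`RequestID`, `ErrorCode`, `ErrorMessage`); the function name is illustrative:

```python
import json

def parse_dlq_message(message: dict) -> dict:
    """Extract the failed invocation payload and error metadata from a
    Lambda-async-DLQ SQS message (as returned by sqs receive-message)."""
    attrs = message.get("MessageAttributes", {})
    return {
        "payload": json.loads(message.get("Body", "{}")),
        "request_id": attrs.get("RequestID", {}).get("StringValue"),
        "error_code": attrs.get("ErrorCode", {}).get("StringValue"),
        "error": attrs.get("ErrorMessage", {}).get("StringValue"),
    }
```

The replay Lambda would then validate the payload, apply whatever fix-specific guard is needed, and re-invoke the processor before deleting the DLQ message.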
Checkpoint Reset (Reprocess All Files)
⚠️ Warning: Resetting the checkpoint causes previously processed audit files to be eligible for reprocessing. This can generate duplicate logs in Datadog. Use only for controlled replay or testing.
```bash
aws dynamodb delete-item \
  --table-name fsxn-observability-audit-checkpoint \
  --key '{"svm_name": {"S": "svm-prod-01"}, "file_key": {"S": "LATEST"}}'
```
Teardown
```bash
aws cloudformation delete-stack \
  --stack-name fsxn-datadog-integration \
  --region ap-northeast-1
```
Deleting the stack does not affect ONTAP audit logging or data on the FSx for ONTAP volume.
Cost Estimate
For a typical deployment (1 SVM, 100MB audit logs/day, 5-minute schedule):
| Component | Monthly Cost |
|---|---|
| Lambda (288 invocations/day × 5s avg) | ~$0.50 |
| EventBridge Scheduler | ~$0.01 |
| DynamoDB (checkpoint) | ~$0.01 |
| Secrets Manager | ~$0.40 |
| CloudWatch Logs (30-day) | ~$1.00 |
| NAT Gateway (if VPC) | Region-dependent hourly + per-GB |
| Total (no VPC) | ~$2/month |
| Total (with VPC/NAT) | ~$30–50+/month depending on Region |
Cost numbers are illustrative. Assume a 5-minute schedule, 5-second average runtime, and 100MB/day of audit logs. NAT Gateway pricing is regional and includes hourly charges plus per-GB data processing charges. Check the AWS Pricing Calculator for your target Region.
Important: Datadog ingest and retention costs are not included in this AWS-side estimate and can become the dominant cost driver for high-volume audit policies, especially when read auditing is enabled.
Evidence retention: This pipeline optimizes search and alerting via normalized events in Datadog. If you need audit evidence retention for compliance, design raw EVTX/XML retention separately on the audit volume or in an archive path.
Cost control: For high-volume environments, consider a tiered strategy: send security-relevant operations such as deletes, permission changes, and failed access to indexed logs; reduce, archive, or exclude noisy read events only if your audit and compliance requirements allow it.
Compare this to an always-on EC2 collector instance, plus EBS, patching labor, and agent licensing.
What's Next
In Part 3, we'll add event-driven security alerting:
- ONTAP Autonomous Ransomware Protection (ARP) detection
- EMS webhook → API Gateway → Lambda → Datadog
- Datadog Monitor configuration for instant alerts
- Incident response workflow
Datadog is the first E2E-verified integration in this pattern library; the same structure will be used for the remaining vendor integrations as they are validated.
Questions about the Datadog integration? Drop a comment below.
Previous: Part 1 — Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
Next: Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog