TL;DR
ONTAP's Autonomous Ransomware Protection (ARP) detects encryption patterns at the storage layer. When ARP fires, an EMS event is pushed via webhook to API Gateway → Lambda → Datadog. In my validation environment, end-to-end latency was around 30 seconds. This post shows how to wire it up, what the alert looks like, and how to respond.
The Threat Model
Ransomware encrypts files at hundreds or thousands of files per minute. Traditional detection — antivirus signatures, host-based EDR — often catches it after significant damage is done.
What if your storage could detect the encryption pattern before the host-based tools react?
That's exactly what ONTAP Autonomous Ransomware Protection (ARP) does. It runs ML-based entropy analysis at the storage layer, detecting:
- Sudden spikes in file entropy (encryption)
- Mass file extension changes (`.docx` → `.encrypted`)
- Abnormal write patterns inconsistent with normal workload behavior
When ARP detects an attack, it changes the volume state to `attack-detected` and fires an EMS event. Our job is to get that event to the security team in seconds, not hours.
The Detection Pipeline
In Part 2, we built the audit log pipeline and showed Datadog search queries for file access events. Now we turn those patterns into event-driven security alerting — starting with ONTAP's most powerful detection signal: Autonomous Ransomware Protection.
ONTAP ARP detects encryption behavior
│
▼ EMS event: arw.volume.state (severity: alert)
ONTAP EMS Webhook (HTTPS POST)
│
▼
API Gateway (REST endpoint)
│
▼
Lambda (EMS handler)
│
▼ normalize → format → ship
Datadog Logs API v2 (source:fsxn-ems)
│
▼
Datadog Monitor → PagerDuty / Slack / Email
End-to-end latency: around 30 seconds in my validation environment (ap-northeast-1). Your latency will vary depending on ONTAP event delivery, API Gateway/Lambda behavior, Datadog ingest latency, and notification routing.
Compare this to the audit log path (Part 2), which depends on rotation interval + scheduler frequency. EMS webhooks are event-driven rather than scheduled, delivering alerts within seconds rather than minutes.
Deploying the EMS Integration
The EMS Lambda is deployed alongside the FPolicy shipping Lambda in a single stack. Note that the FPolicy TCP listener itself remains a separate ECS Fargate-based path (as described in Part 1) because ONTAP FPolicy requires a persistent TCP connection.
aws cloudformation deploy \
--template-file integrations/datadog/template-ems-fpolicy.yaml \
--stack-name fsxn-datadog-ems-fpolicy \
--parameter-overrides \
DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
DatadogSite=ap1.datadoghq.com \
--capabilities CAPABILITY_NAMED_IAM \
--region ap-northeast-1
What Gets Created
| Resource | Purpose |
|---|---|
| EMS Lambda | Receives EMS webhooks, normalizes, ships to Datadog |
| FPolicy Lambda | Receives FPolicy events from SQS, ships to Datadog |
| API Gateway (from shared EMS webhook stack) | HTTPS endpoint for ONTAP EMS webhooks |
| IAM Roles | Least-privilege for each Lambda |
| CloudWatch Log Groups | Execution logs |
Webhook Security
For production, do not expose an unauthenticated webhook endpoint. ONTAP EMS webhook destinations support HTTPS and mutual authentication options. Use HTTPS for the API Gateway endpoint, restrict access where possible, and consider validating a shared secret or header in the Lambda handler.
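As a minimal sketch of that last option, the handler below rejects requests that lack an expected header value. The header name (`x-ems-shared-secret`) and environment variable are illustrative assumptions, not names from the repository; adapt them to your deployment:

import hmac
import os

# Assumed names for illustration; not taken from the repository.
SHARED_SECRET = os.environ.get("EMS_WEBHOOK_SECRET", "")

def _is_authorized(event: dict) -> bool:
    """Reject webhook calls that lack the expected shared-secret header."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    provided = headers.get("x-ems-shared-secret", "")
    # hmac.compare_digest avoids timing side channels on the comparison.
    return bool(SHARED_SECRET) and hmac.compare_digest(provided, SHARED_SECRET)

In `lambda_handler`, call this first and return a 401 response before doing any processing when it fails.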
ONTAP EMS Configuration
After deployment, configure ONTAP EMS to forward ARP-related events to the API Gateway endpoint. At minimum, include `arw.volume.state` and other `arw.*` events you want to monitor. Refer to the NetApp EMS webhook documentation for destination and filter configuration.
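As a sketch of what that looks like on the ONTAP CLI, the configuration typically pairs an event filter with a REST webhook destination. Names here are placeholders and exact syntax varies by ONTAP version, so verify against the NetApp documentation:

event filter create -filter-name arw-events
event filter rule add -filter-name arw-events -type include -message-name arw.*
event notification destination create -name datadog-webhook -rest-api-url https://<api-gateway-id>.execute-api.ap-northeast-1.amazonaws.com/prod/ems
event notification create -filter-name arw-events -destinations datadog-webhook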
The EMS Lambda Handler
The handler receives an API Gateway proxy event containing the EMS webhook payload:
import logging
from typing import Any

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event: dict, context: Any) -> dict:
"""Process EMS webhook from ONTAP via API Gateway."""
api_key = get_api_key()
request_id = _get_request_id(event)
logger.info("EMS handler invoked: requestId=%s", request_id)
# Extract EMS events from webhook body
ems_events = _extract_ems_events(event)
logger.info("Parsed %d EMS event(s)", len(ems_events))
# Normalize to common schema
normalized = _normalize_ems_events(ems_events)
# Format for Datadog
dd_logs = _format_for_datadog(normalized)
# Ship to Datadog
shipped = _ship_to_datadog(dd_logs, api_key)
return _api_response(200, {
"message": "EMS events processed",
"total_events": len(ems_events),
"shipped": shipped,
})
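The handler calls `_extract_ems_events()` to pull EMS payloads out of the API Gateway proxy event. As a minimal sketch of what it might look like, assuming the webhook body is JSON carrying either a single event object or a list (possibly under an `events` key; verify the actual payload shape your ONTAP version delivers):

import json

def _extract_ems_events(event: dict) -> list[dict]:
    """Parse EMS event dict(s) out of the API Gateway proxy event body."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return []
    if isinstance(payload, list):
        return [e for e in payload if isinstance(e, dict)]
    if not isinstance(payload, dict):
        return []
    # Some payload shapes wrap events in an "events" array (assumption).
    if isinstance(payload.get("events"), list):
        return [e for e in payload["events"] if isinstance(e, dict)]
    return [payload]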
EMS Event Normalization
ONTAP EMS events arrive with fields like `messageName`, `severity`, `node`, `svmName`, and `parameters`. The handler normalizes them:
import json
from datetime import datetime, timezone

def _normalize_ems_events(events: list[dict]) -> list[dict]:
"""Normalize raw EMS events to internal schema."""
normalized = []
for event in events:
normalized.append({
"event_name": event.get("messageName", "unknown"),
"severity": event.get("severity", "info"),
"source_node": event.get("node", ""),
"svm": event.get("svmName", ""),
"message": event.get("message", json.dumps(event)),
"parameters": event.get("parameters", {}),
"timestamp": event.get("time", datetime.now(timezone.utc).isoformat()),
})
return normalized
Datadog Formatting (source:fsxn-ems)
def _format_for_datadog(events: list[dict]) -> list[dict]:
"""Format normalized EMS events for Datadog Logs API v2."""
dd_logs = []
for event in events:
dd_logs.append({
"ddsource": "fsxn-ems",
"ddtags": f"source:fsxn-ems,service:{DD_SERVICE},env:{DD_ENV}",
"hostname": event["source_node"],
"service": DD_SERVICE,
"message": event["message"],
"date": event["timestamp"],
"attributes": {
"event_name": event["event_name"],
"severity": event["severity"],
"source_node": event["source_node"],
"svm": event["svm"],
"parameters": event["parameters"],
},
})
return dd_logs
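For completeness, here is a minimal sketch of the shipping step using only the standard library. It assumes the Datadog Logs API v2 intake host for your site (`http-intake.logs.<site>/api/v2/logs`); the repository's implementation may batch and retry differently:

import json
import urllib.request

DD_SITE = "ap1.datadoghq.com"  # assumption: matches the DatadogSite stack parameter

def _ship_to_datadog(dd_logs: list[dict], api_key: str) -> int:
    """POST formatted logs to the Datadog Logs API v2 intake endpoint."""
    if not dd_logs:
        return 0
    req = urllib.request.Request(
        f"https://http-intake.logs.{DD_SITE}/api/v2/logs",
        data=json.dumps(dd_logs).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Datadog returns 202 Accepted when the batch is queued for ingest.
        return len(dd_logs) if resp.status in (200, 202) else 0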
ARP Event Payload (Normalized by Lambda)
ONTAP EMS webhooks deliver event notifications to the API Gateway endpoint. The Lambda's `_extract_ems_events()` function parses the incoming API Gateway proxy event body, then `_normalize_ems_events()` produces the following internal schema:
{
"event_name": "arw.volume.state",
"severity": "alert",
"source_node": "fsxn-node-01",
"svm": "svm-prod-01",
"timestamp": "2026-05-17T01:04:22Z",
"message": "Anti-ransomware: Volume vol_data state changed to attack-detected",
"parameters": {
"volume_name": "vol_data",
"state": "attack-detected"
}
}
In Datadog, this arrives as:
source:fsxn-ems
host:fsxn-node-01
service:fsxn-ontap
@attributes.event_name:arw.volume.state
@attributes.severity:alert
@attributes.svm:svm-prod-01
@attributes.parameters.volume_name:vol_data
@attributes.parameters.state:attack-detected
Setting Up the Datadog Monitor
Create a Monitor that triggers on any ARP alert:
Monitor Configuration
Log Explorer search query:
source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected
Datadog Monitor API JSON:
{
"name": "🚨 FSx for ONTAP: Ransomware Detected (ARP)",
"type": "log alert",
"query": "logs(\"source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected\").index(\"*\").rollup(\"count\").last(\"5m\") > 0",
"message": "🚨 ONTAP Autonomous Ransomware Protection detected suspicious activity.\n\n**Volume**: {{attributes.parameters.volume_name}}\n**SVM**: {{attributes.svm}}\n**Node**: {{host}}\n**Time**: {{date}}\n\n## Recommended Actions\n1. Verify the ARP event in ONTAP and Datadog.\n2. Check FPolicy/audit logs for user/client IP correlation.\n3. Follow the approved storage incident response runbook for snapshot, access restriction, or recovery actions.\n\n@pagerduty @slack-security-alerts",
"options": {
"thresholds": { "critical": 0 },
"notify_no_data": false,
"evaluation_delay": 0
}
}
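Save the JSON above to a file (here `arp-monitor.json`, an arbitrary name) and create the monitor through the Datadog Monitors API. The host below assumes the AP1 site; substitute your site's API host and supply both an API key and an application key:

# Create the monitor via the Datadog API (AP1 site assumed)
curl -X POST "https://api.ap1.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d @arp-monitor.json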
What This Monitor Does
- Triggers on: Any `arw.volume.state` event with `state:attack-detected`
- Threshold: Critical when count > 0 in a 5-minute window
- Notification: PagerDuty + Slack with volume name, SVM, and response steps
- No-data handling: Disabled (absence of ARP events is normal)

Adjust template variables (`{{attributes.*}}`, `{{host}}`, `{{date}}`) based on how your Datadog site renders log attributes in monitor notifications. Test with a simulated event before relying on production alerts.
FPolicy: The Complementary Signal
While ARP detects the encryption pattern, FPolicy provides the file-level detail. Together they answer:
| Question | Source |
|---|---|
| Is ransomware active? | ARP (EMS) |
| Which files are affected? | FPolicy |
| Who is doing it? | FPolicy (user field) |
| From where? | FPolicy (client_ip field) |
| What operations? | FPolicy (operation: create, write, rename, delete) |
FPolicy Event in Datadog
source:fsxn-fpolicy
@attributes.operation:create
@attributes.file_path:/vol/data/finance/confidential_report.xlsx
@attributes.user:suspicious_user@corp.local
@attributes.client_ip:10.0.1.55
@attributes.protocol:cifs
Correlation Query
After an ARP alert, investigate with FPolicy data:
source:fsxn-fpolicy @attributes.svm:svm-prod-01 @attributes.operation:(create OR write OR rename)
This shows all file modifications on the affected SVM, helping identify the responsible user and client.
Incident Response Workflow
1. ARP fires → EMS webhook → Datadog alert (around 30 seconds)
│
2. Responder receives PagerDuty/Slack notification
│
3. Verify in Datadog and ONTAP:
- source:fsxn-ems → confirm ARP event details
- source:fsxn-fpolicy → identify user, IP, affected files
- ONTAP: security anti-ransomware volume show
│
4. Correlate and assess:
- Is this a true positive or legitimate bulk operation?
- What is the blast radius (volumes, files, users)?
│
5. Containment (only after verification, per approved runbook):
- Create snapshot (preserve recovery point)
- Restrict volume access if confirmed malicious
- Review ARP suspect list
│
6. Recovery:
- Restore from snapshot (pre-attack state)
- Re-enable access after containment
- Update audit policies if gaps found
Important: ARP alerts are high-confidence signals, but false positives can occur (e.g., legitimate backup encryption, bulk file operations). Always verify before applying disruptive containment actions such as restricting volume access. Follow your organization's incident response process.
For a more detailed role-based runbook, see the repository's ARP Incident Response Guide.
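For the ONTAP-side verification and snapshot steps, commands along these lines apply. Volume and SVM names are placeholders and exact syntax varies by ONTAP version; defer to your approved runbook before taking any containment action:

security anti-ransomware volume show -vserver svm-prod-01 -volume vol_data
volume snapshot create -vserver svm-prod-01 -volume vol_data -snapshot arp_incident_recovery_point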
Beyond ARP: Other EMS Use Cases
The same EMS webhook pipeline handles other critical ONTAP events:
| EMS Event | Severity | Use Case |
|---|---|---|
| `arw.volume.state` | alert | Ransomware detection |
| `wafl.quota.softlimit.exceeded` | warning | Capacity planning |
| `wafl.quota.hardlimit.exceeded` | error | Immediate capacity action |
| `cf.fsm.takeover` | alert | HA failover notification |
| `sms.vol.full` | error | Volume full; data at risk |
| `net.linkDown` | warning | Network connectivity issue |
All arrive in Datadog as source:fsxn-ems with the event name in @attributes.event_name, enabling targeted Monitors for each scenario. For the full cross-vendor field mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.
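For example, a capacity Monitor can reuse the exact pattern from the ARP monitor with a different event name in the query:

source:fsxn-ems @attributes.event_name:wafl.quota.softlimit.exceeded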
Validation Results
This integration was validated end-to-end:
| Test | Result | Latency |
|---|---|---|
| ARP event → Datadog | ✅ Arrived | ~30 seconds |
| Quota exceeded → Datadog | ✅ Arrived | ~30 seconds |
| FPolicy file create → Datadog | ✅ Arrived (via SQS → Lambda path) | ~30 seconds |
| Lambda error handling | ✅ DLQ capture | — |
| API key from Secrets Manager | ✅ Cached | — |
Validation performed in ap-northeast-1 with the deployed `fsxn-datadog-ems-fpolicy` stack.
Design Considerations for Security Teams
Webhook security: Use HTTPS for EMS webhook delivery. Do not expose an unauthenticated API Gateway endpoint in production. Validate a shared secret, header, or mTLS identity where possible.
Detection latency: EMS webhooks are event-driven. ARP detection itself depends on ONTAP's ML model — it typically fires within seconds of detecting the pattern, not after a fixed interval. End-to-end latency from ARP detection to Datadog visibility depends on webhook delivery, Lambda processing, and Datadog ingest.
False positives: ARP can trigger on legitimate bulk encryption operations (e.g., backup software encrypting files). Design your response workflow to include a verification step before disruptive actions like restricting volume access.
Coverage: ARP behavior depends on your ONTAP version, volume type, and whether ARP/AI is available. Older NAS FlexVol configurations may start in learning mode before active detection, while newer ONTAP versions (9.16.1+ with ARP/AI) can become active immediately for supported volumes. Always verify security anti-ransomware volume show before relying on alerts.
Audit trail: The EMS event in Datadog serves as the detection timestamp for incident timelines. FPolicy events provide the forensic detail. Together they form a complete audit trail from detection to response.
Cost profile: EMS events are usually low-volume and alert-oriented, while FPolicy can be high-volume depending on policy scope. Treat their Datadog ingest and alerting cost profiles separately.
Try It Yourself
If you want the shortest path to a first successful ARP alert test, see the repository's minimum quick start.
The following simulated event exercises the Lambda normalization and Datadog shipping path. Your actual ONTAP EMS webhook payload may differ depending on EMS webhook configuration, so validate with a real EMS event before production use.
# Deploy EMS + FPolicy integration
aws cloudformation deploy \
--template-file integrations/datadog/template-ems-fpolicy.yaml \
--stack-name fsxn-datadog-ems-fpolicy \
--parameter-overrides \
DatadogApiKeySecretArn=<your-secret-arn> \
DatadogSite=ap1.datadoghq.com \
--capabilities CAPABILITY_NAMED_IAM
# Create a test event file
cat > arp-test-event.json <<EOF
{
"body": "{\"messageName\":\"arw.volume.state\",\"severity\":\"alert\",\"node\":\"fsxn-node-01\",\"svmName\":\"svm-prod-01\",\"time\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"message\":\"Anti-ransomware: Volume vol_data state changed to attack-detected\",\"parameters\":{\"volume_name\":\"vol_data\",\"state\":\"attack-detected\"}}",
"requestContext": {"requestId": "test"}
}
EOF
# Invoke Lambda with the test event
aws lambda invoke \
--function-name fsxn-datadog-ems-fpolicy-ems \
--payload file://arp-test-event.json \
--cli-binary-format raw-in-base64-out \
--region ap-northeast-1 \
arp-test-output.json
# Check Datadog: source:fsxn-ems @attributes.event_name:arw.volume.state
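A successful invocation writes the handler's response to `arp-test-output.json`; expect a 200 status with the processed and shipped counts from `_api_response()`:

# Verify the handler returned 200 with total_events and shipped counts
cat arp-test-output.json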
What's Next
This completes the Datadog series:
- Part 1: Architecture and project introduction
- Part 2: Audit log pipeline implementation
- Part 3: Event-driven ransomware detection (this post)
Coming up next in the series:
- Splunk: Replacing EC2 + Universal Forwarder with Lambda + HEC
- OpenTelemetry: The vendor-neutral escape hatch
- Grafana Cloud: Loki Push API with label cardinality guidance
Each will follow the same pattern: deploy, validate, document the gotchas.
Have questions about ARP detection or the EMS pipeline? Drop a comment below.
Previous: Part 2 — Shipping FSx for ONTAP Logs to Datadog, The Serverless Way
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations