TL;DR
ONTAP's Autonomous Ransomware Protection (ARP) detects encryption patterns at the storage layer. When ARP fires, an EMS event is pushed via webhook to API Gateway → Lambda → Datadog. In my validation environment, end-to-end latency was around 30 seconds. This post shows how to wire it up, what the alert looks like, and how to respond.
The Threat Model
Ransomware encrypts files at hundreds or thousands of files per minute. Traditional detection — antivirus signatures, host-based EDR — often catches it after significant damage is done.
What if your storage could detect the encryption pattern before the host-based tools react?
That's exactly what ONTAP Autonomous Ransomware Protection (ARP) does. It runs ML-based entropy analysis at the storage layer, detecting:
- Sudden spikes in file entropy (encryption)
- Mass file extension changes (`.docx` → `.encrypted`)
- Abnormal write patterns inconsistent with normal workload behavior
When ARP detects an attack, it changes the volume state to `attack-detected` and fires an EMS event. Our job is to get that event to the security team in seconds, not hours.
The Detection Pipeline
In Part 2, we built the audit log pipeline and showed Datadog search queries for file access events. Now we turn those patterns into event-driven security alerting — starting with ONTAP's most powerful detection signal: Autonomous Ransomware Protection.
ONTAP ARP detects encryption behavior
│
▼ EMS event: arw.volume.state (severity: alert)
ONTAP EMS Webhook (HTTPS POST)
│
▼
API Gateway (REST endpoint)
│
▼
Lambda (EMS handler)
│
▼ normalize → format → ship
Datadog Logs API v2 (source:fsxn-ems)
│
▼
Datadog Monitor → PagerDuty / Slack / Email
End-to-end latency: around 30 seconds in my validation environment (ap-northeast-1). Your latency will vary depending on ONTAP event delivery, API Gateway/Lambda behavior, Datadog ingest latency, and notification routing.
Compare this to the audit log path (Part 2), which depends on rotation interval + scheduler frequency. EMS webhooks are event-driven rather than scheduled, delivering alerts within seconds rather than minutes.
Deploying the EMS Integration
The EMS Lambda is deployed alongside the FPolicy shipping Lambda in a single stack. Note that the FPolicy TCP listener itself remains a separate ECS Fargate-based path (as described in Part 1) because ONTAP FPolicy requires a persistent TCP connection.
aws cloudformation deploy \
--template-file integrations/datadog/template-ems-fpolicy.yaml \
--stack-name fsxn-datadog-ems-fpolicy \
--parameter-overrides \
DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
DatadogSite=ap1.datadoghq.com \
--capabilities CAPABILITY_NAMED_IAM \
--region ap-northeast-1
What Gets Created
| Resource | Purpose |
|---|---|
| EMS Lambda | Receives EMS webhooks, normalizes, ships to Datadog |
| FPolicy Lambda | Receives FPolicy events from SQS, ships to Datadog |
| API Gateway (from shared EMS webhook stack) | HTTPS endpoint for ONTAP EMS webhooks |
| IAM Roles | Least-privilege for each Lambda |
| CloudWatch Log Groups | Execution logs |
Webhook Security
For production, do not expose an unauthenticated webhook endpoint. ONTAP EMS webhook destinations support HTTPS and mutual authentication options. Use HTTPS for the API Gateway endpoint, restrict access where possible, and consider validating a shared secret or header in the Lambda handler.
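As a minimal sketch of that last option, the handler below rejects requests that lack an expected header value. The header name (`x-ems-shared-secret`) and environment variable are illustrative assumptions, not names from the repository; adapt them to your deployment:

import hmac
import os

# Assumed names for illustration; not taken from the repository.
SHARED_SECRET = os.environ.get("EMS_WEBHOOK_SECRET", "")

def _is_authorized(event: dict) -> bool:
    """Reject webhook calls that lack the expected shared-secret header."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    provided = headers.get("x-ems-shared-secret", "")
    # hmac.compare_digest avoids timing side channels on the comparison.
    return bool(SHARED_SECRET) and hmac.compare_digest(provided, SHARED_SECRET)

In `lambda_handler`, call this first and return a 401 response before doing any processing when it fails.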
ONTAP EMS Configuration
After deployment, configure ONTAP EMS to forward ARP-related events to the API Gateway endpoint. At minimum, include `arw.volume.state` and other `arw.*` events you want to monitor. Refer to the NetApp EMS webhook documentation for destination and filter configuration.
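As a sketch of what that looks like on the ONTAP CLI, the configuration typically pairs an event filter with a REST webhook destination. Names here are placeholders and exact syntax varies by ONTAP version, so verify against the NetApp documentation:

event filter create -filter-name arw-events
event filter rule add -filter-name arw-events -type include -message-name arw.*
event notification destination create -name datadog-webhook -rest-api-url https://<api-gateway-id>.execute-api.ap-northeast-1.amazonaws.com/prod/ems
event notification create -filter-name arw-events -destinations datadog-webhook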
The EMS Lambda Handler
The handler receives an API Gateway proxy event containing the EMS webhook payload:
import logging
from typing import Any

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event: dict, context: Any) -> dict:
"""Process EMS webhook from ONTAP via API Gateway."""
api_key = get_api_key()
request_id = _get_request_id(event)
logger.info("EMS handler invoked: requestId=%s", request_id)
# Extract EMS events from webhook body
ems_events = _extract_ems_events(event)
logger.info("Parsed %d EMS event(s)", len(ems_events))
# Normalize to common schema
normalized = _normalize_ems_events(ems_events)
# Format for Datadog
dd_logs = _format_for_datadog(normalized)
# Ship to Datadog
shipped = _ship_to_datadog(dd_logs, api_key)
return _api_response(200, {
"message": "EMS events processed",
"total_events": len(ems_events),
"shipped": shipped,
})
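The handler calls `_extract_ems_events()` to pull EMS payloads out of the API Gateway proxy event. As a minimal sketch of what it might look like, assuming the webhook body is JSON carrying either a single event object or a list (possibly under an `events` key; verify the actual payload shape your ONTAP version delivers):

import json

def _extract_ems_events(event: dict) -> list[dict]:
    """Parse EMS event dict(s) out of the API Gateway proxy event body."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return []
    if isinstance(payload, list):
        return [e for e in payload if isinstance(e, dict)]
    if not isinstance(payload, dict):
        return []
    # Some payload shapes wrap events in an "events" array (assumption).
    if isinstance(payload.get("events"), list):
        return [e for e in payload["events"] if isinstance(e, dict)]
    return [payload]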
EMS Event Normalization
ONTAP EMS events arrive with fields like `messageName`, `severity`, `node`, `svmName`, and `parameters`. The handler normalizes them:
import json
from datetime import datetime, timezone

def _normalize_ems_events(events: list[dict]) -> list[dict]:
"""Normalize raw EMS events to internal schema."""
normalized = []
for event in events:
normalized.append({
"event_name": event.get("messageName", "unknown"),
"severity": event.get("severity", "info"),
"source_node": event.get("node", ""),
"svm": event.get("svmName", ""),
"message": event.get("message", json.dumps(event)),
"parameters": event.get("parameters", {}),
"timestamp": event.get("time", datetime.now(timezone.utc).isoformat()),
})
return normalized
Datadog Formatting (source:fsxn-ems)
def _format_for_datadog(events: list[dict]) -> list[dict]:
"""Format normalized EMS events for Datadog Logs API v2."""
dd_logs = []
for event in events:
dd_logs.append({
"ddsource": "fsxn-ems",
"ddtags": f"source:fsxn-ems,service:{DD_SERVICE},env:{DD_ENV}",
"hostname": event["source_node"],
"service": DD_SERVICE,
"message": event["message"],
"date": event["timestamp"],
"attributes": {
"event_name": event["event_name"],
"severity": event["severity"],
"source_node": event["source_node"],
"svm": event["svm"],
"parameters": event["parameters"],
},
})
return dd_logs
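For completeness, here is a minimal sketch of the shipping step using only the standard library. It assumes the Datadog Logs API v2 intake host for your site (`http-intake.logs.<site>/api/v2/logs`); the repository's implementation may batch and retry differently:

import json
import urllib.request

DD_SITE = "ap1.datadoghq.com"  # assumption: matches the DatadogSite stack parameter

def _ship_to_datadog(dd_logs: list[dict], api_key: str) -> int:
    """POST formatted logs to the Datadog Logs API v2 intake endpoint."""
    if not dd_logs:
        return 0
    req = urllib.request.Request(
        f"https://http-intake.logs.{DD_SITE}/api/v2/logs",
        data=json.dumps(dd_logs).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Datadog returns 202 Accepted when the batch is queued for ingest.
        return len(dd_logs) if resp.status in (200, 202) else 0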
ARP Event Payload (Normalized by Lambda)
ONTAP EMS webhooks deliver event notifications to the API Gateway endpoint. The Lambda's `_extract_ems_events()` function parses the incoming API Gateway proxy event body, then `_normalize_ems_events()` produces the following internal schema:
{
"event_name": "arw.volume.state",
"severity": "alert",
"source_node": "fsxn-node-01",
"svm": "svm-prod-01",
"timestamp": "2026-05-17T01:04:22Z",
"message": "Anti-ransomware: Volume vol_data state changed to attack-detected",
"parameters": {
"volume_name": "vol_data",
"state": "attack-detected"
}
}
In Datadog, this arrives as:
source:fsxn-ems
host:fsxn-node-01
service:fsxn-ontap
@attributes.event_name:arw.volume.state
@attributes.severity:alert
@attributes.svm:svm-prod-01
@attributes.parameters.volume_name:vol_data
@attributes.parameters.state:attack-detected
Setting Up the Datadog Monitor
Create a Monitor that triggers on any ARP alert:
Monitor Configuration
Log Explorer search query:
source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected
Datadog Monitor API JSON:
{
"name": "🚨 FSx for ONTAP: Ransomware Detected (ARP)",
"type": "log alert",
"query": "logs(\"source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected\").index(\"*\").rollup(\"count\").last(\"5m\") > 0",
"message": "🚨 ONTAP Autonomous Ransomware Protection detected suspicious activity.\n\n**Volume**: {{attributes.parameters.volume_name}}\n**SVM**: {{attributes.svm}}\n**Node**: {{host}}\n**Time**: {{date}}\n\n## Recommended Actions\n1. Verify the ARP event in ONTAP and Datadog.\n2. Check FPolicy/audit logs for user/client IP correlation.\n3. Follow the approved storage incident response runbook for snapshot, access restriction, or recovery actions.\n\n@pagerduty @slack-security-alerts",
"options": {
"thresholds": { "critical": 0 },
"notify_no_data": false,
"evaluation_delay": 0
}
}
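Save the JSON above to a file (here `arp-monitor.json`, an arbitrary name) and create the monitor through the Datadog Monitors API. The host below assumes the AP1 site; substitute your site's API host and supply both an API key and an application key:

# Create the monitor via the Datadog API (AP1 site assumed)
curl -X POST "https://api.ap1.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d @arp-monitor.json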
What This Monitor Does
- Triggers on: Any `arw.volume.state` event with `state:attack-detected`
- Threshold: Critical when count > 0 in a 5-minute window
- Notification: PagerDuty + Slack with volume name, SVM, and response steps
- No-data handling: Disabled (absence of ARP events is normal)

Adjust template variables (`{{attributes.*}}`, `{{host}}`, `{{date}}`) based on how your Datadog site renders log attributes in monitor notifications. Test with a simulated event before relying on production alerts.
FPolicy: The Complementary Signal
While ARP detects the encryption pattern, FPolicy provides the file-level detail. Together they answer:
| Question | Source |
|---|---|
| Is ransomware active? | ARP (EMS) |
| Which files are affected? | FPolicy |
| Who is doing it? | FPolicy (user field) |
| From where? | FPolicy (client_ip field) |
| What operations? | FPolicy (operation: create, write, rename, delete) |
FPolicy Event in Datadog
source:fsxn-fpolicy
@attributes.operation:create
@attributes.file_path:/vol/data/finance/confidential_report.xlsx
@attributes.user:suspicious_user@corp.local
@attributes.client_ip:10.0.1.55
@attributes.protocol:cifs
Correlation Query
After an ARP alert, investigate with FPolicy data:
source:fsxn-fpolicy @attributes.svm:svm-prod-01 @attributes.operation:(create OR write OR rename)
This shows all file modifications on the affected SVM, helping identify the responsible user and client.
Incident Response Workflow
1. ARP fires → EMS webhook → Datadog alert (around 30 seconds)
│
2. Responder receives PagerDuty/Slack notification
│
3. Verify in Datadog and ONTAP:
- source:fsxn-ems → confirm ARP event details
- source:fsxn-fpolicy → identify user, IP, affected files
- ONTAP: security anti-ransomware volume show
│
4. Correlate and assess:
- Is this a true positive or legitimate bulk operation?
- What is the blast radius (volumes, files, users)?
│
5. Containment (only after verification, per approved runbook):
- Create snapshot (preserve recovery point)
- Restrict volume access if confirmed malicious
- Review ARP suspect list
│
6. Recovery:
- Restore from snapshot (pre-attack state)
- Re-enable access after containment
- Update audit policies if gaps found
Important: ARP alerts are high-confidence signals, but false positives can occur (e.g., legitimate backup encryption, bulk file operations). Always verify before applying disruptive containment actions such as restricting volume access. Follow your organization's incident response process.
For a more detailed role-based runbook, see the repository's ARP Incident Response Guide.
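For the ONTAP-side verification and snapshot steps, commands along these lines apply. Volume and SVM names are placeholders and exact syntax varies by ONTAP version; defer to your approved runbook before taking any containment action:

security anti-ransomware volume show -vserver svm-prod-01 -volume vol_data
volume snapshot create -vserver svm-prod-01 -volume vol_data -snapshot arp_incident_recovery_point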
Beyond ARP: Other EMS Use Cases
The same EMS webhook pipeline handles other critical ONTAP events:
| EMS Event | Severity | Use Case |
|---|---|---|
| `arw.volume.state` | alert | Ransomware detection |
| `wafl.quota.softlimit.exceeded` | warning | Capacity planning |
| `wafl.quota.hardlimit.exceeded` | error | Immediate capacity action |
| `cf.fsm.takeover` | alert | HA failover notification |
| `sms.vol.full` | error | Volume full; data at risk |
| `net.linkDown` | warning | Network connectivity issue |
All arrive in Datadog as source:fsxn-ems with the event name in @attributes.event_name, enabling targeted Monitors for each scenario. For the full cross-vendor field mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.
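For example, a capacity Monitor can reuse the exact pattern from the ARP monitor with a different event name in the query:

source:fsxn-ems @attributes.event_name:wafl.quota.softlimit.exceeded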
Validation Results
This integration was validated end-to-end:
| Test | Result | Latency |
|---|---|---|
| ARP event → Datadog | ✅ Arrived | ~30 seconds |
| Quota exceeded → Datadog | ✅ Arrived | ~30 seconds |
| FPolicy file create → Datadog | ✅ Arrived (via SQS → Lambda path) | ~30 seconds |
| Lambda error handling | ✅ DLQ capture | — |
| API key from Secrets Manager | ✅ Cached | — |
Validation performed in ap-northeast-1 with the deployed `fsxn-datadog-ems-fpolicy` stack.
Design Considerations for Security Teams
Webhook security: Use HTTPS for EMS webhook delivery. Do not expose an unauthenticated API Gateway endpoint in production. Validate a shared secret, header, or mTLS identity where possible.
Detection latency: EMS webhooks are event-driven. ARP detection itself depends on ONTAP's ML model — it typically fires within seconds of detecting the pattern, not after a fixed interval. End-to-end latency from ARP detection to Datadog visibility depends on webhook delivery, Lambda processing, and Datadog ingest.
False positives: ARP can trigger on legitimate bulk encryption operations (e.g., backup software encrypting files). Design your response workflow to include a verification step before disruptive actions like restricting volume access.
Coverage: ARP behavior depends on your ONTAP version, volume type, and whether ARP/AI is available. Older NAS FlexVol configurations may start in learning mode before active detection, while newer ONTAP versions (9.16.1+ with ARP/AI) can become active immediately for supported volumes. Always verify security anti-ransomware volume show before relying on alerts.
Audit trail: The EMS event in Datadog serves as the detection timestamp for incident timelines. FPolicy events provide the forensic detail. Together they form a complete audit trail from detection to response.
Cost profile: EMS events are usually low-volume and alert-oriented, while FPolicy can be high-volume depending on policy scope. Treat their Datadog ingest and alerting cost profiles separately.
Try It Yourself
If you want the shortest path to a first successful ARP alert test, see the repository's minimum quick start.
The following simulated event exercises the Lambda normalization and Datadog shipping path. Your actual ONTAP EMS webhook payload may differ depending on EMS webhook configuration, so validate with a real EMS event before production use.
# Deploy EMS + FPolicy integration
aws cloudformation deploy \
--template-file integrations/datadog/template-ems-fpolicy.yaml \
--stack-name fsxn-datadog-ems-fpolicy \
--parameter-overrides \
DatadogApiKeySecretArn=<your-secret-arn> \
DatadogSite=ap1.datadoghq.com \
--capabilities CAPABILITY_NAMED_IAM
# Create a test event file
cat > arp-test-event.json <<EOF
{
"body": "{\"messageName\":\"arw.volume.state\",\"severity\":\"alert\",\"node\":\"fsxn-node-01\",\"svmName\":\"svm-prod-01\",\"time\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"message\":\"Anti-ransomware: Volume vol_data state changed to attack-detected\",\"parameters\":{\"volume_name\":\"vol_data\",\"state\":\"attack-detected\"}}",
"requestContext": {"requestId": "test"}
}
EOF
# Invoke Lambda with the test event
aws lambda invoke \
--function-name fsxn-datadog-ems-fpolicy-ems \
--payload file://arp-test-event.json \
--cli-binary-format raw-in-base64-out \
--region ap-northeast-1 \
arp-test-output.json
# Check Datadog: source:fsxn-ems @attributes.event_name:arw.volume.state
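A successful invocation writes the handler's response to `arp-test-output.json`; expect a 200 status with the processed and shipped counts from `_api_response()`:

# Verify the handler returned 200 with total_events and shipped counts
cat arp-test-output.json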
What's Next
This completes the Datadog series:
- Part 1: Architecture and project introduction
- Part 2: Audit log pipeline implementation
- Part 3: Event-driven ransomware detection (this post)
Coming up next in the series:
- Splunk: Replacing EC2 + Universal Forwarder with Lambda + HEC
- OpenTelemetry: The vendor-neutral escape hatch
- Grafana Cloud: Loki Push API with label cardinality guidance
Each will follow the same pattern: deploy, validate, document the gotchas.
Have questions about ARP detection or the EMS pipeline? Drop a comment below.
Previous: Part 2 — Shipping FSx for ONTAP Logs to Datadog, The Serverless Way
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations