Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on Jun 14

From Logs to Detection: Building a File Security Pipeline in Datadog for FSx for ONTAP

#aws #datadog #security #amazonfsxfornetappontap

TL;DR

For security teams: Four threshold monitors + one ML anomaly monitor + four Cloud SIEM detection rules catch mass file deletion, data exfiltration, permission tampering, and unusual geography in under 5 minutes. EventID codes become human-readable operation names. All created via Datadog API — no manual clicking required.

For engineers: The Log Pipeline (6 processors including GeoIP enrichment) transforms raw ONTAP XML events into searchable, alertable fields. Five Saved Views cover the most common investigation patterns. Total AWS cost remains ~$1.50/month.

For architects: This pattern demonstrates the full detection-to-response lifecycle — from raw file system audit events through enrichment, categorization, alerting, cross-service correlation, and automated remediation — entirely serverless, entirely as code.

FSx for ONTAP → S3 Access Point → Lambda → Datadog Logs API v2
                                              │
                                              ▼
                                   ┌──────────────────────┐
                                   │ Log Pipeline (6)     │
                                   │  • Category Processor│
                                   │  • Status Remapper   │
                                   │  • Date Remapper     │
                                   │  • Attribute Remapper│
                                   │  • GeoIP Enrichment  │
                                   └──────────┬───────────┘
                                              │
                          ┌───────────────────┼───────────────────┐
                          ▼                   ▼                   ▼
                ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
                │ Log Monitors (4)│ │ Cloud SIEM (4)  │ │ Log Archive     │
                │ + Anomaly (1)   │ │ Security Signals│ │ → S3 → Glacier  │
                └────────┬────────┘ └────────┬────────┘ └─────────────────┘
                         │                   │
                         ▼                   ▼
                ┌──────────────────────────────────────┐
                │ Workflow → Slack + Case + Lambda     │
                │          (ONTAP Snapshot remediation)│
                └──────────────────────────────────────┘

Reading guide: Sections 1-9 cover the core pipeline (30 min to deploy). Sections 10+ cover advanced SOC integration (additional 15 min). Jump to Deployment if you want to start immediately.

Business value: This pipeline reduces mean-time-to-detect (MTTD) for file-based threats from "hours/days" (batch log review) to under 5 minutes (real-time alerting). For regulated environments, it provides continuous compliance evidence with automated archival — replacing manual quarterly reviews with always-on monitoring.

This is Part 16 of Serverless Observability for FSx for ONTAP.

Why Post-Ingestion Processing Matters

In Part 1, we shipped raw audit events to Datadog. That's necessary but insufficient. Raw events look like this:

{
  "event_type": "4660",
  "user": "CORP\\contractor-ext-03",
  "path": "/share/engineering/designs/prototype-v3.dwg",
  "result": "Audit Failure"
}

Questions that raw logs can't answer quickly:

What is EventID 4660? (Answer: Object Delete)
Should this alarm? (Answer: depends on volume and context)
Who should investigate? (Answer: Storage team + SOC)
What's the user's baseline? (Answer: need faceted historical view)

This article builds the processing layers that transform raw data into actionable security intelligence.

Architecture

Why EventBridge Scheduler? FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge object-level events. Lambda polls on a schedule and uses checkpointing to process only new files. XML format is chosen for Lambda-native parsing without binary dependencies (vs EVTX which requires Windows-specific libraries).
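The poll-and-checkpoint idea can be sketched as a small pure function. This is a minimal sketch, not the Lambda from this series: it assumes each listed object is a dict with `Key` and `LastModified` (the shape returned by S3 `list_objects_v2`) and that the checkpoint is the `LastModified` of the newest file already processed. The name `select_new_objects` and the sample keys are hypothetical.

```python
from datetime import datetime, timezone

def select_new_objects(objects, checkpoint):
    """Return (objects newer than the checkpoint, oldest first; updated checkpoint)."""
    fresh = sorted(
        (o for o in objects if checkpoint is None or o["LastModified"] > checkpoint),
        key=lambda o: o["LastModified"],
    )
    # Advance the checkpoint only when something new was found
    new_checkpoint = fresh[-1]["LastModified"] if fresh else checkpoint
    return fresh, new_checkpoint

# Example listing with made-up keys and timestamps
listing = [
    {"Key": "audit/0002.xml", "LastModified": datetime(2025, 1, 1, 0, 10, tzinfo=timezone.utc)},
    {"Key": "audit/0001.xml", "LastModified": datetime(2025, 1, 1, 0, 5, tzinfo=timezone.utc)},
    {"Key": "audit/0003.xml", "LastModified": datetime(2025, 1, 1, 0, 15, tzinfo=timezone.utc)},
]
last_run = datetime(2025, 1, 1, 0, 5, tzinfo=timezone.utc)
todo, last_run_next = select_new_objects(listing, last_run)
print([o["Key"] for o in todo])  # ['audit/0002.xml', 'audit/0003.xml']
```

A production version would also need to handle several files sharing one timestamp, for example by storing the last processed key alongside the time.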

┌──────────────────────────────────────────────────────────────────┐
│ Lambda Handler (Python 3.12)                                     │
│  • Parse XML → Normalize → HEC-compatible JSON                   │
│  • POST to Datadog Logs API v2                                   │
│  • Fields: event_type, user, path, client_ip, svm, result        │
└──────────────────────────────────────────────────────────────────┘
        │  HTTP 202
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ Datadog Log Pipeline: "FSx for ONTAP Audit Logs"                 │
│ Filter: source:fsxn                                              │
│                                                                  │
│ 1. Category Processor → @operation_name                          │
│    4663→Object Access, 4660→Object Delete, 4656→Handle Request   │
│                                                                  │
│ 2. Status Remapper → log severity from @result                   │
│    "Audit Success" → info, "Audit Failure" → error               │
│                                                                  │
│ 3. Date Remapper → @timestamp as official log time               │
│                                                                  │
│ 4. Attribute Remapper → @user→usr.id, @client_ip→network.client  │
└──────────────────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ Datadog Security Monitors (3)                                    │
│                                                                  │
│ • [FSxN] Mass File Deletion    → >50 deletes/5min per user       │
│ • [FSxN] Abnormal Access Volume → >1000 accesses/1h per user     │
│ • [FSxN] Access Failure Spike  → >10 failures/15min per user     │
└──────────────────────────────────────────────────────────────────┘

The Log Pipeline

Why a Pipeline?

Without a pipeline, every query requires remembering that 4660 means "delete" and 4663 means "access." With a pipeline, you search @operation_name:"Object Delete" (the quotes are needed because the value contains a space), and Datadog handles the translation at ingest time.

Creating the Pipeline via API

import json, urllib3, boto3

sm = boto3.client('secretsmanager', region_name='ap-northeast-1')
api_key = sm.get_secret_value(SecretId='fsxn-datadog-api-key')['SecretString']
app_key = sm.get_secret_value(SecretId='datadog/fsxn-app-key')['SecretString']

http = urllib3.PoolManager()

pipeline = {
    "name": "FSx for ONTAP Audit Logs",
    "is_enabled": True,
    "filter": {"query": "source:fsxn"},
    "processors": [
        {
            "type": "category-processor",
            "name": "EventID to Operation Name",
            "is_enabled": True,
            "target": "operation_name",
            "categories": [
                {"filter": {"query": "@event_type:4663"}, "name": "Object Access"},
                {"filter": {"query": "@event_type:4656"}, "name": "Handle Request"},
                {"filter": {"query": "@event_type:4660"}, "name": "Object Delete"},
                {"filter": {"query": "@event_type:4670"}, "name": "Permission Change"},
                {"filter": {"query": "@event_type:4658"}, "name": "Handle Close"},
                {"filter": {"query": "@event_type:5140"}, "name": "Share Access"},
                {"filter": {"query": "@event_type:5145"}, "name": "Share Check"},
                {"filter": {"query": "@event_type:4624"}, "name": "Logon"},
                {"filter": {"query": "@event_type:4634"}, "name": "Logoff"},
            ]
        },
        {
            "type": "status-remapper",
            "name": "Map result to log status",
            "is_enabled": True,
            "sources": ["result"]
        },
        {
            "type": "date-remapper",
            "name": "Use event timestamp",
            "is_enabled": True,
            "sources": ["timestamp"]
        },
        {
            "type": "attribute-remapper",
            "name": "Map user to usr.id",
            "is_enabled": True,
            "sources": ["user"],
            "source_type": "attribute",
            "target": "usr.id",
            "target_type": "attribute",
            "preserve_source": True,
            "override_on_conflict": False
        },
        {
            "type": "attribute-remapper",
            "name": "Map client_ip to network.client.ip",
            "is_enabled": True,
            "sources": ["client_ip"],
            "source_type": "attribute",
            "target": "network.client.ip",
            "target_type": "attribute",
            "preserve_source": True,
            "override_on_conflict": False
        }
    ]
}

resp = http.request("POST",
    "https://api.ap1.datadoghq.com/api/v1/logs/config/pipelines",
    body=json.dumps(pipeline).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key
    })
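# Fail loudly on a non-2xx response instead of raising KeyError on 'id'
if resp.status >= 300:
    raise RuntimeError(f"Pipeline creation failed: HTTP {resp.status} {resp.data.decode('utf-8', 'replace')}")
print(f"HTTP {resp.status}: Pipeline ID = {json.loads(resp.data)['id']}")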

Site note: Replace api.ap1.datadoghq.com with your Datadog site's API endpoint (e.g., api.datadoghq.com for US1, api.datadoghq.eu for EU1).
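To avoid hardcoding the endpoint, the site can be turned into a base URL with a small helper. This is a sketch that assumes the `api.<site>` pattern shown in the examples above holds for your site; the function name `datadog_api_base` is hypothetical.

```python
def datadog_api_base(site: str) -> str:
    """Build the API base URL for a Datadog site such as 'ap1.datadoghq.com'."""
    site = site.strip().lower()
    if not site or "/" in site:
        raise ValueError(f"unexpected Datadog site: {site!r}")
    return f"https://api.{site}"

PIPELINES_PATH = "/api/v1/logs/config/pipelines"

print(datadog_api_base("ap1.datadoghq.com") + PIPELINES_PATH)
# https://api.ap1.datadoghq.com/api/v1/logs/config/pipelines
```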

EventID Mapping Table

EventID	Operation Name	MITRE ATT&CK	Description
4663	Object Access	T1005 (Data from Local System)	File read/write operation
4656	Handle Request	T1005	File handle opened
4660	Object Delete	T1485 (Data Destruction)	File deleted
4670	Permission Change	T1222 (File and Directory Permissions Modification)	ACL/permission modified
4658	Handle Close	—	File handle closed (low signal)
5140	Share Access	T1021.002 (SMB/Windows Admin Shares)	SMB share connected
5145	Share Check	T1135 (Network Share Discovery)	Share permission checked
4624	Logon	T1078 (Valid Accounts)	Authentication event
4634	Logoff	—	Session ended
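The same table can be kept as a lookup in code, which is handy for unit-testing sample events before they reach Datadog. This is an illustrative sketch only: the Category Processor does the real mapping at ingest, and the `mitre_technique` field and `describe` helper are hypothetical names.

```python
# EventID -> (operation name, MITRE ATT&CK technique ID or None), mirroring the table above
EVENT_CATALOG = {
    "4663": ("Object Access", "T1005"),
    "4656": ("Handle Request", "T1005"),
    "4660": ("Object Delete", "T1485"),
    "4670": ("Permission Change", "T1222"),
    "4658": ("Handle Close", None),
    "5140": ("Share Access", "T1021.002"),
    "5145": ("Share Check", "T1135"),
    "4624": ("Logon", "T1078"),
    "4634": ("Logoff", None),
}

def describe(event):
    """Return a copy of the event with a readable operation name and technique ID."""
    name, technique = EVENT_CATALOG.get(str(event.get("event_type")), ("Unknown", None))
    return {**event, "operation_name": name, "mitre_technique": technique}

sample = {"event_type": "4660", "user": "CORP\\contractor-ext-03"}
print(describe(sample)["operation_name"])  # Object Delete
```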

Security Monitors

Monitor 1: Mass File Deletion

mass_delete = {
    "name": "[FSxN] Mass File Deletion Detected",
    "type": "log alert",
    "query": 'logs("source:fsxn @event_type:4660").index("*").rollup("count").by("@user").last("5m") > 50',
    "message": """## Mass File Deletion Alert

A user has triggered more than 50 file deletion events within 5 minutes.

**User**: {{@user.name}}  |  **Count**: {{value}}

### Investigation Steps
1. Check affected paths: `source:fsxn @event_type:4660 @user:{{@user.name}}`
2. Verify if this is a scheduled cleanup or authorized bulk operation
3. Check the client IP for unexpected sources
4. Correlate with user's normal deletion patterns

@slack-storage-alerts""",
    "tags": ["source:fsxn", "team:storage", "severity:high"],
    "options": {
        "thresholds": {"critical": 50, "warning": 20},
        "notify_no_data": False,
        "renotify_interval": 60,
        "evaluation_delay": 60
    }
}

Why 50? In our test environment, normal daily deletions per user stay under 10. Tune the threshold to your environment's baseline: in the Log Explorer, search source:fsxn @event_type:4660, group the results by @user, and review a week of data to establish what's normal.

Monitor 2: Abnormal Access Volume

abnormal_access = {
    "name": "[FSxN] Abnormal Access Volume",
    "type": "log alert",
    "query": 'logs("source:fsxn @result:\\"Audit Success\\"").index("*").rollup("count").by("@user").last("1h") > 1000',
    "message": """## Abnormal Access Volume Alert

More than 1000 successful file access events in 1 hour.
May indicate data exfiltration or unauthorized bulk access.

**User**: {{@user.name}}  |  **Count**: {{value}}

### Investigation Steps
1. Review patterns: `source:fsxn @user:{{@user.name}}`
2. Check if backup jobs or batch operations are running
3. Verify client IP and compare with known locations
4. Check accessed paths for sensitive content

@slack-security-alerts""",
    "options": {
        "thresholds": {"critical": 1000, "warning": 500},
        "evaluation_delay": 60
    }
}

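Tuning note: Backup service accounts (svc-backup, svc-indexer) generate legitimate high-volume access. Exclude them by adding -@user:svc-* to the query, or create a suppression rule. If usernames carry a domain prefix, as in CORP\svc-backup, widen the pattern to -@user:*svc-*.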

Monitor 3: Access Failure Spike

failure_spike = {
    "name": "[FSxN] Access Failure Spike",
    "type": "log alert",
    "query": 'logs("source:fsxn @result:\\"Audit Failure\\"").index("*").rollup("count").by("@user").last("15m") > 10',
    "message": """## Access Failure Spike

More than 10 access failures in 15 minutes.
May indicate unauthorized access attempts or permission misconfiguration.

**User**: {{@user.name}}  |  **Count**: {{value}}

### Investigation Steps
1. Check failed paths: `source:fsxn @result:"Audit Failure" @user:{{@user.name}}`
2. Verify if permissions were recently changed
3. Check if user is accessing resources outside their scope
4. Correlate with AD group membership changes

@slack-security-alerts""",
    "options": {
        "thresholds": {"critical": 10, "warning": 5},
        "evaluation_delay": 60
    }
}
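Each monitor definition above is sent to the monitor endpoint (`POST /api/v1/monitor`) with the same headers used for the pipeline. The sketch below only builds the request so it can be inspected or tested before anything is sent; the helper name `build_monitor_request` and the placeholder keys are hypothetical.

```python
import json

def build_monitor_request(site, monitor, api_key, app_key):
    """Return (url, body, headers) for creating one monitor. Nothing is sent here."""
    url = f"https://api.{site}/api/v1/monitor"
    headers = {
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }
    return url, json.dumps(monitor).encode("utf-8"), headers

example_monitor = {
    "name": "[FSxN] Access Failure Spike",
    "type": "log alert",
    "query": 'logs("source:fsxn @result:\\"Audit Failure\\"").index("*").rollup("count").by("@user").last("15m") > 10',
    "message": "More than 10 access failures in 15 minutes. @slack-security-alerts",
    "options": {"thresholds": {"critical": 10, "warning": 5}, "evaluation_delay": 60},
}

url, body, headers = build_monitor_request("ap1.datadoghq.com", example_monitor, "<api-key>", "<app-key>")
print(url)  # https://api.ap1.datadoghq.com/api/v1/monitor
# To send: urllib3.PoolManager().request("POST", url, body=body, headers=headers)
```

Keeping the builder separate from the HTTP call makes it easy to check in review that the threshold in the query and the `critical` value in `options` agree.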

Saved Views

Five pre-configured views for common investigation patterns:

View	Query	When to use
FSxN File Deletions	`source:fsxn @event_type:4660`	Monitor triggered, investigating which files
FSxN Access Failures	`source:fsxn @result:"Audit Failure"`	Permission denied investigation
FSxN All Events	`source:fsxn`	General audit stream
FSxN Sensitive Share Access	`source:fsxn (@path:*finance* OR @path:*hr* OR @path:*legal*)`	Sensitive data access review
FSxN After-Hours Access	`source:fsxn` (filtered by time)	Off-hours activity detection

Each view includes pre-configured columns (user, path, client_ip, event_type) so analysts don't need to customize the table every time.

Facets for One-Click Filtering

Facets add clickable filters to the left sidebar. Create them from any log entry's detail panel:

Facet	Field	Why
Event Type	`@event_type`	Filter to specific operations (4660=delete, 4663=access)
Operation	`@operation_name`	Human-readable version (after pipeline applies)
User	`@user`	Filter by specific user under investigation
SVM	`@svm`	Isolate events to specific storage virtual machines
File Path	`@path`	Filter by directory or share
Client IP	`@client_ip`	Filter by source workstation/server
Result	`@result`	Quick split between success and failure events

Note: Fields are searchable with @field:value syntax even without Facets. Facets add UI convenience — they're not required for alerting or saved views.

Cost Reality

Component	Monthly Cost	Notes
Lambda (5-min schedule, ~1s execution)	~$0.50	8,640 invocations/month
Secrets Manager (2 secrets)	~$0.80	API key + APP key
EventBridge Scheduler	~$0.00	Free tier covers this
S3 AP reads	~$0.05	Depends on file count
Datadog Forwarder Lambda	~$0.10	CloudTrail event processing
S3 Archive storage	~$0.02	Glacier tier after 30 days
AWS total	~$1.50/month
Datadog log ingestion	~$0.10/GB	Conventional pricing
Total (1GB/month)	~$1.60/month

Datadog pricing caveat: Log ingestion pricing varies by plan (Conventional vs On-Demand) and contract. The ~$0.10/GB is the list price for conventional plans; actual cost depends on your contract, committed volume, and retention choices.
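The totals in the cost table can be reproduced with a few lines of arithmetic. The figures are the approximate ones from the table; the per-GB price is the list-price assumption stated above, so treat the result as an estimate rather than a quote.

```python
# Approximate monthly AWS costs in USD, taken from the table
aws_costs = {
    "Lambda": 0.50,
    "Secrets Manager (2 secrets)": 0.80,
    "EventBridge Scheduler": 0.00,
    "S3 AP reads": 0.05,
    "Datadog Forwarder Lambda": 0.10,
    "S3 Archive storage": 0.02,
}
aws_total = round(sum(aws_costs.values()), 2)

# A 5-minute schedule over a 30-day month
invocations = (60 // 5) * 24 * 30

def monthly_total(gb_ingested, price_per_gb=0.10):
    """AWS costs plus Datadog ingestion at an assumed per-GB price."""
    return round(aws_total + gb_ingested * price_per_gb, 2)

print(aws_total)         # 1.47 (the table rounds this to ~1.50)
print(invocations)       # 8640
print(monthly_total(1))  # 1.57
```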

Cloud SIEM pricing: Cloud SIEM is a separate paid feature (~$0.20/GB analyzed for log detection). A 30-day free trial is available. Security Signals, Detection Rules, and the triage workflow require Cloud SIEM. Log Monitors (threshold alerts to Slack) work without it.

Cost optimization patterns

Filter before shipping — Drop low-value EventIDs (4658 Handle Close) at the Lambda level to reduce ingestion volume
Compress payloads — Enable gzip on the HTTP POST to reduce egress
Tune audit scope — Configure ONTAP vserver audit to monitor only the shares that matter
Index retention — Use Datadog's flexible retention (15/30/60/90 days) to match compliance needs vs cost
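The first two patterns can be combined in the Lambda before the POST. This is a minimal sketch under the assumption that events are dicts with an `event_type` field, as in the raw example earlier; the helper name `prepare_payload` is hypothetical. The intake accepts gzip bodies when the `Content-Encoding: gzip` header is set.

```python
import gzip
import json

# EventIDs considered too noisy to ship (4658 = Handle Close)
LOW_VALUE_EVENT_IDS = {"4658"}

def prepare_payload(events):
    """Drop low-value events, then gzip the JSON body. Returns (body, headers, kept_count)."""
    kept = [e for e in events if str(e.get("event_type")) not in LOW_VALUE_EVENT_IDS]
    body = gzip.compress(json.dumps(kept).encode("utf-8"))
    headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}
    return body, headers, len(kept)

events = [
    {"event_type": "4660", "user": "CORP\\contractor-ext-03", "ddsource": "fsxn"},
    {"event_type": "4658", "user": "CORP\\contractor-ext-03", "ddsource": "fsxn"},
]
body, headers, kept_count = prepare_payload(events)
print(kept_count)  # 1
```

Compression pays off on larger batches; for a handful of events the gzip header overhead can outweigh the savings.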

E2E Verification Results

Verified on Datadog AP1 (ap1.datadoghq.com) with paid plan, June 2026:

Component	Status
Lambda → Datadog Logs API v2	✅ HTTP 202
Log Pipeline (5 processors)	✅ Applied automatically
Category Processor (9 EventIDs)	✅ `@operation_name` populated
Status Remapper	✅ Failure events marked as error
Date Remapper	✅ Event timestamp used (not ingest time)
Attribute Remapper (usr.id)	✅ Enables Datadog user features
Mass Delete Monitor	✅ ALERT triggered (55 events > threshold 50)
Access Failure Monitor	✅ ALERT triggered (12 events > threshold 10)
Abnormal Access Monitor	✅ Active (OK — high threshold by design)
Anomaly Monitor (ML)	✅ Active (learning baseline)
Saved Views (5)	✅ Accessible from Views dropdown
Facets (8 custom)	✅ Created via UI
Dashboard (10 widgets)	✅ FSx ONTAP Audit Log Overview
Log-based Metrics (4)	✅ `fsxn.audit.*` in Metrics Explorer
Sensitive Data Scanner (5 rules)	✅ PII detection active
Cloud SIEM Detection Rules (4)	✅ Security Signals generated
CloudTrail Forwarder	✅ Logs arriving in Datadog
GeoIP Enrichment	✅ Country/city populated on client_ip
GuardDuty Correlation Rule	✅ Cross-service detection active
Workflow Automation	✅ Monitor → Slack + Case
Case Management (FSXN project)	✅ Project + template case
SOC Triage Runbook	✅ 7-step Notebook
Log Archive (S3 → Glacier)	✅ source:fsxn archived
Snapshot Remediation Lambda	✅ Deployed (fsxn-snapshot-remediation)

Log-based Metrics

Log-based metrics extract numerical values from logs at ingest time — you get metric-resolution dashboards and anomaly detection without paying for log retention beyond the evaluation window.

# Created via Datadog API (no UI needed)
metrics = [
    "fsxn.audit.delete_count",         # Delete events grouped by user/SVM
    "fsxn.audit.access_failure_count",  # Failures by user/SVM/client_ip
    "fsxn.audit.event_count",           # All events by event_type/SVM
    "fsxn.audit.unique_users",          # Active users
]

Use cases:

Anomaly detection on fsxn.audit.delete_count — Datadog's built-in anomaly monitor catches unusual deletion patterns without manual thresholds
SLO tracking — Define an SLO on access failure rate (fsxn.audit.access_failure_count / fsxn.audit.event_count)
Cost efficiency — Metrics are retained for 15 months at metric resolution (vs log retention at full-text cost)
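For reference, a log-based metric is created by posting a definition to `/api/v2/logs/config/metrics`. The sketch below builds that request body for the delete counter; the filter query and the `user`/`svm` tag names are assumptions chosen to match the comments above, not the exact definitions used by the setup script.

```python
def build_log_metric(metric_id, query, group_by):
    """Build the request body for one count-type log-based metric."""
    return {
        "data": {
            "type": "logs_metrics",
            "id": metric_id,
            "attributes": {
                "compute": {"aggregation_type": "count"},
                "filter": {"query": query},
                # Each entry maps a log attribute path to a metric tag name
                "group_by": [{"path": path, "tag_name": tag} for path, tag in group_by],
            },
        }
    }

delete_metric = build_log_metric(
    "fsxn.audit.delete_count",
    "source:fsxn @event_type:4660",
    [("@user", "user"), ("@svm", "svm")],
)
print(delete_metric["data"]["attributes"]["group_by"])
```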

Sensitive Data Scanner

Audit logs frequently contain PII in file paths (/hr/EMP-123456-performance-review.xlsx) and user contexts. Datadog's Sensitive Data Scanner catches these before they reach analysts.

Rule	What It Catches	Why It Matters
Employee ID (`EMP-\d{6}`)	Employee records in paths	PII exposure prevention
JP Phone (`0[789]0-\d{4}-\d{4}`)	Mobile numbers in filenames	Privacy protection
Email in Path	Customer emails in file names	Data minimization
Credit Card	CC numbers in documents	PCI-DSS technical control
My Number	Japanese national ID	Sensitive PII detection

Compliance note: Sensitive Data Scanner is a technical detection and redaction control. It does not constitute full APPI/GDPR/PCI-DSS compliance, which requires organizational measures (consent management, purpose limitation, data subject rights, etc.). Consult your legal/compliance team for regulatory assessment.

Matched patterns are partially redacted at ingest — the first 3 characters remain for identification, the rest is replaced with [REDACTED]. This preserves investigation capability while protecting the data subject.
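The effect of partial redaction can be illustrated locally with the two regex rules from the table. This is a sketch of the behavior described above, not Datadog's implementation; the helper name `partially_redact` is hypothetical.

```python
import re

# Two of the patterns from the rule table
PATTERNS = [
    re.compile(r"EMP-\d{6}"),             # Employee ID
    re.compile(r"0[789]0-\d{4}-\d{4}"),   # Japanese mobile number
]

def partially_redact(text, keep=3):
    """Keep the first `keep` characters of each match and mask the rest."""
    def mask(match):
        return match.group(0)[:keep] + "[REDACTED]"
    for pattern in PATTERNS:
        text = pattern.sub(mask, text)
    return text

print(partially_redact("/hr/EMP-123456-performance-review.xlsx"))
# /hr/EMP[REDACTED]-performance-review.xlsx
print(partially_redact("/sales/contact-090-1234-5678.txt"))
# /sales/contact-090[REDACTED].txt
```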

Enhanced Dashboard

The SOC-focused dashboard provides at-a-glance visibility into:

Who is generating the most activity (top users, top IPs)
What operations dominate (sunburst by operation/SVM)
Where access is concentrated (hot paths)
When failures spike (timeline with per-user breakdown)
Trend from log-based metrics (delete rate over time)

Cloud SIEM Security Signals

Beyond log monitors (which notify via Slack/PagerDuty), Security Signals integrate with Datadog's SOC workflow — triage, investigation, and response all in one place.

Log Monitors vs Cloud SIEM Detection Rules

Aspect	Log Monitors	Cloud SIEM Detection Rules
Output	Alert notification (Slack/PagerDuty/email)	Security Signal (appears in Security Signals panel)
Triage	Manual (click alert → investigate)	Built-in triage workflow (Open → In Progress → Archived)
MITRE mapping	Manual tags	Native MITRE ATT&CK framework integration
Correlation	Single query	Cross-log-source correlation possible
Case integration	None	Direct "Create Case" from signal
Best for	Operational alerts (DevOps/SRE)	Security investigation (SOC/IR teams)

Recommendation: Use both. Log Monitors for immediate Slack notification to storage team. Cloud SIEM Detection Rules for SOC triage workflow and MITRE-mapped investigation.

The following detection rules generate Security Signals when FSxN audit logs match threat patterns:

Rule	MITRE ATT&CK	Trigger	Severity
Mass File Deletion	T1485 Data Destruction	>50 deletes/5min per user	Critical
Brute Force File Access	T1110 Brute Force	>20 failures/15min per user+IP	High
Permission Tampering	T1222 File and Directory Permissions Modification	>5 changes/10min per user	High

# Creating a Security Signal rule via API
rule = {
    "name": "FSxN: Mass File Deletion",
    "type": "log_detection",
    "isEnabled": True,  # required by the rule creation endpoint
    "queries": [{
        "query": "source:fsxn @event_type:4660",
        "groupByFields": ["@user"],
        "aggregation": "count",
        "name": "delete_events",
        "dataSource": "logs"
    }],
    "cases": [
        {"condition": "delete_events > 50", "status": "critical", "name": "Critical"},
        {"condition": "delete_events > 20", "status": "high", "name": "High"}
    ],
    "options": {
        "evaluationWindow": 300,
        "keepAlive": 3600,
        "maxSignalDuration": 86400,
        "detectionMethod": "threshold"
    },
    "message": "User {{@user}} deleted {{value}} files in 5 min. T1485.",
    "tags": ["source:fsxn", "technique:T1485-data-destruction"]
}

Each signal includes investigation steps and response guidance directly in the signal panel — analysts don't need to leave the Security Signals view to understand what happened.
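The order of the `cases` in the rule matters. Assuming cases are evaluated top to bottom and the first match sets the signal severity (which is how Datadog documents rule cases), the two cases in the rule above behave like this sketch. The function name `signal_status` is hypothetical.

```python
# (status, threshold) pairs in the same order as the rule's "cases"
CASES = [("critical", 50), ("high", 20)]

def signal_status(delete_events, cases=CASES):
    """Return the status of the first case whose condition matches, or None."""
    for status, threshold in cases:
        if delete_events > threshold:
            return status
    return None

print(signal_status(55))  # critical
print(signal_status(30))  # high
print(signal_status(20))  # None
```

If the order were reversed, every count above 50 would also match the "high" case first and never be reported as critical.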

Cloud SIEM Setup Steps

Navigate to Security → Cloud SIEM → Get Started
Skip Content Packs (FSxN is a custom source)
Select fsxn as a log source → Enable as Trial
Configure Cloud SIEM Index (default: 450 days retention)
Detection rules are automatically applied to incoming source:fsxn logs

Workflow Automation

When a critical monitor fires, the Workflow automatically triggers response actions:

Monitor Alert → @workflow-fsxn-security-alert-response → Slack notification + Investigation links

The workflow (fsxn-security-alert-response) is linked to monitors via the @workflow-<handle> mention syntax in the monitor message. No additional configuration needed — just add the mention to any monitor's notification body.

Extensions (via Datadog Workflow builder):

Auto-create Jira ticket with investigation context
Query Active Directory for user's manager
Invoke fsxn-snapshot-remediation Lambda for evidence preservation (deployed — see Automated Snapshot Remediation)

Investigation Notebook

A pre-built 5-step investigation template guides analysts through alert triage:

Step	Content	Widget
1. Identify Scope	Event type distribution timeline	Timeseries (bars)
2. User Timeline	Delete events by user over time	Timeseries (line)
3. Affected Files	Top 20 deleted paths	Top List
4. Client IP Analysis	Top 10 source IPs	Top List
5. Conclusion	Root cause, impact, action template	Markdown

Access at: Notebooks → "FSxN Audit Log Investigation Template"

Each step includes pre-configured queries — analysts just adjust the time window and user filter to match the alert context.

RBAC and Access Control

The FSxN Security Analyst role provides scoped access to audit logs:

Role	Permissions	Use Case
FSxN Security Analyst	`logs_read_data`, `logs_read_index_data`	SOC analysts investigating file access
Datadog Standard Role	Full access	Storage admins managing pipeline

Assign users to this role to grant log access without exposing infrastructure metrics or APM data. Combined with Saved Views, analysts see only the investigation tools relevant to their role.

OTel Collector Bridge

For teams already running OpenTelemetry Collector, an alternative delivery path routes FSxN logs through OTel with trace context injection:

# otel-bridge/collector-config.yaml (excerpt)
receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }

processors:
  resource:
    attributes:
      - key: service.name
        value: ontap-audit
        action: upsert
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["ddsource"], "fsxn")

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE:-ap1.datadoghq.com}

Benefits over direct Lambda→Datadog:

trace_id injection for distributed tracing correlation
Multi-backend fanout (Datadog + Grafana + S3 in parallel)
Attribute enrichment at collector level (team, environment tags)
Sampling for high-volume environments
Edge-side PII redaction — Mask sensitive fields before they leave your AWS account (stronger than relying solely on Datadog's Sensitive Data Scanner)

Edge redaction pattern: Use OTel's transform processor to hash user and client_ip and truncate path to directory level BEFORE export to Datadog. This ensures PII never crosses your network boundary — even if Datadog's SDS configuration is misconfigured or disabled, your data is already protected at the source.
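The intended result of that transformation can be sketched in Python. This shows what the fields should look like after the edge step, not collector configuration; the helper name `redact_at_edge`, the salt handling, and the 16-character digest length are assumptions for illustration.

```python
import hashlib
import posixpath

def redact_at_edge(event, salt):
    """Hash user and client_ip, and truncate path to its directory."""
    def hashed(value):
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

    out = dict(event)
    for field in ("user", "client_ip"):
        if out.get(field):
            out[field] = hashed(out[field])
    if out.get("path"):
        out["path"] = posixpath.dirname(out["path"])
    return out

event = {
    "event_type": "4660",
    "user": "CORP\\contractor-ext-03",
    "client_ip": "10.0.12.34",  # made-up address
    "path": "/share/engineering/designs/prototype-v3.dwg",
}
redacted = redact_at_edge(event, salt="example-salt")
print(redacted["path"])  # /share/engineering/designs
```

Hashing keeps per-user grouping in monitors intact because the same input always yields the same token. The salt should be kept secret, since IP addresses and usernames come from small value spaces that are easy to enumerate.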

Status: Config syntax verified. Integration testing requires deploying OTel Collector on ECS (see Part 7 for the ECS-based Collector pattern used with Grafana and Honeycomb).

Log Archives and Compliance

For regulatory retention (adjust based on your organization's audit policy), a dedicated CloudFormation template deploys an S3 archive with Glacier lifecycle:

aws cloudformation deploy \
  --template-file integrations/datadog/template-log-archive.yaml \
  --stack-name fsxn-datadog-archive \
  --parameter-overrides \
    DatadogExternalId=<from-datadog-integration-page> \
    RetentionDays=30 \
    GlacierRetentionDays=2555 \
  --capabilities CAPABILITY_NAMED_IAM

This separates concerns:

Detection — Hot logs in Datadog (15-30 day index retention)
Compliance — Cold archive in S3 → Glacier (7+ years)
Investigation — Rehydrate specific time ranges on demand

Anomaly Detection (ML-based)

Beyond static thresholds, the anomaly monitor on fsxn.audit.delete_count uses Datadog's agile algorithm to learn each user's baseline:

# Anomaly monitor — no manual threshold needed
"avg(last_4h):anomalies(
    sum:fsxn.audit.delete_count{*} by {user}.as_count(),
    'agile', 3, direction='above'
) >= 1"

When a user who normally deletes 5 files/day deletes 200 over the course of a day, the alert fires even if no 5-minute window ever crosses the static threshold of 50. This catches slow, sustained deletion that threshold-based monitors miss.

Baseline period: The anomaly algorithm needs ~2 weeks of data to build confidence. Deploy early — it stays silent during the learning period.

Cardinality Management

⚠️ Log-based metrics with group_by: user create one time series per unique user. For organizations with 10,000+ users:

Metric	Recommended group_by	Reason
`fsxn.audit.event_count`	svm only	Broad metric, low cardinality
`fsxn.audit.delete_count`	user, svm	Targeted, high signal
`fsxn.audit.access_failure_count`	user, svm, client_ip	Investigation-focused
`fsxn.audit.unique_users`	user	Intentionally per-user

Monitor cardinality in Metrics Summary (fsxn.audit.*). Datadog bills per unique tag-value combination — keep per-user grouping only on metrics where individual user behavior matters.
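A quick way to reason about this is to multiply the number of distinct values per tag, which gives an upper bound on time series for one metric. The environment sizes below are made up for illustration; real counts are lower because only combinations that actually occur create a series.

```python
from math import prod

def max_series(unique_values_per_tag):
    """Upper bound on time series for one metric: product of distinct values per tag."""
    return prod(unique_values_per_tag.values())

# Hypothetical environment: 10,000 users, 4 SVMs, 2,000 client IPs
print(max_series({"svm": 4}))                                       # 4
print(max_series({"user": 10_000, "svm": 4}))                       # 40000
print(max_series({"user": 10_000, "svm": 4, "client_ip": 2_000}))   # 80000000
```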

What's Different from Splunk/CrowdStrike?

Aspect	Datadog	Splunk (Part 8)	CrowdStrike (Part 15)
Delivery protocol	Logs API v2 (JSON)	HEC (JSON)	HEC (JSON)
Pipeline config	API-driven (Python)	props.conf / UI	LogScale parser YAML
Monitor creation	API (`POST /api/v1/monitor`)	Saved searches + alerts	CQL + alert actions
EventID mapping	Category Processor	eval/lookup	LogScale parser
Identity correlation	`usr.id` attribute	`user` field	Falcon Identity
Free tier	❌ (paid only for logs)	❌	❌
API management	✅ Full (Pipeline + Monitor + Dashboard)	Partial	Limited

The key Datadog advantage: everything is API-manageable. Pipeline, monitors, dashboards, and saved views can all be created, updated, and version-controlled through the Datadog API — making infrastructure-as-code for observability fully achievable.

Production Recommendations

Use the API, not the UI — Version-control your Pipeline, Monitor, and Dashboard definitions in a setup script. When you need to adjust thresholds or add EventIDs, it's a code change with review.
Separate APP key from API key — The API key ships logs; the APP key manages configuration. Store both in Secrets Manager with different IAM policies.
Exclude service accounts: Add -@user:svc-* to monitor queries or create Datadog suppression rules for known batch accounts.
Start with Warning, promote to Critical — Deploy monitors with Warning thresholds first, observe for 1-2 weeks, then tighten to Critical after confirming the baseline.
Connect to incident workflow — Route Critical monitors to PagerDuty/Slack and include investigation links (Saved View URLs) in the alert message.
Enable Log Archives for compliance — Configure S3 archive with Glacier lifecycle for FISC/SOC2 retention requirements. Rehydrate on demand for historical investigation.
Use Workflow Automation for response — Configure Datadog Workflows to automatically create Jira tickets, trigger Slack quick-actions, or invoke Lambda remediation when critical monitors fire.
Manage with Terraform for production — For enterprise/multi-org deployments, use the Datadog Terraform Provider to version-control Pipelines, Monitors, and Detection Rules. Separate API key (logs_write only) from APP key (admin-level, CI/CD pipeline only).
Apply IAM Permissions Boundary — In large organizations, apply a Permissions Boundary to the log-shipping Lambda role to prevent privilege escalation. The boundary should allow only s3:GetObject, s3:ListBucket, secretsmanager:GetSecretValue, and logs:* (CloudWatch).
Encrypt with customer-managed keys (CMK) — For regulated environments, enable Datadog Log Management Encryption with your KMS CMK, and configure the S3 archive bucket with SSE-KMS. This ensures audit logs are encrypted with keys you control at every stage.
Tune detection rules (anti-alert-fatigue): Run all new Cloud SIEM rules in "Warning / dev" status for 2 weeks. Exclude known service accounts with -@usr.id:svc-* in queries. Review false positives weekly and adjust thresholds or add suppression rules before promoting to Critical.
Data Integrity: Handle HTTP 429/5xx. The Datadog Logs API v2 returns HTTP 202 when a batch is accepted (accepted, not yet indexed). On HTTP 429 (rate limit) or 5xx (transient), retry with exponential backoff (base 1s, max 5 retries). After the final failure, route the complete payload to an SQS DLQ for replay. Never drop logs silently.
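The retry guidance in the last item can be sketched as follows. This is a minimal sketch under stated assumptions: `send` is any callable that posts the payload and returns the HTTP status code, and `dead_letter` is any callable that stores the payload for replay (for example a wrapper around an SQS send). Both names, and the choice to dead-letter non-retryable errors immediately, are hypothetical.

```python
import time

def ship_with_retry(send, payload, dead_letter, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Send a payload; retry 429/5xx with exponential backoff, then dead-letter it."""
    attempt = 0
    while True:
        status = send(payload)
        if 200 <= status < 300:
            return True
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt >= max_retries:
            dead_letter(payload, status)  # never drop the batch silently
            return False
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s
        attempt += 1

# Dry run with fakes: the endpoint always answers 503
waits, parked = [], []
ok = ship_with_retry(lambda p: 503, b"[]", lambda p, s: parked.append((p, s)), sleep=waits.append)
print(ok, waits, len(parked))  # False [1.0, 2.0, 4.0, 8.0, 16.0] 1
```

Adding random jitter to each delay is a common refinement when many Lambda invocations may retry at the same moment.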

Deployment

# 1. Deploy the CloudFormation stack (Lambda + Scheduler + DLQ + Alarms)
aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=arn:aws:s3:<region>:<account>:accesspoint/<name> \
    DatadogApiKeySecretArn=<secret-arn> \
    DatadogSite=ap1.datadoghq.com \
  --capabilities CAPABILITY_NAMED_IAM

# 2. Deploy full observability (Pipeline + Monitors + Metrics + Scanner)
export DD_API_KEY_SECRET_ID="fsxn-datadog-api-key"
export DD_APP_KEY_SECRET_ID="datadog/fsxn-app-key"
export DD_SITE="ap1.datadoghq.com"
bash integrations/datadog/scripts/setup-full-observability.sh

# 3. (Optional) Deploy Log Archive for compliance
aws cloudformation deploy \
  --template-file integrations/datadog/template-log-archive.yaml \
  --stack-name fsxn-datadog-archive \
  --parameter-overrides DatadogExternalId=<external-id> \
  --capabilities CAPABILITY_NAMED_IAM

# 4. Create Facets (guided manual step)
bash integrations/datadog/scripts/setup-facets.sh

# 5. Enable Cloud SIEM (UI: Security → Cloud SIEM → Get Started → select fsxn)

Time from zero to full detection capability: ~45 minutes.

Pre-Deployment Checklist

Before deploying to production, confirm:

[ ] FSx audit configuration confirmed (XML format, rotation schedule)
[ ] S3 Access Point created and Lambda role authorized
[ ] Datadog API Key + APP Key stored in Secrets Manager
[ ] Datadog site region confirmed (AP1/US1/EU1)
[ ] Data classification sign-off: audit logs contain user PII (usernames, IPs, file paths) — confirm external transmission is approved per organization policy
[ ] Retention requirements defined (Datadog index retention + S3 archive lifecycle)
[ ] On-call/escalation path defined for Critical signals
[ ] Service accounts identified for monitor exclusion (svc-*)
[ ] Cloud SIEM trial or paid plan confirmed (for Security Signals)
[ ] Network path validated (VPC-external Lambda or NAT Gateway for S3 AP access)

Lessons Learned

1. API-first beats UI-first

Creating the Pipeline via API took 30 seconds. Doing it through the UI takes 5 minutes of clicking through dropdowns. More importantly, the API approach is repeatable, reviewable, and version-controlled.
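As a rough illustration of what "via API" means here, the sketch below builds (but does not send) a request to Datadog's Logs Pipelines endpoint using only the standard library. The pipeline name and the empty processor list are placeholders, and `build_pipeline_request` is a hypothetical helper, not a function from the repository.

```python
import json
import urllib.request


def build_pipeline_request(site, api_key, app_key, pipeline):
    """Return a POST request for the Logs Pipelines endpoint (not sent here)."""
    return urllib.request.Request(
        url=f"https://api.{site}/api/v1/logs/config/pipelines",
        data=json.dumps(pipeline).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
    )


# Placeholder pipeline definition; real processors would go in the list
pipeline = {
    "name": "FSxN Audit Pipeline (example)",
    "is_enabled": True,
    "filter": {"query": "source:fsxn"},
    "processors": [],
}
req = build_pipeline_request("ap1.datadoghq.com", "<api-key>", "<app-key>", pipeline)
# urllib.request.urlopen(req) would submit it
```

Because the definition is a plain dictionary, it can live in version control and go through code review like any other change.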

2. Facets are UX, not functionality

Every field is searchable with @field:value whether or not you create a Facet. Facets just add sidebar convenience. Don't block on Facet setup — your monitors work without them.

3. Category Processor is the key transformation

Without it, analysts need to memorize Windows EventID codes. With it, @operation_name:Object Delete is immediately understandable. This single processor provides more investigative value than any other pipeline step.
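A Category Processor definition might look roughly like this. The two IDs are standard Windows event IDs used for illustration; whether your ONTAP audit configuration emits the same IDs, and whether the ID lands in an attribute called `@event_id`, are assumptions to verify against your own logs.

```python
def category_processor(event_names, source_attr="@event_id", target="operation_name"):
    """Build a Category Processor that maps event IDs to readable names."""
    return {
        "type": "category-processor",
        "name": "Map EventID to operation name",
        "is_enabled": True,
        "target": target,
        # One category per ID: the first matching filter sets the target value
        "categories": [
            {"filter": {"query": f"{source_attr}:{event_id}"}, "name": name}
            for event_id, name in event_names.items()
        ],
    }


# Illustrative mapping only; extend it with the IDs your volumes actually emit
processor = category_processor({"4660": "Object Delete", "4663": "Object Access"})
```

The resulting dictionary goes into the pipeline's `processors` list.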

4. Evaluation delay prevents false positives

The 60-second evaluation_delay on monitors accounts for ingestion latency. Without it, you get spurious alerts when log batches arrive slightly out of order.
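For reference, here is a hedged sketch of where `evaluation_delay` sits in a monitor definition. The monitor name, the threshold of 100, and the 5-minute window are placeholder values, and `build_log_monitor` is a hypothetical helper.

```python
def build_log_monitor(name, search, threshold, window="5m", delay_s=60):
    """Build a log monitor payload with an evaluation delay in seconds."""
    escaped = search.replace('"', '\\"')  # quotes inside logs("...") must be escaped
    query = f'logs("{escaped}").index("*").rollup("count").last("{window}") > {threshold}'
    return {
        "name": name,
        "type": "log alert",
        "query": query,
        "message": "Investigate with the matching Saved View.",
        "options": {
            "evaluation_delay": delay_s,  # wait for late-arriving batches
            "thresholds": {"critical": threshold},
        },
    }


monitor = build_log_monitor(
    "FSxN: Mass File Deletion (example)",
    'source:fsxn @operation_name:"Object Delete"',
    threshold=100,
)
```

The payload would be posted to the monitor creation endpoint (`/api/v1/monitor`) with the same API and application key headers used for the pipeline.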

5. Saved Views are investigation shortcuts

When a monitor fires at 3 AM, the on-call engineer needs to investigate immediately. Pre-built Saved Views with the right columns and filters eliminate setup time during incidents.

6. Anomaly detection needs baseline data

The anomaly monitor requires at least 2 weeks of historical data to build a reliable baseline. Deploy it early — it won't generate false positives during the learning period (it simply stays silent until confident).

7. Cloud SIEM and Log Monitors are complementary

Log Monitors give you fast Slack/PagerDuty notifications for the on-call team. Cloud SIEM Detection Rules give SOC analysts a structured triage workflow with MITRE mapping. Deploy both — they serve different audiences.

8. GeoIP is free threat intelligence

The GeoIP Processor is zero-cost and instantly useful. Even in a pure on-premises environment, unusual geography flags VPN misconfiguration, compromised credentials, or travel you weren't expecting.

9. Forwarder deployment enables an ecosystem

Once the Datadog Forwarder Lambda is deployed, enabling CloudTrail, GuardDuty, Lambda logs, or any other AWS log source is a single S3 notification configuration — no new infrastructure needed.

CloudTrail Correlation (Cross-Service Detection)

A dedicated detection rule correlates FSxN audit events with AWS CloudTrail context. When the signal fires, the investigation message includes pre-built queries for cross-service correlation:

source:cloudtrail @userIdentity.arn:*<username>*     → IAM actions
source:guardduty                                      → GuardDuty findings
source:vpc @network.client.ip:<suspicious-ip>         → Network flows

The rule (FSxN: Suspicious Deletion After IAM Role Assumption) fires on >30 deletions in 10 minutes and embeds a correlation checklist:

Was there a recent AssumeRole or ConsoleLogin from an unusual IP?
Did the user's permissions change in the last 24 hours?
Is the client IP in the corporate VPN range?
Are there GuardDuty findings for this account?

Note: This rule activates fully once CloudTrail logs are flowing to Datadog via the AWS Integration. The FSxN detection works immediately; CloudTrail queries return results after integration setup.
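A sketch of how such a rule could be expressed as a Cloud SIEM rule payload, under stated assumptions: the grouping by `@usr.id`, the `keepAlive` and `maxSignalDuration` values, and the message text are illustrative choices, not the repository's actual definition. Only the rule name, the threshold (more than 30), and the 10-minute window come from the description above.

```python
CHECKLIST = [
    "Was there a recent AssumeRole or ConsoleLogin from an unusual IP?",
    "Did the user's permissions change in the last 24 hours?",
    "Is the client IP in the corporate VPN range?",
    "Are there GuardDuty findings for this account?",
]


def build_detection_rule(name, query, threshold, window_s, status="critical"):
    """Build a log-detection rule payload that embeds the correlation checklist."""
    message = "Correlation checklist:\n" + "\n".join(f"- {item}" for item in CHECKLIST)
    return {
        "name": name,
        "type": "log_detection",
        "isEnabled": True,
        "queries": [{
            "name": "a",
            "query": query,
            "aggregation": "count",
            "groupByFields": ["@usr.id"],  # assumed grouping: one signal per user
        }],
        "cases": [{"name": "mass deletion", "status": status, "condition": f"a > {threshold}"}],
        "options": {"evaluationWindow": window_s, "keepAlive": 3600, "maxSignalDuration": 86400},
        "message": message,
        "tags": ["source:fsxn"],
    }


rule = build_detection_rule(
    "FSxN: Suspicious Deletion After IAM Role Assumption",
    'source:fsxn @operation_name:"Object Delete"',
    threshold=30,
    window_s=600,
)
```

Passing `status="low"` or `"info"` during the tuning period keeps the same rule from paging anyone before its thresholds are validated.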

SOC Triage Runbook

A 7-step Notebook guides analysts from signal to resolution:

Step	Action	Output
1. Signal Context	Verify user, IP, time, volume	Timeseries widget
2. User Verification	Check AD status, role changes, maintenance windows	Manual checklist
3. Impact Assessment	Identify affected files and directories	Top List (paths)
4. Cross-Service Correlation	CloudTrail, GuardDuty, VPC Flow Logs	Query templates
5. Decision Matrix	Map conditions to response actions	Decision table
6. Response Actions	Disable account, snapshot, notify, ticket	Checklist
7. Post-Incident	Update thresholds, document, lessons learned	Checklist

Accessible at: Notebooks → "FSxN Security Signal Triage Runbook"

Case Management

Cases provide structured investigation tracking across the SOC team:

Project: FSXN — All FSx for ONTAP security investigations
Case creation: Manual from Security Signals, or auto-create via Workflow
Priority levels: P1 (critical mass deletion) through P4 (informational)
Lifecycle: Open → In Progress → Resolved → Closed

When a critical signal fires, the workflow creates a case automatically with:

Signal link and context
Affected user and IP
Investigation notebook link
Response checklist

Threat Intelligence (GeoIP Enrichment)

The Log Pipeline's GeoIP Processor automatically enriches every FSxN log with geographic data from the client_ip field:

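@client_ip: <public-ip> → @network.client.geoip: { country: { iso_code: "JP", name: "Japan" }, city: { name: "Tokyo" }, ... }

Private (RFC 1918) addresses such as 10.0.5.99 have no GeoIP entry, so logs from internal clients carry no @network.client.geoip attribute.

A detection rule flags access from unexpected countries:

# Rule: "FSxN: File Access from Unusual Geography"
# Query: source:fsxn @network.client.geoip.country.iso_code:* -@network.client.geoip.country.iso_code:JP -@network.client.geoip.country.iso_code:US
# Threshold: >5 events/15min from non-JP/US IPs
# The leading iso_code:* term requires a resolved country, so internal clients without GeoIP data do not match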

This catches compromised credentials being used from abroad without needing an external threat intelligence feed — Datadog's built-in GeoIP database handles the enrichment at ingest time.
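If the allowed-country list changes often, the query can be generated rather than hand-edited. The helper below is a small sketch; it assumes the country code is stored under `@network.client.geoip.country.iso_code` (adjust `attr` if your pipeline writes it elsewhere), and `unusual_geo_query` is a hypothetical name.

```python
def unusual_geo_query(allowed, source="fsxn",
                      attr="@network.client.geoip.country.iso_code"):
    """Build a query matching logs with a resolved country outside `allowed`."""
    # Require the attribute to exist so private IPs (no GeoIP data) are ignored
    parts = [f"source:{source}", f"{attr}:*"]
    # Exclude each approved country; sorted for a stable, reviewable diff
    parts += [f"-{attr}:{code}" for code in sorted(allowed)]
    return " ".join(parts)


query = unusual_geo_query({"JP", "US"})
```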

GuardDuty Correlation

A detection rule correlates FSxN deletion events with GuardDuty findings from the same source IP:

FSxN: >10 deletions/30min from IP X
  + GuardDuty finding for IP X (UnauthorizedAccess, Recon, Trojan)
  = Critical Security Signal with full context

The rule's investigation message includes pre-built GuardDuty queries:

source:guardduty @detail.resource.instanceDetails.networkInterfaces.privateIpAddress:{{@client_ip}}

GuardDuty findings flow to Datadog via the same Forwarder Lambda that handles CloudTrail — no additional setup needed.
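Conceptually, the correlation is a join on source IP. Datadog evaluates it inside Cloud SIEM; the plain-Python sketch below only illustrates the condition, using hypothetical event dictionaries with a `client_ip` key.

```python
from collections import Counter


def correlated_ips(deletion_events, guardduty_ips, threshold=10):
    """Return IPs with more than `threshold` deletions that also have a GuardDuty finding."""
    counts = Counter(event["client_ip"] for event in deletion_events)
    return sorted(ip for ip, n in counts.items() if n > threshold and ip in guardduty_ips)


# Documentation-range addresses (RFC 5737) stand in for real client IPs
events = (
    [{"client_ip": "198.51.100.7"}] * 11    # noisy and has a finding: flagged
    + [{"client_ip": "198.51.100.8"}] * 11  # noisy but no finding: ignored
    + [{"client_ip": "198.51.100.9"}] * 3   # finding but below threshold: ignored
)
flagged = correlated_ips(events, {"198.51.100.7", "198.51.100.9"})
```

Only the address that meets both conditions is flagged, which is what keeps this rule quieter than either signal alone.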

Automated Snapshot Remediation

When mass deletion is confirmed (a Critical signal reviewed by an analyst), the Workflow invokes the fsxn-snapshot-remediation Lambda:

Security Signal (Critical) → Analyst confirms → Workflow → Lambda → ONTAP REST API → Snapshot

The Lambda:

Receives volume name, SVM, and reason from the Workflow
Authenticates to ONTAP via Secrets Manager credentials
Creates a timestamped snapshot: remediation_20260614_215530_mass_deletion
Returns snapshot name and status for the Case record

# Lambda invocation payload (from Datadog Workflow)
{
    "volume_name": "finance_share",
    "svm_name": "ProductionSVM",
    "reason": "Mass deletion detected",
    "user": "CORP\\suspicious-user"
}
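The first steps of the Lambda, validating the payload and deriving the snapshot name, could be sketched like this. `validate_payload` and `snapshot_name` are hypothetical helpers; the UTC timestamp and the way the trailing label is slugified are assumptions inferred from the example name above.

```python
import re
from datetime import datetime, timezone

REQUIRED_KEYS = ("volume_name", "svm_name", "reason")


def validate_payload(event):
    """Reject Workflow payloads that lack the fields the Lambda needs."""
    missing = [key for key in REQUIRED_KEYS if not event.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return event


def snapshot_name(label, now=None):
    """Return remediation_<YYYYMMDD>_<HHMMSS>_<label> with a filesystem-safe label."""
    now = now or datetime.now(timezone.utc)
    # Lowercase and collapse anything that is not a letter or digit into "_"
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return f"remediation_{now:%Y%m%d_%H%M%S}_{slug}"
```

For example, `snapshot_name("Mass Deletion", datetime(2026, 6, 14, 21, 55, 30, tzinfo=timezone.utc))` yields `remediation_20260614_215530_mass_deletion`.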

Configuration: Set ONTAP_MGMT_IP and ONTAP_CREDENTIALS_SECRET_ARN environment variables on the Lambda to point to your FSx for ONTAP management endpoint.

TLS note: The Lambda uses cert_reqs="CERT_NONE" for ONTAP REST API calls because FSx for ONTAP uses self-signed certificates by default. In production, upload the ONTAP CA certificate to the Lambda layer (/opt/certs/ontap-ca.pem) and configure urllib3.PoolManager(ca_certs="/opt/certs/ontap-ca.pem") to validate the connection.
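One way to make the switch between the two modes explicit is to derive the TLS settings from whether the CA bundle is present, and to fail loudly unless insecure mode is requested on purpose. `tls_pool_kwargs` and the `allow_insecure` flag are hypothetical names for this sketch.

```python
import os


def tls_pool_kwargs(ca_path="/opt/certs/ontap-ca.pem", allow_insecure=False):
    """Return keyword arguments for urllib3.PoolManager based on the CA bundle."""
    if os.path.isfile(ca_path):
        # CA bundle shipped in the layer: verify the ONTAP certificate
        return {"cert_reqs": "CERT_REQUIRED", "ca_certs": ca_path}
    if allow_insecure:
        # Explicit opt-in only, e.g. for a short-lived lab environment
        return {"cert_reqs": "CERT_NONE"}
    raise FileNotFoundError(f"ONTAP CA certificate not found at {ca_path}")
```

The Lambda would then create its client with `urllib3.PoolManager(**tls_pool_kwargs())`. Note that when connecting by management IP, hostname verification also requires the certificate to list that IP as a subject alternative name.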

Snapshot Storm Prevention (Cooldown)

To prevent runaway Snapshot creation during sustained mass-deletion events:

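# Before creating a snapshot, check for recent remediation snapshots.
# The query string is URL-encoded (the original contained a raw space), and
# fields=create_time is requested explicitly because ONTAP collection GETs
# return only name and uuid by default.
from urllib.parse import quote, urlencode

params = urlencode({
    "name": "remediation_*",
    "fields": "create_time",
    "order_by": "create_time desc",
    "max_records": 1,
}, quote_via=quote)
existing = http.request("GET",
    f"https://{mgmt_ip}/api/storage/volumes/{vol_uuid}/snapshots?{params}",
    headers=base_headers)

if existing.status == 200:
    snaps = json.loads(existing.data).get("records", [])
    if snaps:
        last_snap_time = snaps[0].get("create_time", "")
        # Skip if a remediation snapshot was created in the last 15 minutes
        if is_within_cooldown(last_snap_time, minutes=15):
            return {"statusCode": 200, "body": "Skipped: cooldown active"}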
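The snippet above calls `is_within_cooldown` without showing it. A possible implementation, assuming ONTAP reports `create_time` as an ISO 8601 timestamp with either a `Z` suffix or a numeric UTC offset:

```python
from datetime import datetime, timedelta, timezone


def is_within_cooldown(create_time, minutes=15, now=None):
    """Return True if create_time is less than `minutes` before now."""
    if not create_time:
        return False
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11
        ts = datetime.fromisoformat(create_time.replace("Z", "+00:00"))
    except ValueError:
        return False  # unparseable timestamp: fail open and allow the snapshot
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    now = now or datetime.now(timezone.utc)
    # A timestamp slightly in the future (clock skew) also counts as recent
    return now - ts < timedelta(minutes=minutes)
```

Failing open on a bad timestamp is a deliberate choice in this sketch: during an incident, an extra snapshot is cheaper than a missing one.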

ONTAP limits: Each volume supports up to 1023 Snapshots. The cooldown prevents hitting this limit during sustained attack scenarios.

Remediation Audit Trail

The Lambda invocation itself must be auditable — proving that the Snapshot was created by an authorized pipeline, not a rogue actor:

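CloudTrail: Lambda Invoke recorded as a data event (data event logging for Lambda must be enabled on the trail), with userIdentity identifying the IAM principal the Datadog Workflow used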
ONTAP audit log: Snapshot creation appears as an administrative event with the API user
Datadog Case: Snapshot name and status recorded in the Case timeline
Lambda CloudWatch Logs: Full request/response logged with correlation ID

What's Next

This article covered the complete detection-to-response lifecycle. Future enhancements:

Datadog Threat Intel feeds — When Datadog makes the Threat Intel Indicators API available on AP1, feed internal IP reputation lists for richer enrichment
Multi-SVM snapshot orchestration — Extend the remediation Lambda to snapshot across multiple SVMs in parallel
Automated account lockout — Chain the Workflow to invoke AD lockout via Systems Manager Run Command

GitHub Repository