TL;DR
For security teams: Four threshold monitors, one ML anomaly monitor, and four Cloud SIEM detection rules catch mass file deletion, data exfiltration, permission tampering, and unusual geography in under 5 minutes. EventID codes become human-readable operation names. The pipeline, monitors, and detection rules are created via the Datadog API; only Facets and the initial Cloud SIEM enablement are done in the UI.
For engineers: The Log Pipeline (6 processors including GeoIP enrichment) transforms raw ONTAP XML events into searchable, alertable fields. Five Saved Views cover the most common investigation patterns. Total AWS cost remains ~$1.50/month.
For architects: This pattern demonstrates the full detection-to-response lifecycle — from raw file system audit events through enrichment, categorization, alerting, cross-service correlation, and automated remediation — entirely serverless, entirely as code.
FSx for ONTAP → S3 Access Point → Lambda → Datadog Logs API v2
                             │
                             ▼
                  ┌──────────────────────┐
                  │   Log Pipeline (6)   │
                  │  • Category Processor│
                  │  • Status Remapper   │
                  │  • Date Remapper     │
                  │  • Attribute Remapper│
                  │  • GeoIP Enrichment  │
                  └──────────┬───────────┘
                             │
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Log Monitors (4)│ │ Cloud SIEM (4)  │ │ Log Archive     │
│ + Anomaly (1)   │ │ Security Signals│ │ → S3 → Glacier  │
└────────┬────────┘ └────────┬────────┘ └─────────────────┘
         │                   │
         ▼                   ▼
┌──────────────────────────────────────┐
│   Workflow → Slack + Case + Lambda   │
│          (ONTAP Snapshot remediation)│
└──────────────────────────────────────┘
Reading guide: Sections 1-9 cover the core pipeline (30 min to deploy). Sections 10+ cover advanced SOC integration (additional 15 min). Jump to Deployment if you want to start immediately.
Business value: This pipeline reduces mean-time-to-detect (MTTD) for file-based threats from "hours/days" (batch log review) to under 5 minutes (real-time alerting). For regulated environments, it provides continuous compliance evidence with automated archival — replacing manual quarterly reviews with always-on monitoring.
This is Part 16 of Serverless Observability for FSx for ONTAP.
Why Post-Ingestion Processing Matters
In Part 1, we shipped raw audit events to Datadog. That's necessary but insufficient. Raw events look like this:
{
  "event_type": "4660",
  "user": "CORP\\contractor-ext-03",
  "path": "/share/engineering/designs/prototype-v3.dwg",
  "result": "Audit Failure"
}
Questions that raw logs can't answer quickly:
- What is EventID 4660? (Answer: Object Delete)
- Should this alarm? (Answer: depends on volume and context)
- Who should investigate? (Answer: Storage team + SOC)
- What's the user's baseline? (Answer: need faceted historical view)
This article builds the processing layers that transform raw data into actionable security intelligence.
Architecture
Why EventBridge Scheduler? FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge object-level events. Lambda polls on a schedule and uses checkpointing to process only new files. XML format is chosen for Lambda-native parsing without binary dependencies (vs EVTX which requires Windows-specific libraries).
┌──────────────────────────────────────────────────────────────────┐
│  Lambda Handler (Python 3.12)                                    │
│  • Parse XML → Normalize → HEC-compatible JSON                   │
│  • POST to Datadog Logs API v2                                   │
│  • Fields: event_type, user, path, client_ip, svm, result        │
└──────────────────────────────────────────────────────────────────┘
                                 │ HTTP 202
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Datadog Log Pipeline: "FSx for ONTAP Audit Logs"                │
│  Filter: source:fsxn                                             │
│                                                                  │
│  1. Category Processor → @operation_name                         │
│     4663→Object Access, 4660→Object Delete, 4656→Handle Request  │
│                                                                  │
│  2. Status Remapper → log severity from @result                  │
│     "Audit Success" → info, "Audit Failure" → error              │
│                                                                  │
│  3. Date Remapper → @timestamp as official log time              │
│                                                                  │
│  4. Attribute Remapper → @user→usr.id, @client_ip→network.client │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Datadog Security Monitors (3)                                   │
│                                                                  │
│  • [FSxN] Mass File Deletion     → >50 deletes/5min per user     │
│  • [FSxN] Abnormal Access Volume → >1000 accesses/1h per user    │
│  • [FSxN] Access Failure Spike   → >10 failures/15min per user   │
└──────────────────────────────────────────────────────────────────┘
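Because the Lambda polls instead of reacting to object events, each run has to decide which audit files are new. The following is a minimal sketch of that checkpoint decision, not the article's actual handler: the function names and the `(last_modified, key)` checkpoint shape are hypothetical, and the input dicts only borrow the `Key` / `LastModified` field names used by S3 list responses.

```python
from datetime import datetime, timezone

def select_new_objects(objects, checkpoint):
    """Return objects strictly newer than the checkpoint, oldest first.

    objects: list of dicts with "Key" (str) and "LastModified" (datetime).
    checkpoint: (last_modified, key) of the last processed object, or None.
    """
    ordered = sorted(objects, key=lambda o: (o["LastModified"], o["Key"]))
    if checkpoint is None:
        return ordered  # First run: everything is new.
    return [o for o in ordered if (o["LastModified"], o["Key"]) > checkpoint]

def next_checkpoint(processed, checkpoint):
    """Advance the checkpoint to the newest processed object, if any."""
    if not processed:
        return checkpoint
    last = processed[-1]
    return (last["LastModified"], last["Key"])
```

Comparing on the `(timestamp, key)` pair rather than the timestamp alone avoids skipping files that share the same modification time. Where the checkpoint is persisted (for example a parameter store or a marker object) is left open here.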
The Log Pipeline
Why a Pipeline?
Without a pipeline, every query requires remembering that 4660 means "delete" and 4663 means "access." With a pipeline, you search @operation_name:"Object Delete" (quoted, because the value contains a space) and Datadog handles the translation at ingest time.
Creating the Pipeline via API
import json

import boto3
import urllib3

# The API key ships logs; the APP key manages configuration. Both come from Secrets Manager.
sm = boto3.client('secretsmanager', region_name='ap-northeast-1')
api_key = sm.get_secret_value(SecretId='fsxn-datadog-api-key')['SecretString']
app_key = sm.get_secret_value(SecretId='datadog/fsxn-app-key')['SecretString']
http = urllib3.PoolManager()

pipeline = {
    "name": "FSx for ONTAP Audit Logs",
    "is_enabled": True,
    "filter": {"query": "source:fsxn"},
    "processors": [
        {
            # 1. Translate numeric EventIDs into readable operation names.
            "type": "category-processor",
            "name": "EventID to Operation Name",
            "is_enabled": True,
            "target": "operation_name",
            "categories": [
                {"filter": {"query": "@event_type:4663"}, "name": "Object Access"},
                {"filter": {"query": "@event_type:4656"}, "name": "Handle Request"},
                {"filter": {"query": "@event_type:4660"}, "name": "Object Delete"},
                {"filter": {"query": "@event_type:4670"}, "name": "Permission Change"},
                {"filter": {"query": "@event_type:4658"}, "name": "Handle Close"},
                {"filter": {"query": "@event_type:5140"}, "name": "Share Access"},
                {"filter": {"query": "@event_type:5145"}, "name": "Share Check"},
                {"filter": {"query": "@event_type:4624"}, "name": "Logon"},
                {"filter": {"query": "@event_type:4634"}, "name": "Logoff"},
            ]
        },
        {
            # 2. Derive the log status from the audit result.
            # The remapper matches on value prefixes, so confirm in Log Explorer that
            # "Audit Success" / "Audit Failure" resolve to the statuses you expect.
            "type": "status-remapper",
            "name": "Map result to log status",
            "is_enabled": True,
            "sources": ["result"]
        },
        {
            # 3. Use the event's own timestamp instead of ingest time.
            "type": "date-remapper",
            "name": "Use event timestamp",
            "is_enabled": True,
            "sources": ["timestamp"]
        },
        {
            # 4a. Copy user into the standard usr.id attribute (source is preserved).
            "type": "attribute-remapper",
            "name": "Map user to usr.id",
            "is_enabled": True,
            "sources": ["user"],
            "source_type": "attribute",
            "target": "usr.id",
            "target_type": "attribute",
            "preserve_source": True,
            "override_on_conflict": False
        },
        {
            # 4b. Copy client_ip into the standard network.client.ip attribute.
            "type": "attribute-remapper",
            "name": "Map client_ip to network.client.ip",
            "is_enabled": True,
            "sources": ["client_ip"],
            "source_type": "attribute",
            "target": "network.client.ip",
            "target_type": "attribute",
            "preserve_source": True,
            "override_on_conflict": False
        }
    ]
}

resp = http.request(
    "POST",
    "https://api.ap1.datadoghq.com/api/v1/logs/config/pipelines",
    body=json.dumps(pipeline).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key
    })

# On failure the response carries an error body instead of an "id", so use .get().
result = json.loads(resp.data)
print(f"HTTP {resp.status}: Pipeline ID = {result.get('id')}")
Site note: Replace api.ap1.datadoghq.com with your Datadog site's API endpoint (e.g., api.datadoghq.com for US1, api.datadoghq.eu for EU1).
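To avoid hard-coding the endpoint, the base URL can be derived from the site name (the same value the deployment step passes as DD_SITE). A small sketch; `datadog_api_base` and `PIPELINES_PATH` are illustrative names, not part of the article's scripts:

```python
PIPELINES_PATH = "/api/v1/logs/config/pipelines"

def datadog_api_base(site: str) -> str:
    """Build the API base URL from a site name such as 'ap1.datadoghq.com'."""
    site = site.strip().lower().rstrip("/")
    if site.startswith("https://"):
        site = site[len("https://"):]
    if site.startswith("api."):
        site = site[len("api."):]  # Tolerate a value that already has the prefix.
    return f"https://api.{site}"

# Example: the pipeline endpoint for the AP1 site used in this article.
pipelines_url = datadog_api_base("ap1.datadoghq.com") + PIPELINES_PATH
```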
EventID Mapping Table
| EventID | Operation Name | MITRE ATT&CK | Description |
|---|---|---|---|
| 4663 | Object Access | T1005 (Data from Local System) | File read/write operation |
| 4656 | Handle Request | T1005 | File handle opened |
| 4660 | Object Delete | T1485 (Data Destruction) | File deleted |
| 4670 | Permission Change | T1222 (File Permissions Modification) | ACL/permission modified |
| 4658 | Handle Close | — | File handle closed (low signal) |
| 5140 | Share Access | T1021.002 (SMB/Windows Admin) | SMB share connected |
| 5145 | Share Check | T1135 (Network Share Discovery) | Share permission checked |
| 4624 | Logon | T1078 (Valid Accounts) | Authentication event |
| 4634 | Logoff | — | Session ended |
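The same table can be kept as a lookup in code, which is handy for unit-testing the mapping or for tagging events before they leave the Lambda. This is a sketch only: in the article the translation happens in the Datadog Category Processor, and the `mitre_technique` field name here is a hypothetical choice.

```python
# EventID -> (operation name, MITRE ATT&CK technique or None), mirroring the table.
EVENT_CATALOG = {
    "4663": ("Object Access", "T1005"),
    "4656": ("Handle Request", "T1005"),
    "4660": ("Object Delete", "T1485"),
    "4670": ("Permission Change", "T1222"),
    "4658": ("Handle Close", None),
    "5140": ("Share Access", "T1021.002"),
    "5145": ("Share Check", "T1135"),
    "4624": ("Logon", "T1078"),
    "4634": ("Logoff", None),
}

def describe_event(event: dict) -> dict:
    """Return a copy of the event with operation_name (and technique, if any) added."""
    name, technique = EVENT_CATALOG.get(str(event.get("event_type")), ("Unknown", None))
    enriched = dict(event)
    enriched["operation_name"] = name
    if technique:
        enriched["mitre_technique"] = technique
    return enriched
```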
Security Monitors
Monitor 1: Mass File Deletion
mass_delete = {
"name": "[FSxN] Mass File Deletion Detected",
"type": "log alert",
"query": 'logs("source:fsxn @event_type:4660").index("*").rollup("count").by("@user").last("5m") > 50',
"message": """## Mass File Deletion Alert
A user has triggered more than 50 file deletion events within 5 minutes.
**User**: {{@user.name}} | **Count**: {{value}}
### Investigation Steps
1. Check affected paths: `source:fsxn @event_type:4660 @user:{{@user.name}}`
2. Verify if this is a scheduled cleanup or authorized bulk operation
3. Check the client IP for unexpected sources
4. Correlate with user's normal deletion patterns
@slack-storage-alerts""",
"tags": ["source:fsxn", "team:storage", "severity:high"],
"options": {
"thresholds": {"critical": 50, "warning": 20},
"notify_no_data": False,
"renotify_interval": 60,
"evaluation_delay": 60
}
}
Why 50? In our test environment, normal daily deletions per user stay under 10. Tune the threshold to your own baseline: in Log Explorer, group the query source:fsxn @event_type:4660 by @user over a one-week window to see what normal looks like.
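Once a week of per-user daily delete counts has been exported, a threshold can be derived from them rather than guessed. The heuristic below (nearest-rank percentile times a headroom factor) is one possible approach and an assumption of this sketch, not the method used to arrive at 50. Because a daily total is an upper bound on any 5-minute window, the result errs on the conservative side.

```python
import math

def suggest_threshold(daily_counts, percentile=0.99, headroom=2.0, floor=10):
    """Suggest an alert threshold from observed per-user daily delete counts."""
    if not daily_counts:
        return floor
    ordered = sorted(daily_counts)
    # Nearest-rank percentile: the smallest sample covering `percentile` of the data.
    rank = max(1, math.ceil(percentile * len(ordered)))
    baseline = ordered[rank - 1]
    return max(floor, math.ceil(baseline * headroom))
```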
Monitor 2: Abnormal Access Volume
abnormal_access = {
"name": "[FSxN] Abnormal Access Volume",
"type": "log alert",
"query": 'logs("source:fsxn @result:\\"Audit Success\\"").index("*").rollup("count").by("@user").last("1h") > 1000',
"message": """## Abnormal Access Volume Alert
More than 1000 successful file access events in 1 hour.
May indicate data exfiltration or unauthorized bulk access.
**User**: {{@user.name}} | **Count**: {{value}}
### Investigation Steps
1. Review patterns: `source:fsxn @user:{{@user.name}}`
2. Check if backup jobs or batch operations are running
3. Verify client IP and compare with known locations
4. Check accessed paths for sensitive content
@slack-security-alerts""",
"options": {
"thresholds": {"critical": 1000, "warning": 500},
"evaluation_delay": 60
}
}
Tuning note: Backup service accounts (svc-backup, svc-indexer) generate legitimate high-volume access. Exclude them by adding -@user:svc-* to the query (the minus sign goes in front of the attribute), or create a suppression rule.
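If several monitors need the same exclusions, appending them programmatically keeps the queries consistent. A minimal sketch; `exclude_users` is a hypothetical helper, and the only Datadog-specific part is the leading `-` that negates a search term:

```python
def exclude_users(query: str, patterns) -> str:
    """Append negated @user filters (e.g. -@user:svc-*) to a log search query."""
    exclusions = " ".join(f"-@user:{p}" for p in patterns)
    return f"{query} {exclusions}".strip()

# Example: the search part of the Abnormal Access Volume monitor, minus service accounts.
filtered = exclude_users('source:fsxn @result:"Audit Success"', ["svc-*"])
```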
Monitor 3: Access Failure Spike
failure_spike = {
"name": "[FSxN] Access Failure Spike",
"type": "log alert",
"query": 'logs("source:fsxn @result:\\"Audit Failure\\"").index("*").rollup("count").by("@user").last("15m") > 10',
"message": """## Access Failure Spike
More than 10 access failures in 15 minutes.
May indicate unauthorized access attempts or permission misconfiguration.
**User**: {{@user.name}} | **Count**: {{value}}
### Investigation Steps
1. Check failed paths: `source:fsxn @result:"Audit Failure" @user:{{@user.name}}`
2. Verify if permissions were recently changed
3. Check if user is accessing resources outside their scope
4. Correlate with AD group membership changes
@slack-security-alerts""",
"options": {
"thresholds": {"critical": 10, "warning": 5},
"evaluation_delay": 60
}
}
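The monitor definitions above are plain dicts, so submitting them follows the same pattern as the pipeline call, this time against the monitor endpoint (POST /api/v1/monitor). The sketch below separates request building from sending so the former can be tested offline; both function names are illustrative, and `http` is assumed to be an urllib3-style client such as the `PoolManager` used earlier.

```python
import json

def build_monitor_request(monitor: dict, site: str, api_key: str, app_key: str):
    """Return (method, url, headers, body) for creating one monitor."""
    url = f"https://api.{site}/api/v1/monitor"
    headers = {
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }
    return "POST", url, headers, json.dumps(monitor).encode("utf-8")

def create_monitors(http, monitors, site, api_key, app_key):
    """Send each monitor definition; return {monitor name: HTTP status}."""
    results = {}
    for monitor in monitors:
        method, url, headers, body = build_monitor_request(monitor, site, api_key, app_key)
        resp = http.request(method, url, body=body, headers=headers)
        results[monitor["name"]] = resp.status
    return results
```

A call might look like `create_monitors(http, [mass_delete, abnormal_access, failure_spike], "ap1.datadoghq.com", api_key, app_key)`; any status other than 200 deserves a look at the response body.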
Saved Views
Five pre-configured views for common investigation patterns:
| View | Query | When to use |
|---|---|---|
| FSxN File Deletions | source:fsxn @event_type:4660 | Monitor triggered, investigating which files |
| FSxN Access Failures | source:fsxn @result:"Audit Failure" | Permission denied investigation |
| FSxN All Events | source:fsxn | General audit stream |
| FSxN Sensitive Share Access | source:fsxn (@path:*finance* OR @path:*hr* OR @path:*legal*) | Sensitive data access review |
| FSxN After-Hours Access | source:fsxn (filtered by time) | Off-hours activity detection |
Each view includes pre-configured columns (user, path, client_ip, event_type) so analysts don't need to customize the table every time.
Facets for One-Click Filtering
Facets add clickable filters to the left sidebar. Create them from any log entry's detail panel:
| Facet | Field | Why |
|---|---|---|
| Event Type | @event_type | Filter to specific operations (4660=delete, 4663=access) |
| Operation | @operation_name | Human-readable version (after pipeline applies) |
| User | @user | Filter by specific user under investigation |
| SVM | @svm | Isolate events to specific storage virtual machines |
| File Path | @path | Filter by directory or share |
| Client IP | @client_ip | Filter by source workstation/server |
| Result | @result | Quick split between success and failure events |
Note: Fields are searchable with @field:value syntax even without Facets. Facets add UI convenience; they're not required for alerting or saved views.
Cost Reality
| Component | Monthly Cost | Notes |
|---|---|---|
| Lambda (5-min schedule, ~1s execution) | ~$0.50 | 8,640 invocations/month |
| Secrets Manager (2 secrets) | ~$0.80 | API key + APP key |
| EventBridge Scheduler | ~$0.00 | Free tier covers this |
| S3 AP reads | ~$0.05 | Depends on file count |
| Datadog Forwarder Lambda | ~$0.10 | CloudTrail event processing |
| S3 Archive storage | ~$0.02 | Glacier tier after 30 days |
| AWS total | ~$1.50/month | |
| Datadog log ingestion | ~$0.10/GB | Conventional pricing |
| Total (1GB/month) | ~$1.60/month | |
Datadog pricing caveat: Log ingestion pricing varies by plan (Conventional vs On-Demand) and contract. The ~$0.10/GB is the list price for conventional plans; actual cost depends on your contract, committed volume, and retention choices.
Cloud SIEM pricing: Cloud SIEM is a separate paid feature (~$0.20/GB analyzed for log detection). A 30-day free trial is available. Security Signals, Detection Rules, and the triage workflow require Cloud SIEM. Log Monitors (threshold alerts to Slack) work without it.
Cost optimization patterns
- Filter before shipping: drop low-value EventIDs (4658 Handle Close) at the Lambda level to reduce ingestion volume
- Compress payloads: enable gzip on the HTTP POST to reduce egress
- Tune audit scope: configure ONTAP vserver audit to monitor only the shares that matter
- Index retention: use Datadog's flexible retention (15/30/60/90 days) to match compliance needs vs cost
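The first two patterns fit in a few lines of Lambda code. The sketch below drops Handle Close events and gzips the remaining JSON array, returning the headers the POST would need. `build_compressed_payload` and `LOW_VALUE_EVENT_TYPES` are illustrative names, and which EventIDs count as low value is a per-environment decision.

```python
import gzip
import json

LOW_VALUE_EVENT_TYPES = {"4658"}  # Handle Close, marked "low signal" in the mapping table

def build_compressed_payload(events, drop_types=LOW_VALUE_EVENT_TYPES):
    """Drop low-signal events, then gzip the JSON array for the Logs API POST."""
    kept = [e for e in events if str(e.get("event_type")) not in drop_types]
    raw = json.dumps(kept).encode("utf-8")
    headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}
    return gzip.compress(raw), headers, len(kept)
```

Returning the count of kept events makes it easy to emit a "dropped N events" log line, so the filtering stays visible rather than silent.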
E2E Verification Results
Verified on Datadog AP1 (ap1.datadoghq.com) with paid plan, June 2026:
| Component | Status |
|---|---|
| Lambda → Datadog Logs API v2 | ✅ HTTP 202 |
| Log Pipeline (5 processors) | ✅ Applied automatically |
| Category Processor (9 EventIDs) | ✅ @operation_name populated |
| Status Remapper | ✅ Failure events marked as error |
| Date Remapper | ✅ Event timestamp used (not ingest time) |
| Attribute Remapper (usr.id) | ✅ Enables Datadog user features |
| Mass Delete Monitor | ✅ ALERT triggered (55 events > threshold 50) |
| Access Failure Monitor | ✅ ALERT triggered (12 events > threshold 10) |
| Abnormal Access Monitor | ✅ Active (OK — high threshold by design) |
| Anomaly Monitor (ML) | ✅ Active (learning baseline) |
| Saved Views (5) | ✅ Accessible from Views dropdown |
| Facets (8 custom) | ✅ Created via UI |
| Dashboard (10 widgets) | ✅ FSx ONTAP Audit Log Overview |
| Log-based Metrics (4) | ✅ fsxn.audit.* in Metrics Explorer |
| Sensitive Data Scanner (5 rules) | ✅ PII detection active |
| Cloud SIEM Detection Rules (4) | ✅ Security Signals generated |
| CloudTrail Forwarder | ✅ Logs arriving in Datadog |
| GeoIP Enrichment | ✅ Country/city populated on client_ip |
| GuardDuty Correlation Rule | ✅ Cross-service detection active |
| Workflow Automation | ✅ Monitor → Slack + Case |
| Case Management (FSXN project) | ✅ Project + template case |
| SOC Triage Runbook | ✅ 7-step Notebook |
| Log Archive (S3 → Glacier) | ✅ source:fsxn archived |
| Snapshot Remediation Lambda | ✅ Deployed (fsxn-snapshot-remediation) |
Log-based Metrics
Log-based metrics extract numerical values from logs at ingest time — you get metric-resolution dashboards and anomaly detection without paying for log retention beyond the evaluation window.
# Created via Datadog API (no UI needed)
metrics = [
"fsxn.audit.delete_count", # Delete events grouped by user/SVM
"fsxn.audit.access_failure_count", # Failures by user/SVM/client_ip
"fsxn.audit.event_count", # All events by event_type/SVM
"fsxn.audit.unique_users", # Active users
]
Use cases:
- Anomaly detection on fsxn.audit.delete_count: Datadog's built-in anomaly monitor catches unusual deletion patterns without manual thresholds
- SLO tracking: define an SLO on access failure rate (fsxn.audit.access_failure_count / fsxn.audit.event_count)
- Cost efficiency: metrics are retained for 15 months at metric resolution (vs log retention at full-text cost)
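The article lists only the metric names, so the following is a sketch of what one definition could look like as a request body for Datadog's v2 log-based metrics endpoint (POST /api/v2/logs/config/metrics). The filter query and group-by tags are inferred from the inline comments above and should be treated as assumptions; check the body shape against the current API reference before relying on it.

```python
def logs_metric_payload(metric_id, query, group_by):
    """Build the request body for a count-type log-based metric.

    group_by: mapping of log attribute path -> metric tag name.
    """
    return {
        "data": {
            "type": "logs_metrics",
            "id": metric_id,
            "attributes": {
                "compute": {"aggregation_type": "count"},
                "filter": {"query": query},
                "group_by": [
                    {"path": path, "tag_name": tag} for path, tag in group_by.items()
                ],
            },
        }
    }

# Delete events grouped by user and SVM (grouping assumed from the comment above).
delete_metric = logs_metric_payload(
    "fsxn.audit.delete_count",
    "source:fsxn @event_type:4660",
    {"@user": "user", "@svm": "svm"},
)
```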
Sensitive Data Scanner
Audit logs frequently contain PII in file paths (/hr/EMP-123456-performance-review.xlsx) and user contexts. Datadog's Sensitive Data Scanner catches these before they reach analysts.
| Rule | What It Catches | Why It Matters |
|---|---|---|
| Employee ID (EMP-\d{6}) | Employee records in paths | PII exposure prevention |
| JP Phone (0[789]0-\d{4}-\d{4}) | Mobile numbers in filenames | Privacy protection |
| Email in Path | Customer emails in file names | Data minimization |
| Credit Card | CC numbers in documents | PCI-DSS technical control |
| My Number | Japanese national ID | Sensitive PII detection |
Compliance note: Sensitive Data Scanner is a technical detection and redaction control. It does not constitute full APPI/GDPR/PCI-DSS compliance, which requires organizational measures (consent management, purpose limitation, data subject rights, etc.). Consult your legal/compliance team for regulatory assessment.
Matched patterns are partially redacted at ingest — the first 3 characters remain for identification, the rest is replaced with [REDACTED]. This preserves investigation capability while protecting the data subject.
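The partial-redaction behavior is easy to reproduce locally when testing patterns before adding them to the scanner. This sketch reuses the two regexes from the table; it imitates the described outcome (first 3 characters kept, rest masked) and is not the Sensitive Data Scanner itself.

```python
import re

PATTERNS = {
    "employee_id": re.compile(r"EMP-\d{6}"),
    "jp_mobile": re.compile(r"0[789]0-\d{4}-\d{4}"),
}

def partially_redact(text: str, keep: int = 3) -> str:
    """Keep the first `keep` characters of each match and mask the rest."""
    def _mask(match):
        return match.group(0)[:keep] + "[REDACTED]"

    for pattern in PATTERNS.values():
        text = pattern.sub(_mask, text)
    return text
```

For the example path above, `partially_redact("/hr/EMP-123456-performance-review.xlsx")` yields `/hr/EMP[REDACTED]-performance-review.xlsx`: an analyst can still tell an employee record was touched without seeing which one.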
Enhanced Dashboard
The SOC-focused dashboard provides at-a-glance visibility into:
- Who is generating the most activity (top users, top IPs)
- What operations dominate (sunburst by operation/SVM)
- Where access is concentrated (hot paths)
- When failures spike (timeline with per-user breakdown)
- Trend from log-based metrics (delete rate over time)
Cloud SIEM Security Signals
Beyond log monitors (which notify via Slack/PagerDuty), Security Signals integrate with Datadog's SOC workflow — triage, investigation, and response all in one place.
Log Monitors vs Cloud SIEM Detection Rules
| Aspect | Log Monitors | Cloud SIEM Detection Rules |
|---|---|---|
| Output | Alert notification (Slack/PagerDuty/email) | Security Signal (appears in Security Signals panel) |
| Triage | Manual (click alert → investigate) | Built-in triage workflow (Open → In Progress → Archived) |
| MITRE mapping | Manual tags | Native MITRE ATT&CK framework integration |
| Correlation | Single query | Cross-log-source correlation possible |
| Case integration | None | Direct "Create Case" from signal |
| Best for | Operational alerts (DevOps/SRE) | Security investigation (SOC/IR teams) |
Recommendation: Use both. Log Monitors for immediate Slack notification to storage team. Cloud SIEM Detection Rules for SOC triage workflow and MITRE-mapped investigation.
Three detection rules generate Security Signals when FSxN audit logs match threat patterns:
| Rule | MITRE ATT&CK | Trigger | Severity |
|---|---|---|---|
| Mass File Deletion | T1485 Data Destruction | >50 deletes/5min per user | Critical |
| Brute Force File Access | T1110 Brute Force | >20 failures/15min per user+IP | High |
| Permission Tampering | T1222 File Permissions Modification | >5 changes/10min per user | High |
# Creating a Security Signal rule via API
rule = {
"name": "FSxN: Mass File Deletion",
"type": "log_detection",
"isEnabled": True,  # the rule-creation payload expects an explicit enabled flag
"queries": [{
"query": "source:fsxn @event_type:4660",
"groupByFields": ["@user"],
"aggregation": "count",
"name": "delete_events",
"dataSource": "logs"
}],
"cases": [
{"condition": "delete_events > 50", "status": "critical", "name": "Critical"},
{"condition": "delete_events > 20", "status": "high", "name": "High"}
],
"options": {
"evaluationWindow": 300,
"keepAlive": 3600,
"maxSignalDuration": 86400,
"detectionMethod": "threshold"
},
"message": "User {{@user}} deleted {{value}} files in 5 min. T1485.",
"tags": ["source:fsxn", "technique:T1485-data-destruction"]
}
Each signal includes investigation steps and response guidance directly in the signal panel — analysts don't need to leave the Security Signals view to understand what happened.
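Before enabling a threshold rule, it can help to replay sample events through the same logic locally. The sketch below approximates the "more than 50 deletes in 5 minutes per user" condition with a sliding window; Datadog evaluates windows on its own schedule, so treat this as a test aid under that assumption. The `ts` field (epoch seconds) and the function name are hypothetical.

```python
from collections import defaultdict

def users_over_threshold(events, window_seconds=300, threshold=50):
    """Return {user: peak count} for users exceeding the threshold in any window.

    events: iterable of dicts with "user", "event_type" and "ts" (epoch seconds).
    """
    by_user = defaultdict(list)
    for e in events:
        if str(e["event_type"]) == "4660":  # Object Delete only
            by_user[e["user"]].append(e["ts"])

    flagged = {}
    for user, stamps in by_user.items():
        stamps.sort()
        start = 0
        peak = 0
        for end, ts in enumerate(stamps):
            # Advance the left edge until the window spans less than window_seconds.
            while ts - stamps[start] >= window_seconds:
                start += 1
            peak = max(peak, end - start + 1)
        if peak > threshold:
            flagged[user] = peak
    return flagged
```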
Cloud SIEM Setup Steps
- Navigate to Security → Cloud SIEM → Get Started
- Skip Content Packs (FSxN is a custom source)
- Select fsxn as a log source → Enable as Trial
- Configure Cloud SIEM Index (default: 450 days retention)
- Detection rules are automatically applied to incoming source:fsxn logs
Workflow Automation
When a critical monitor fires, the Workflow automatically triggers response actions:
Monitor Alert → @workflow-fsxn-security-alert-response → Slack notification + Investigation links
The workflow (fsxn-security-alert-response) is linked to monitors via the @workflow-<handle> mention syntax in the monitor message. No additional configuration needed — just add the mention to any monitor's notification body.
Extensions (via Datadog Workflow builder):
- Auto-create Jira ticket with investigation context
- Query Active Directory for user's manager
- Invoke fsxn-snapshot-remediation Lambda for evidence preservation (deployed; see Automated Snapshot Remediation)
Investigation Notebook
A pre-built 5-step investigation template guides analysts through alert triage:
| Step | Content | Widget |
|---|---|---|
| 1. Identify Scope | Event type distribution timeline | Timeseries (bars) |
| 2. User Timeline | Delete events by user over time | Timeseries (line) |
| 3. Affected Files | Top 20 deleted paths | Top List |
| 4. Client IP Analysis | Top 10 source IPs | Top List |
| 5. Conclusion | Root cause, impact, action template | Markdown |
Access at: Notebooks → "FSxN Audit Log Investigation Template"
Each step includes pre-configured queries — analysts just adjust the time window and user filter to match the alert context.
RBAC and Access Control
The FSxN Security Analyst role provides scoped access to audit logs:
| Role | Permissions | Use Case |
|---|---|---|
| FSxN Security Analyst | logs_read_data, logs_read_index_data | SOC analysts investigating file access |
| Datadog Standard Role | Full access | Storage admins managing pipeline |
Assign users to this role to grant log access without exposing infrastructure metrics or APM data. Combined with Saved Views, analysts see only the investigation tools relevant to their role.
OTel Collector Bridge
For teams already running OpenTelemetry Collector, an alternative delivery path routes FSxN logs through OTel with trace context injection:
# otel-bridge/collector-config.yaml (excerpt)
receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }
processors:
  resource:
    attributes:
      - key: service.name
        value: ontap-audit
        action: upsert
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["ddsource"], "fsxn")
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE:-ap1.datadoghq.com}
Benefits over direct Lambda→Datadog:
- trace_id injection for distributed tracing correlation
- Multi-backend fanout (Datadog + Grafana + S3 in parallel)
- Attribute enrichment at collector level (team, environment tags)
- Sampling for high-volume environments
- Edge-side PII redaction — Mask sensitive fields before they leave your AWS account (stronger than relying solely on Datadog's Sensitive Data Scanner)
Edge redaction pattern: Use OTel's transform processor to hash user and client_ip and truncate path to directory level BEFORE export to Datadog. This ensures PII never crosses your network boundary: even if Datadog's SDS configuration is misconfigured or disabled, your data is already protected at the source.
Status: Config syntax verified. Integration testing requires deploying OTel Collector on ECS (see Part 7 for the ECS-based Collector pattern used with Grafana and Honeycomb).
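The effect of the edge redaction pattern (hash user and client_ip, truncate path to its directory) can be illustrated in Python, which is also where it would live if applied in the Lambda instead of the Collector. This is a sketch of the transformation, not OTTL configuration; the salt parameter and the 16-character digest prefix are assumptions added for illustration.

```python
import hashlib
import posixpath

def redact_at_edge(event: dict, salt: str) -> dict:
    """Hash user and client_ip, and truncate path to its parent directory."""
    def _hash(value: str) -> str:
        # Salted SHA-256, shortened for readability in the log explorer.
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

    out = dict(event)  # Leave the caller's event untouched.
    for field in ("user", "client_ip"):
        if out.get(field):
            out[field] = _hash(out[field])
    if out.get("path"):
        out["path"] = posixpath.dirname(out["path"])
    return out
```

Hashing is deterministic for a given salt, so per-user grouping in monitors still works on the hashed value; mapping a hash back to a person then requires access to the salt and the directory of known users.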
Log Archives and Compliance
For regulatory retention (adjust based on your organization's audit policy), a dedicated CloudFormation template deploys an S3 archive with Glacier lifecycle:
aws cloudformation deploy \
--template-file integrations/datadog/template-log-archive.yaml \
--stack-name fsxn-datadog-archive \
--parameter-overrides \
DatadogExternalId=<from-datadog-integration-page> \
RetentionDays=30 \
GlacierRetentionDays=2555 \
--capabilities CAPABILITY_NAMED_IAM
This separates concerns:
- Detection — Hot logs in Datadog (15-30 day index retention)
- Compliance — Cold archive in S3 → Glacier (7+ years)
- Investigation — Rehydrate specific time ranges on demand
Anomaly Detection (ML-based)
Beyond static thresholds, the anomaly monitor on fsxn.audit.delete_count uses Datadog's agile algorithm to learn each user's baseline:
# Anomaly monitor — no manual threshold needed
"avg(last_4h):anomalies(
sum:fsxn.audit.delete_count{*} by {user}.as_count(),
'agile', 3, direction='above'
) >= 1"
When a user who normally deletes 5 files/day suddenly deletes 200 over the course of a day, the alert fires even if no single 5-minute window crosses the static threshold of 50. This catches slow, sustained deletion that threshold-based monitors miss.
Baseline period: The anomaly algorithm needs ~2 weeks of data to build confidence. Deploy early — it stays silent during the learning period.
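To build intuition for baseline-relative alerting, here is a deliberately simple stand-in: flag a day whose count exceeds the historical mean by `k` standard deviations, and stay silent until enough history exists. This is not Datadog's agile algorithm (which also models trend and seasonality); the function name and defaults are assumptions for illustration.

```python
import statistics

def is_anomalous(history, today, k=3.0, min_days=14):
    """Flag today's count if it exceeds mean + k * stdev of the history.

    Returns False while fewer than min_days of history exist (learning period).
    """
    if len(history) < min_days:
        return False
    mean = statistics.fmean(history)
    # Floor the spread at 1.0 so a perfectly flat history does not alert on +1.
    spread = max(statistics.pstdev(history), 1.0)
    return today > mean + k * spread
```

With a flat history of 5 deletions/day, 200 is flagged while 6 is not, and nothing is flagged during the first two weeks.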
Cardinality Management
⚠️ Log-based metrics with group_by: user create one time series per unique user. For organizations with 10,000+ users:
| Metric | Recommended group_by | Reason |
|---|---|---|
| fsxn.audit.event_count | svm only | Broad metric, low cardinality |
| fsxn.audit.delete_count | user, svm | Targeted, high signal |
| fsxn.audit.access_failure_count | user, svm, client_ip | Investigation-focused |
| fsxn.audit.unique_users | user | Intentionally per-user |
Monitor cardinality in Metrics Summary (fsxn.audit.*). Datadog bills per unique tag-value combination — keep per-user grouping only on metrics where individual user behavior matters.
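The cardinality of a proposed grouping can be estimated from a sample of events before the metric is created: each distinct combination of tag values becomes one time series. A minimal sketch with a hypothetical helper name:

```python
def count_series(events, tag_fields):
    """Count distinct tag-value combinations, i.e. time series, for one metric."""
    return len({tuple(e.get(f) for f in tag_fields) for e in events})

sample = [
    {"user": "a", "svm": "svm1", "client_ip": "10.0.0.1"},
    {"user": "a", "svm": "svm1", "client_ip": "10.0.0.2"},
    {"user": "b", "svm": "svm1", "client_ip": "10.0.0.1"},
]
by_svm = count_series(sample, ["svm"])                          # 1 series
by_user_svm = count_series(sample, ["user", "svm"])             # 2 series
by_all = count_series(sample, ["user", "svm", "client_ip"])     # 3 series
```

Running this over a day of real events for each candidate grouping shows quickly which tag (usually client_ip) drives the series count.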
What's Different from Splunk/CrowdStrike?
| Aspect | Datadog | Splunk (Part 8) | CrowdStrike (Part 15) |
|---|---|---|---|
| Delivery protocol | Logs API v2 (JSON) | HEC (JSON) | HEC (JSON) |
| Pipeline config | API-driven (Python) | props.conf / UI | LogScale parser YAML |
| Monitor creation | API (POST /api/v1/monitor) | Saved searches + alerts | CQL + alert actions |
| EventID mapping | Category Processor | eval/lookup | LogScale parser |
| Identity correlation | usr.id attribute | user field | Falcon Identity |
| Free tier | ❌ (paid only for logs) | ❌ | ❌ |
| API management | ✅ Full (Pipeline + Monitor + Dashboard) | Partial | Limited |
The key Datadog advantage: everything is API-manageable. Pipeline, monitors, dashboards, and saved views can all be created, updated, and version-controlled through the Datadog API — making infrastructure-as-code for observability fully achievable.
Production Recommendations
Use the API, not the UI. Version-control your Pipeline, Monitor, and Dashboard definitions in a setup script. When you need to adjust thresholds or add EventIDs, it's a code change with review.
Separate APP key from API key. The API key ships logs; the APP key manages configuration. Store both in Secrets Manager with different IAM policies.
Exclude service accounts. Add -@user:svc-* to monitor queries or create Datadog suppression rules for known batch accounts.
Start with Warning, promote to Critical. Deploy monitors with Warning thresholds first, observe for 1-2 weeks, then tighten to Critical after confirming the baseline.
Connect to incident workflow. Route Critical monitors to PagerDuty/Slack and include investigation links (Saved View URLs) in the alert message.
Enable Log Archives for compliance. Configure S3 archive with Glacier lifecycle for FISC/SOC2 retention requirements. Rehydrate on demand for historical investigation.
Use Workflow Automation for response. Configure Datadog Workflows to automatically create Jira tickets, trigger Slack quick-actions, or invoke Lambda remediation when critical monitors fire.
Manage with Terraform for production. For enterprise/multi-org deployments, use the Datadog Terraform Provider to version-control Pipelines, Monitors, and Detection Rules. Separate API key (logs_write only) from APP key (admin-level, CI/CD pipeline only).
Apply IAM Permissions Boundary. In large organizations, apply a Permissions Boundary to the log-shipping Lambda role to prevent privilege escalation. The boundary should allow only s3:GetObject, s3:ListBucket, secretsmanager:GetSecretValue, and logs:* (CloudWatch).
Encrypt with customer-managed keys (CMK). For regulated environments, enable Datadog Log Management Encryption with your KMS CMK, and configure the S3 archive bucket with SSE-KMS. This ensures audit logs are encrypted with keys you control at every stage.
Tune detection rules (anti-alert-fatigue). Run all new Cloud SIEM rules in "Warning / dev" status for 2 weeks. Allowlist known service accounts with -@usr.id:svc-* in queries. Review false positives weekly and adjust thresholds or add suppression rules before promoting to Critical.
Data integrity: handle HTTP 429/5xx. Datadog Logs API v2 returns HTTP 202 (accepted, not indexed). On HTTP 429 (rate limit) or 5xx (transient), implement exponential backoff (base 1s, max 5 retries). After final failure, route the complete HEC payload to SQS DLQ for replay. Never drop logs silently.
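The retry recommendation (base 1s, max 5 retries, then dead-letter) can be sketched with the transport and the queue injected as callables, which keeps the logic testable without network access. The function name and signature are hypothetical; in the Lambda, `send` would wrap the Logs API POST and `dead_letter` would wrap an SQS send.

```python
import time

def send_with_backoff(send, payload, dead_letter,
                      max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) -> HTTP status; retry on 429/5xx, then dead-letter.

    Returns True if the payload was accepted (HTTP 202), False if dead-lettered.
    """
    for attempt in range(max_retries + 1):  # 1 initial try + max_retries retries
        status = send(payload)
        if status == 202:
            return True
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            break  # e.g. 400/403: retrying will not help
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s
    dead_letter(payload)  # Never drop the payload silently.
    return False
```

Note that the worst-case wait is 31 seconds, so the Lambda timeout needs to accommodate it; adding jitter is a common refinement that is omitted here for brevity.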
Deployment
# 1. Deploy the CloudFormation stack (Lambda + Scheduler + DLQ + Alarms)
aws cloudformation deploy \
--template-file integrations/datadog/template.yaml \
--stack-name fsxn-datadog-integration \
--parameter-overrides \
FsxS3AccessPointArn=arn:aws:s3:<region>:<account>:accesspoint/<name> \
DatadogApiKeySecretArn=<secret-arn> \
DatadogSite=ap1.datadoghq.com \
--capabilities CAPABILITY_NAMED_IAM
# 2. Deploy full observability (Pipeline + Monitors + Metrics + Scanner)
export DD_API_KEY_SECRET_ID="fsxn-datadog-api-key"
export DD_APP_KEY_SECRET_ID="datadog/fsxn-app-key"
export DD_SITE="ap1.datadoghq.com"
bash integrations/datadog/scripts/setup-full-observability.sh
# 3. (Optional) Deploy Log Archive for compliance
aws cloudformation deploy \
--template-file integrations/datadog/template-log-archive.yaml \
--stack-name fsxn-datadog-archive \
--parameter-overrides DatadogExternalId=<external-id> \
--capabilities CAPABILITY_NAMED_IAM
# 4. Create Facets (guided manual step)
bash integrations/datadog/scripts/setup-facets.sh
# 5. Enable Cloud SIEM (UI: Security → Cloud SIEM → Get Started → select fsxn)
Time from zero to full detection capability: ~45 minutes.
Pre-Deployment Checklist
Before deploying to production, confirm:
- [ ] FSx audit configuration confirmed (XML format, rotation schedule)
- [ ] S3 Access Point created and Lambda role authorized
- [ ] Datadog API Key + APP Key stored in Secrets Manager
- [ ] Datadog site region confirmed (AP1/US1/EU1)
- [ ] Data classification sign-off: audit logs contain user PII (usernames, IPs, file paths) — confirm external transmission is approved per organization policy
- [ ] Retention requirements defined (Datadog index retention + S3 archive lifecycle)
- [ ] On-call/escalation path defined for Critical signals
- [ ] Service accounts identified for monitor exclusion (svc-*)
- [ ] Cloud SIEM trial or paid plan confirmed (for Security Signals)
- [ ] Network path validated (VPC-external Lambda or NAT Gateway for S3 AP access)
Lessons Learned
1. API-first beats UI-first
Creating the Pipeline via API took 30 seconds. Doing it through the UI takes 5 minutes of clicking through dropdowns. More importantly, the API approach is repeatable, reviewable, and version-controlled.
2. Facets are UX, not functionality
Every field is searchable with @field:value whether or not you create a Facet. Facets just add sidebar convenience. Don't block on Facet setup — your monitors work without them.
3. Category Processor is the key transformation
Without it, analysts need to memorize Windows EventID codes. With it, @operation_name:"Object Delete" is immediately understandable. This single processor provides more investigative value than any other pipeline step.
4. Evaluation delay prevents false positives
The 60-second evaluation_delay on monitors accounts for ingestion latency. Without it, you get spurious alerts when log batches arrive slightly out of order.
5. Saved Views are investigation shortcuts
When a monitor fires at 3 AM, the on-call engineer needs to investigate immediately. Pre-built Saved Views with the right columns and filters eliminate setup time during incidents.
6. Anomaly detection needs baseline data
The anomaly monitor requires at least 2 weeks of historical data to build a reliable baseline. Deploy it early — it won't generate false positives during the learning period (it simply stays silent until confident).
7. Cloud SIEM and Log Monitors are complementary
Log Monitors give you fast Slack/PagerDuty notifications for the on-call team. Cloud SIEM Detection Rules give SOC analysts a structured triage workflow with MITRE mapping. Deploy both — they serve different audiences.
8. GeoIP is free threat intelligence
The GeoIP Processor is zero-cost and instantly useful. Even in a pure on-premises environment, unusual geography flags VPN misconfiguration, compromised credentials, or travel you weren't expecting.
9. Forwarder deployment enables an ecosystem
Once the Datadog Forwarder Lambda is deployed, enabling CloudTrail, GuardDuty, Lambda logs, or any other AWS log source is a single S3 notification configuration — no new infrastructure needed.
CloudTrail Correlation (Cross-Service Detection)
A dedicated detection rule correlates FSxN audit events with AWS CloudTrail context. When the signal fires, the investigation message includes pre-built queries for cross-service correlation:
source:cloudtrail @userIdentity.arn:*<username>* → IAM actions
source:guardduty → GuardDuty findings
source:vpc @network.client.ip:<suspicious-ip> → Network flows
The rule (FSxN: Suspicious Deletion After IAM Role Assumption) fires on >30 deletions in 10 minutes and embeds a correlation checklist:
- Was there a recent AssumeRole or ConsoleLogin from an unusual IP?
- Did the user's permissions change in the last 24 hours?
- Is the client IP in the corporate VPN range?
- Are there GuardDuty findings for this account?
Note: This rule activates fully once CloudTrail logs are flowing to Datadog via the AWS Integration. The FSxN detection works immediately; CloudTrail queries return results after integration setup.
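The query templates above can be filled in programmatically when a signal fires. The helper below is a small sketch, not part of the deployed rule; it assumes the audit log username arrives in DOMAIN\user form and that only the user part is likely to appear in an IAM ARN.

```python
def build_correlation_queries(username, client_ip):
    """Fill the cross-service query templates for one user and client IP."""
    # Assumption: drop a leading "DOMAIN\" prefix before matching against ARNs.
    short_name = username.rsplit("\\", 1)[-1]
    return {
        "iam_actions": f"source:cloudtrail @userIdentity.arn:*{short_name}*",
        "guardduty_findings": "source:guardduty",
        "network_flows": f"source:vpc @network.client.ip:{client_ip}",
    }


queries = build_correlation_queries("CORP\\suspicious-user", "198.51.100.7")
```

Pasting the three resulting strings into Log Explorer gives the IAM, GuardDuty, and network views for the same actor.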
SOC Triage Runbook
A 7-step Notebook guides analysts from signal to resolution:
| Step | Action | Output |
|---|---|---|
| 1. Signal Context | Verify user, IP, time, volume | Timeseries widget |
| 2. User Verification | Check AD status, role changes, maintenance windows | Manual checklist |
| 3. Impact Assessment | Identify affected files and directories | Top List (paths) |
| 4. Cross-Service Correlation | CloudTrail, GuardDuty, VPC Flow Logs | Query templates |
| 5. Decision Matrix | Map conditions to response actions | Decision table |
| 6. Response Actions | Disable account, snapshot, notify, ticket | Checklist |
| 7. Post-Incident | Update thresholds, document, lessons learned | Checklist |
Accessible at: Notebooks → "FSxN Security Signal Triage Runbook"
Case Management
Cases provide structured investigation tracking across the SOC team:
- Project: FSXN, covering all FSx for ONTAP security investigations
- Case creation: Manual from Security Signals, or auto-create via Workflow
- Priority levels: P1 (critical mass deletion) through P4 (informational)
- Lifecycle: Open → In Progress → Resolved → Closed
When a critical signal fires, the workflow creates a case automatically with:
- Signal link and context
- Affected user and IP
- Investigation notebook link
- Response checklist
Threat Intelligence (GeoIP Enrichment)
The Log Pipeline's GeoIP Processor automatically enriches every FSxN log with geographic data from the client_ip field:
@client_ip: 10.0.5.99 → @network.client.geoip: { country: "JP", city: "Tokyo", ... }
A detection rule flags access from unexpected countries:
# Rule: "FSxN: File Access from Unusual Geography"
# Query: source:fsxn -@network.client.geoip.country:JP -@network.client.geoip.country:US
# Threshold: >5 events/15min from non-JP/US IPs
This catches compromised credentials being used from abroad without needing an external threat intelligence feed — Datadog's built-in GeoIP database handles the enrichment at ingest time.
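The rule's logic can be approximated offline to test thresholds against sample data. The sketch below is a local stand-in, not how Cloud SIEM evaluates the rule: it counts events from countries outside an allow-list within a 15-minute window. It deliberately ignores events with no resolved country (for example, private addresses); whether the rule's negated filters also match logs without GeoIP data is worth checking in your own environment.

```python
from datetime import datetime, timedelta

ALLOWED_COUNTRIES = {"JP", "US"}


def unusual_geo_alert(events, now, window_minutes=15, threshold=5,
                      allowed=ALLOWED_COUNTRIES):
    """Return True if more than `threshold` events in the window come from
    countries outside `allowed`.

    events: iterable of (timestamp, country_code) tuples; country_code may
    be None when GeoIP enrichment found nothing.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    hits = sum(
        1
        for ts, country in events
        if cutoff <= ts <= now and country and country not in allowed
    )
    return hits > threshold


now = datetime(2026, 6, 14, 22, 0)
sample = [(now - timedelta(minutes=5), "FR")] * 6
fires = unusual_geo_alert(sample, now)
```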
GuardDuty Correlation
A detection rule correlates FSxN deletion events with GuardDuty findings from the same source IP:
FSxN: >10 deletions/30min from IP X
+ GuardDuty finding for IP X (UnauthorizedAccess, Recon, Trojan)
= Critical Security Signal with full context
The rule's investigation message includes pre-built GuardDuty queries:
source:guardduty @detail.resource.instanceDetails.networkInterfaces.privateIpAddress:{{@client_ip}}
GuardDuty findings flow to Datadog via the same Forwarder Lambda that handles CloudTrail — no additional setup needed.
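Conceptually, the correlation is a join on source IP. The following sketch shows that join over in-memory data, assuming you already have the deletion events' client IPs and the set of IPs named in GuardDuty findings for the same 30-minute window; it is an illustration of the idea, not the detection rule itself.

```python
from collections import Counter


def correlate_deletions_with_findings(deletion_ips, guardduty_ips, threshold=10):
    """Return IPs with more than `threshold` deletions that also appear
    in GuardDuty findings.

    deletion_ips: one entry per deletion event (client IP of the event).
    guardduty_ips: set of IPs referenced by GuardDuty findings.
    """
    counts = Counter(deletion_ips)
    return sorted(
        ip for ip, n in counts.items()
        if n > threshold and ip in guardduty_ips
    )


deletions = ["198.51.100.7"] * 11 + ["203.0.113.9"] * 11 + ["192.0.2.1"] * 3
flagged = correlate_deletions_with_findings(deletions, {"198.51.100.7", "192.0.2.1"})
```

Only the first address is flagged: the second has no finding, and the third is below the deletion threshold.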
Automated Snapshot Remediation
When mass deletion is confirmed (Critical signal reviewed by an analyst), the Workflow invokes the fsxn-snapshot-remediation Lambda:
Security Signal (Critical) → Analyst confirms → Workflow → Lambda → ONTAP REST API → Snapshot
The Lambda:
- Receives volume name, SVM, and reason from the Workflow
- Authenticates to ONTAP via Secrets Manager credentials
- Creates a timestamped snapshot: remediation_20260614_215530_mass_deletion
- Returns snapshot name and status for the Case record
# Lambda invocation payload (from Datadog Workflow)
{
"volume_name": "finance_share",
"svm_name": "ProductionSVM",
"reason": "Mass deletion detected",
"user": "CORP\\suspicious-user"
}
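One way to produce a snapshot name in the format shown in the list above is sketched here. The slug rule (lowercase, non-alphanumerics collapsed to underscores) is an assumption; the article does not show how the Lambda derives the suffix from the `reason` field.

```python
import re
from datetime import datetime, timezone


def build_snapshot_name(reason_slug, now=None):
    """Return remediation_<YYYYMMDD>_<HHMMSS>_<slug>."""
    now = now or datetime.now(timezone.utc)
    # Assumed slug rule: lowercase, runs of other characters become "_".
    slug = re.sub(r"[^a-z0-9]+", "_", reason_slug.lower()).strip("_")
    return f"remediation_{now:%Y%m%d_%H%M%S}_{slug}"


name = build_snapshot_name("mass deletion", datetime(2026, 6, 14, 21, 55, 30))
```

The `remediation_` prefix matters beyond readability: it lets later queries select only pipeline-created snapshots.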
Configuration: Set ONTAP_MGMT_IP and ONTAP_CREDENTIALS_SECRET_ARN environment variables on the Lambda to point to your FSx for ONTAP management endpoint.
TLS note: The Lambda uses cert_reqs="CERT_NONE" for ONTAP REST API calls because FSx for ONTAP uses self-signed certificates by default. In production, upload the ONTAP CA certificate to the Lambda layer (/opt/certs/ontap-ca.pem) and configure urllib3.PoolManager(ca_certs="/opt/certs/ontap-ca.pem") to validate the connection.
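A small helper can make the switch between the two TLS modes explicit. This is a sketch under the assumption that the CA bundle lives at the layer path mentioned above; the function name is hypothetical, and the returned dictionary is meant to be passed as keyword arguments to `urllib3.PoolManager`.

```python
import os


def ontap_tls_kwargs(ca_path="/opt/certs/ontap-ca.pem"):
    """Return TLS keyword arguments for urllib3.PoolManager.

    Validates against the ONTAP CA when the file is present; otherwise
    falls back to the unverified mode described in the TLS note.
    """
    if os.path.isfile(ca_path):
        return {"cert_reqs": "CERT_REQUIRED", "ca_certs": ca_path}
    # No CA bundle deployed: certificate validation is disabled.
    return {"cert_reqs": "CERT_NONE"}


tls_kwargs = ontap_tls_kwargs()
# Usage inside the Lambda (requires urllib3):
#   http = urllib3.PoolManager(**ontap_tls_kwargs())
```

In a production build you may prefer to raise an error instead of falling back, so a missing certificate cannot silently disable validation.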
Snapshot Storm Prevention (Cooldown)
To prevent runaway Snapshot creation during sustained mass-deletion events:
# Before creating a snapshot, check for recent remediation snapshots.
# Assumes json is imported and http, mgmt_ip, vol_uuid, base_headers are set earlier in the handler.
existing = http.request(
    "GET",
    f"https://{mgmt_ip}/api/storage/volumes/{vol_uuid}/snapshots"
    # Request create_time explicitly in case the collection GET omits it by default
    "?name=remediation_*&fields=create_time"
    "&order_by=create_time%20desc&max_records=1",
    headers=base_headers,
)
if existing.status == 200:
    snaps = json.loads(existing.data).get("records", [])
    if snaps:
        last_snap_time = snaps[0].get("create_time", "")
        # Skip if a remediation snapshot was created in the last 15 minutes
        if is_within_cooldown(last_snap_time, minutes=15):
            return {"statusCode": 200, "body": "Skipped — cooldown active"}
ONTAP limits: Each volume supports up to 1023 Snapshots. The cooldown prevents hitting this limit during sustained attack scenarios.
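The cooldown check calls `is_within_cooldown`, which is not shown in this article. A minimal version might look like the following, assuming `create_time` is an ISO 8601 string with either a `Z` suffix or a numeric UTC offset; the optional `now` parameter exists only to make the function testable.

```python
from datetime import datetime, timedelta, timezone


def is_within_cooldown(create_time, minutes=15, now=None):
    """Return True if create_time falls within the last `minutes` minutes."""
    if not create_time:
        return False
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        created = datetime.fromisoformat(create_time.replace("Z", "+00:00"))
    except ValueError:
        # Unparseable timestamp: do not block remediation.
        return False
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # assume UTC
    now = now or datetime.now(timezone.utc)
    # A slightly future timestamp (clock skew) also counts as "recent".
    return now - created < timedelta(minutes=minutes)


reference = datetime(2026, 6, 14, 22, 0, 0, tzinfo=timezone.utc)
recent = is_within_cooldown("2026-06-14T21:55:30Z", minutes=15, now=reference)
```

Returning False on a missing or malformed timestamp is a deliberate choice here: during an incident, taking one extra snapshot is cheaper than skipping one that was needed.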
Remediation Audit Trail
The Lambda invocation itself must be auditable — proving that the Snapshot was created by an authorized pipeline, not a rogue actor:
- CloudTrail: Lambda invocation recorded with InvokedBy: workflow.datadoghq.com
- ONTAP audit log: Snapshot creation appears as an administrative event with the API user
- Datadog Case: Snapshot name and status recorded in the Case timeline
- Lambda CloudWatch Logs: Full request/response logged with correlation ID
What's Next
This article covered the complete detection-to-response lifecycle. Future enhancements:
- Datadog Threat Intel feeds — When Datadog makes the Threat Intel Indicators API available on AP1, feed internal IP reputation lists for richer enrichment
- Multi-SVM snapshot orchestration — Extend the remediation Lambda to snapshot across multiple SVMs in parallel
- Automated account lockout — Chain the Workflow to invoke AD lockout via Systems Manager Run Command









