

From Logs to Detection: Building a File Security Pipeline in Datadog for FSx for ONTAP

TL;DR

For security teams: Four threshold monitors + one ML anomaly monitor + four Cloud SIEM detection rules catch mass file deletion, data exfiltration, permission tampering, and unusual geography in under 5 minutes. EventID codes become human-readable operation names. All created via Datadog API — no manual clicking required.

For engineers: The Log Pipeline (6 processors including GeoIP enrichment) transforms raw ONTAP XML events into searchable, alertable fields. Five Saved Views cover the most common investigation patterns. Total AWS cost remains ~$1.50/month.

For architects: This pattern demonstrates the full detection-to-response lifecycle — from raw file system audit events through enrichment, categorization, alerting, cross-service correlation, and automated remediation — entirely serverless, entirely as code.

FSx for ONTAP → S3 Access Point → Lambda → Datadog Logs API v2
                                              │
                                              ▼
                                   ┌──────────────────────┐
                                   │ Log Pipeline (6)     │
                                   │  • Category Processor│
                                   │  • Status Remapper   │
                                   │  • Date Remapper     │
                                   │  • Attribute Remapper│
                                   │  • GeoIP Enrichment  │
                                   └──────────┬───────────┘
                                              │
                          ┌───────────────────┼───────────────────┐
                          ▼                   ▼                   ▼
                ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
                │ Log Monitors (4)│ │ Cloud SIEM (4)  │ │ Log Archive     │
                │ + Anomaly (1)   │ │ Security Signals│ │ → S3 → Glacier  │
                └────────┬────────┘ └────────┬────────┘ └─────────────────┘
                         │                   │
                         ▼                   ▼
                ┌──────────────────────────────────────┐
                │ Workflow → Slack + Case + Lambda     │
                │          (ONTAP Snapshot remediation)│
                └──────────────────────────────────────┘

Reading guide: Sections 1-9 cover the core pipeline (30 min to deploy). Sections 10+ cover advanced SOC integration (additional 15 min). Jump to Deployment if you want to start immediately.

Business value: This pipeline reduces mean-time-to-detect (MTTD) for file-based threats from "hours/days" (batch log review) to under 5 minutes (real-time alerting). For regulated environments, it provides continuous compliance evidence with automated archival — replacing manual quarterly reviews with always-on monitoring.

This is Part 16 of Serverless Observability for FSx for ONTAP.


Why Post-Ingestion Processing Matters

In Part 1, we shipped raw audit events to Datadog. That's necessary but insufficient. Raw events look like this:

{
  "event_type": "4660",
  "user": "CORP\\contractor-ext-03",
  "path": "/share/engineering/designs/prototype-v3.dwg",
  "result": "Audit Failure"
}

Questions that raw logs can't answer quickly:

  • What is EventID 4660? (Answer: Object Delete)
  • Should this alarm? (Answer: depends on volume and context)
  • Who should investigate? (Answer: Storage team + SOC)
  • What's the user's baseline? (Answer: need faceted historical view)

This article builds the processing layers that transform raw data into actionable security intelligence.


Architecture

Why EventBridge Scheduler? FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge object-level events. Lambda polls on a schedule and uses checkpointing to process only new files. XML format is chosen for Lambda-native parsing without binary dependencies (vs EVTX which requires Windows-specific libraries).
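The checkpointing idea can be sketched in a few lines. This is a minimal illustration, not the handler's actual code: it assumes audit file keys sort lexicographically in chronological order, and `select_new_keys` and the sample keys are hypothetical names.

```python
from typing import Iterable, List, Optional, Tuple


def select_new_keys(keys: Iterable[str], checkpoint: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Return the keys that sort after the checkpoint, plus the next checkpoint."""
    ordered = sorted(keys)
    new = [k for k in ordered if checkpoint is None or k > checkpoint]
    # Keep the old checkpoint when nothing new arrived.
    next_checkpoint = new[-1] if new else checkpoint
    return new, next_checkpoint


# Hypothetical object keys; real ONTAP audit file names will differ.
listed = [
    "audit/2026-06-01T00-00.xml",
    "audit/2026-06-01T00-05.xml",
    "audit/2026-06-01T00-10.xml",
]
new_keys, checkpoint = select_new_keys(listed, "audit/2026-06-01T00-00.xml")
print(new_keys, checkpoint)
```

Persisting the returned checkpoint only after a successful ship means a failed run is retried on the next schedule tick rather than skipped.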

┌──────────────────────────────────────────────────────────────────┐
│ Lambda Handler (Python 3.12)                                     │
│  • Parse XML → Normalize → HEC-compatible JSON                   │
│  • POST to Datadog Logs API v2                                   │
│  • Fields: event_type, user, path, client_ip, svm, result        │
└──────────────────────────────────────────────────────────────────┘
        │  HTTP 202
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ Datadog Log Pipeline: "FSx for ONTAP Audit Logs"                 │
│ Filter: source:fsxn                                              │
│                                                                  │
│ 1. Category Processor → @operation_name                          │
│    4663→Object Access, 4660→Object Delete, 4656→Handle Request   │
│                                                                  │
│ 2. Status Remapper → log severity from @result                   │
│    "Audit Success" → info, "Audit Failure" → error               │
│                                                                  │
│ 3. Date Remapper → @timestamp as official log time               │
│                                                                  │
│ 4. Attribute Remapper → @user→usr.id, @client_ip→network.client  │
└──────────────────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ Datadog Security Monitors (3)                                    │
│                                                                  │
│ • [FSxN] Mass File Deletion    → >50 deletes/5min per user       │
│ • [FSxN] Abnormal Access Volume → >1000 accesses/1h per user     │
│ • [FSxN] Access Failure Spike  → >10 failures/15min per user     │
└──────────────────────────────────────────────────────────────────┘
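The "Parse XML → Normalize" step in the handler box might look roughly like the sketch below. The XML layout and element names (`SubjectUserName`, `ObjectName`, `IpAddress`, `Result`) are assumptions loosely modeled on Windows-style audit XML, so check them against a real file from your SVM. The output field names follow the list in the diagram, plus `timestamp` for the Date Remapper and `ddsource` for the pipeline filter.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical event layout; verify element names against your own audit files.
SAMPLE = """<Event>
  <System>
    <EventID>4660</EventID>
    <TimeCreated SystemTime="2026-06-01T09:15:00Z"/>
    <Computer>svm-prod-01</Computer>
  </System>
  <EventData>
    <Data Name="SubjectUserName">CORP\\contractor-ext-03</Data>
    <Data Name="ObjectName">/share/engineering/designs/prototype-v3.dwg</Data>
    <Data Name="IpAddress">10.0.12.34</Data>
    <Data Name="Result">Audit Failure</Data>
  </EventData>
</Event>"""


def normalize(event_xml: str) -> dict:
    """Flatten one audit event into the fields the pipeline expects."""
    root = ET.fromstring(event_xml)
    data = {d.get("Name"): (d.text or "") for d in root.iter("Data")}
    return {
        "ddsource": "fsxn",  # matches the pipeline filter source:fsxn
        "event_type": root.findtext("System/EventID", default=""),
        "timestamp": root.find("System/TimeCreated").get("SystemTime"),
        "svm": root.findtext("System/Computer", default=""),
        "user": data.get("SubjectUserName", ""),
        "path": data.get("ObjectName", ""),
        "client_ip": data.get("IpAddress", ""),
        "result": data.get("Result", ""),
    }


record = normalize(SAMPLE)
print(json.dumps(record, indent=2))
```

Only the standard library is used, which is the point of choosing XML over EVTX for a Lambda runtime.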

The Log Pipeline


Why a Pipeline?

Without a pipeline, every query requires remembering that 4660 means "delete" and 4663 means "access." With a pipeline, you search @operation_name:"Object Delete" (quoted, because the value contains a space) and Datadog handles the translation at ingest time.

Creating the Pipeline via API

import json, urllib3, boto3

sm = boto3.client('secretsmanager', region_name='ap-northeast-1')
api_key = sm.get_secret_value(SecretId='fsxn-datadog-api-key')['SecretString']
app_key = sm.get_secret_value(SecretId='datadog/fsxn-app-key')['SecretString']

http = urllib3.PoolManager()

pipeline = {
    "name": "FSx for ONTAP Audit Logs",
    "is_enabled": True,
    "filter": {"query": "source:fsxn"},
    "processors": [
        {
            "type": "category-processor",
            "name": "EventID to Operation Name",
            "is_enabled": True,
            "target": "operation_name",
            "categories": [
                {"filter": {"query": "@event_type:4663"}, "name": "Object Access"},
                {"filter": {"query": "@event_type:4656"}, "name": "Handle Request"},
                {"filter": {"query": "@event_type:4660"}, "name": "Object Delete"},
                {"filter": {"query": "@event_type:4670"}, "name": "Permission Change"},
                {"filter": {"query": "@event_type:4658"}, "name": "Handle Close"},
                {"filter": {"query": "@event_type:5140"}, "name": "Share Access"},
                {"filter": {"query": "@event_type:5145"}, "name": "Share Check"},
                {"filter": {"query": "@event_type:4624"}, "name": "Logon"},
                {"filter": {"query": "@event_type:4634"}, "name": "Logoff"},
            ]
        },
        {
            "type": "status-remapper",
            "name": "Map result to log status",
            "is_enabled": True,
            "sources": ["result"]
        },
        {
            "type": "date-remapper",
            "name": "Use event timestamp",
            "is_enabled": True,
            "sources": ["timestamp"]
        },
        {
            "type": "attribute-remapper",
            "name": "Map user to usr.id",
            "is_enabled": True,
            "sources": ["user"],
            "source_type": "attribute",
            "target": "usr.id",
            "target_type": "attribute",
            "preserve_source": True,
            "override_on_conflict": False
        },
        {
            "type": "attribute-remapper",
            "name": "Map client_ip to network.client.ip",
            "is_enabled": True,
            "sources": ["client_ip"],
            "source_type": "attribute",
            "target": "network.client.ip",
            "target_type": "attribute",
            "preserve_source": True,
            "override_on_conflict": False
        }
    ]
}

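# Create the pipeline. Configuration endpoints need both the API key and the APP key.
resp = http.request("POST",
    "https://api.ap1.datadoghq.com/api/v1/logs/config/pipelines",
    body=json.dumps(pipeline).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key
    })
# Fail loudly on errors instead of raising a KeyError on a missing 'id'.
if resp.status != 200:
    raise RuntimeError(f"Pipeline creation failed: HTTP {resp.status}: {resp.data.decode('utf-8', 'replace')}")
print(f"HTTP {resp.status}: Pipeline ID = {json.loads(resp.data)['id']}")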

Site note: Replace api.ap1.datadoghq.com with your Datadog site's API endpoint (e.g., api.datadoghq.com for US1, api.datadoghq.eu for EU1).

EventID Mapping Table

| EventID | Operation Name | MITRE ATT&CK | Description |
|---|---|---|---|
| 4663 | Object Access | T1005 (Data from Local System) | File read/write operation |
| 4656 | Handle Request | T1005 | File handle opened |
| 4660 | Object Delete | T1485 (Data Destruction) | File deleted |
| 4670 | Permission Change | T1222 (File Permissions Modification) | ACL/permission modified |
| 4658 | Handle Close | | File handle closed (low signal) |
| 5140 | Share Access | T1021.002 (SMB/Windows Admin) | SMB share connected |
| 5145 | Share Check | T1135 (Network Share Discovery) | Share permission checked |
| 4624 | Logon | T1078 (Valid Accounts) | Authentication event |
| 4634 | Logoff | | Session ended |

Security Monitors


Monitor 1: Mass File Deletion

mass_delete = {
    "name": "[FSxN] Mass File Deletion Detected",
    "type": "log alert",
    "query": 'logs("source:fsxn @event_type:4660").index("*").rollup("count").by("@user").last("5m") > 50',
    "message": """## Mass File Deletion Alert

A user has triggered more than 50 file deletion events within 5 minutes.

**User**: {{@user.name}}  |  **Count**: {{value}}

### Investigation Steps
1. Check affected paths: `source:fsxn @event_type:4660 @user:{{@user.name}}`
2. Verify if this is a scheduled cleanup or authorized bulk operation
3. Check the client IP for unexpected sources
4. Correlate with user's normal deletion patterns

@slack-storage-alerts""",
    "tags": ["source:fsxn", "team:storage", "severity:high"],
    "options": {
        "thresholds": {"critical": 50, "warning": 20},
        "notify_no_data": False,
        "renotify_interval": 60,
        "evaluation_delay": 60
    }
}

Why 50? In our test environment, normal daily deletions per user stay under 10. Tune the threshold to your environment's baseline: in Log Explorer, search source:fsxn @event_type:4660, group the count by @user, and review a week of data to establish what's normal.

Monitor 2: Abnormal Access Volume

abnormal_access = {
    "name": "[FSxN] Abnormal Access Volume",
    "type": "log alert",
    "query": 'logs("source:fsxn @result:\\"Audit Success\\"").index("*").rollup("count").by("@user").last("1h") > 1000',
    "message": """## Abnormal Access Volume Alert

More than 1000 successful file access events in 1 hour.
May indicate data exfiltration or unauthorized bulk access.

**User**: {{@user.name}}  |  **Count**: {{value}}

### Investigation Steps
1. Review patterns: `source:fsxn @user:{{@user.name}}`
2. Check if backup jobs or batch operations are running
3. Verify client IP and compare with known locations
4. Check accessed paths for sensitive content

@slack-security-alerts""",
    "options": {
        "thresholds": {"critical": 1000, "warning": 500},
        "evaluation_delay": 60
    }
}

Tuning note: Backup service accounts (svc-backup, svc-indexer) generate legitimate high-volume access. Exclude them by adding -@user:svc-* to the query (in Datadog search syntax the minus sign goes before the attribute), or create a suppression rule.

Monitor 3: Access Failure Spike

failure_spike = {
    "name": "[FSxN] Access Failure Spike",
    "type": "log alert",
    "query": 'logs("source:fsxn @result:\\"Audit Failure\\"").index("*").rollup("count").by("@user").last("15m") > 10',
    "message": """## Access Failure Spike

More than 10 access failures in 15 minutes.
May indicate unauthorized access attempts or permission misconfiguration.

**User**: {{@user.name}}  |  **Count**: {{value}}

### Investigation Steps
1. Check failed paths: `source:fsxn @result:"Audit Failure" @user:{{@user.name}}`
2. Verify if permissions were recently changed
3. Check if user is accessing resources outside their scope
4. Correlate with AD group membership changes

@slack-security-alerts""",
    "options": {
        "thresholds": {"critical": 10, "warning": 5},
        "evaluation_delay": 60
    }
}
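The three monitor dictionaries above still have to be sent to the monitor endpoint (POST /api/v1/monitor). A minimal sketch of that step follows. `build_monitor_request` is a hypothetical helper that only assembles the request so it can be inspected or unit-tested before any network call, and the monitor and keys shown are placeholders.

```python
import json


def build_monitor_request(site: str, monitor: dict, api_key: str, app_key: str):
    """Return (url, headers, body) for POST /api/v1/monitor without sending it."""
    url = f"https://api.{site}/api/v1/monitor"
    headers = {
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }
    return url, headers, json.dumps(monitor).encode("utf-8")


# Placeholder monitor and keys; substitute mass_delete, abnormal_access,
# or failure_spike and the secrets fetched from Secrets Manager.
example = {
    "name": "[FSxN] Example",
    "type": "log alert",
    "query": 'logs("source:fsxn").index("*").rollup("count").last("5m") > 50',
    "message": "example",
    "options": {"thresholds": {"critical": 50}},
}
url, headers, body = build_monitor_request("ap1.datadoghq.com", example, "<api-key>", "<app-key>")
# Send with: urllib3.PoolManager().request("POST", url, body=body, headers=headers)
print(url)
```

Deriving the host from the site name keeps the same script usable across AP1, US1, and EU1.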

Saved Views

Five pre-configured views for common investigation patterns:

| View | Query | When to use |
|---|---|---|
| FSxN File Deletions | source:fsxn @event_type:4660 | Monitor triggered, investigating which files |
| FSxN Access Failures | source:fsxn @result:"Audit Failure" | Permission denied investigation |
| FSxN All Events | source:fsxn | General audit stream |
| FSxN Sensitive Share Access | source:fsxn (@path:*finance* OR @path:*hr* OR @path:*legal*) | Sensitive data access review |
| FSxN After-Hours Access | source:fsxn (filtered by time) | Off-hours activity detection |

Each view includes pre-configured columns (user, path, client_ip, event_type) so analysts don't need to customize the table every time.


Facets for One-Click Filtering


Facets add clickable filters to the left sidebar. Create them from any log entry's detail panel:

| Facet | Field | Why |
|---|---|---|
| Event Type | @event_type | Filter to specific operations (4660=delete, 4663=access) |
| Operation | @operation_name | Human-readable version (after pipeline applies) |
| User | @user | Filter by specific user under investigation |
| SVM | @svm | Isolate events to specific storage virtual machines |
| File Path | @path | Filter by directory or share |
| Client IP | @client_ip | Filter by source workstation/server |
| Result | @result | Quick split between success and failure events |

Note: Fields are searchable with @field:value syntax even without Facets. Facets add UI convenience — they're not required for alerting or saved views.


Cost Reality

| Component | Monthly Cost | Notes |
|---|---|---|
| Lambda (5-min schedule, ~1s execution) | ~$0.50 | 8,640 invocations/month |
| Secrets Manager (2 secrets) | ~$0.80 | API key + APP key |
| EventBridge Scheduler | ~$0.00 | Free tier covers this |
| S3 AP reads | ~$0.05 | Depends on file count |
| Datadog Forwarder Lambda | ~$0.10 | CloudTrail event processing |
| S3 Archive storage | ~$0.02 | Glacier tier after 30 days |
| AWS total | ~$1.50/month | |
| Datadog log ingestion | ~$0.10/GB | Conventional pricing |
| Total (1GB/month) | ~$1.60/month | |
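The figures in the table can be reproduced with a few lines of arithmetic. The line items are copied from the table, and a 30-day month is assumed for the invocation count. The exact sum is slightly below the rounded totals the table reports.

```python
# Line items copied from the cost table (USD per month, approximate).
aws_costs = {
    "lambda": 0.50,
    "secrets_manager": 0.80,
    "eventbridge_scheduler": 0.00,
    "s3_ap_reads": 0.05,
    "forwarder_lambda": 0.10,
    "s3_archive": 0.02,
}

invocations_per_month = (60 // 5) * 24 * 30    # 5-minute schedule, 30-day month
aws_total = round(sum(aws_costs.values()), 2)  # 1.47, reported as ~$1.50
ingest_gb = 1
total = round(aws_total + 0.10 * ingest_gb, 2) # 1.57, reported as ~$1.60
print(invocations_per_month, aws_total, total)
```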

Datadog pricing caveat: Log ingestion pricing varies by plan (Conventional vs On-Demand) and contract. The ~$0.10/GB is the list price for conventional plans; actual cost depends on your contract, committed volume, and retention choices.

Cloud SIEM pricing: Cloud SIEM is a separate paid feature (~$0.20/GB analyzed for log detection). A 30-day free trial is available. Security Signals, Detection Rules, and the triage workflow require Cloud SIEM. Log Monitors (threshold alerts to Slack) work without it.

Cost optimization patterns

  1. Filter before shipping — Drop low-value EventIDs (4658 Handle Close) at the Lambda level to reduce ingestion volume
  2. Compress payloads — Enable gzip on the HTTP POST to reduce egress
  3. Tune audit scope — Configure ONTAP vserver audit to monitor only the shares that matter
  4. Index retention — Use Datadog's flexible retention (15/30/60/90 days) to match compliance needs vs cost
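The first two patterns can be combined in the shipping step. The sketch below drops EventID 4658 and gzips the remaining batch. `prepare_payload` is a hypothetical helper, and the event dictionaries are sample data in the field layout used throughout this article.

```python
import gzip
import json

LOW_VALUE_EVENT_IDS = {"4658"}  # Handle Close, marked low signal in the mapping table


def prepare_payload(events):
    """Filter low-value events, then gzip the JSON batch for the HTTP POST."""
    kept = [e for e in events if e.get("event_type") not in LOW_VALUE_EVENT_IDS]
    raw = json.dumps(kept).encode("utf-8")
    headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}
    return gzip.compress(raw), headers, len(kept)


events = [
    {"event_type": "4660", "user": "a"},
    {"event_type": "4658", "user": "a"},
]
body, headers, kept = prepare_payload(events)
print(kept, len(body))
```

Filtering before serialization means dropped events cost neither ingestion volume nor egress.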

E2E Verification Results

Verified on Datadog AP1 (ap1.datadoghq.com) with paid plan, June 2026:

| Component | Status |
|---|---|
| Lambda → Datadog Logs API v2 | ✅ HTTP 202 |
| Log Pipeline (5 processors) | ✅ Applied automatically |
| Category Processor (9 EventIDs) | @operation_name populated |
| Status Remapper | ✅ Failure events marked as error |
| Date Remapper | ✅ Event timestamp used (not ingest time) |
| Attribute Remapper (usr.id) | ✅ Enables Datadog user features |
| Mass Delete Monitor | ALERT triggered (55 events > threshold 50) |
| Access Failure Monitor | ALERT triggered (12 events > threshold 10) |
| Abnormal Access Monitor | ✅ Active (OK, high threshold by design) |
| Anomaly Monitor (ML) | ✅ Active (learning baseline) |
| Saved Views (5) | ✅ Accessible from Views dropdown |
| Facets (8 custom) | ✅ Created via UI |
| Dashboard (10 widgets) | ✅ FSx ONTAP Audit Log Overview |
| Log-based Metrics (4) | fsxn.audit.* in Metrics Explorer |
| Sensitive Data Scanner (5 rules) | ✅ PII detection active |
| Cloud SIEM Detection Rules (4) | ✅ Security Signals generated |
| CloudTrail Forwarder | ✅ Logs arriving in Datadog |
| GeoIP Enrichment | ✅ Country/city populated on client_ip |
| GuardDuty Correlation Rule | ✅ Cross-service detection active |
| Workflow Automation | ✅ Monitor → Slack + Case |
| Case Management (FSXN project) | ✅ Project + template case |
| SOC Triage Runbook | ✅ 7-step Notebook |
| Log Archive (S3 → Glacier) | ✅ source:fsxn archived |
| Snapshot Remediation Lambda | ✅ Deployed (fsxn-snapshot-remediation) |

Log-based Metrics

Log-based metrics extract numerical values from logs at ingest time — you get metric-resolution dashboards and anomaly detection without paying for log retention beyond the evaluation window.

# Created via Datadog API (no UI needed)
metrics = [
    "fsxn.audit.delete_count",         # Delete events grouped by user/SVM
    "fsxn.audit.access_failure_count",  # Failures by user/SVM/client_ip
    "fsxn.audit.event_count",           # All events by event_type/SVM
    "fsxn.audit.unique_users",          # Active users
]


Use cases:

  • Anomaly detection on fsxn.audit.delete_count — Datadog's built-in anomaly monitor catches unusual deletion patterns without manual thresholds
  • SLO tracking — Define an SLO on access failure rate (fsxn.audit.access_failure_count / fsxn.audit.event_count)
  • Cost efficiency — Metrics are retained for 15 months at metric resolution (vs log retention at full-text cost)
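For reference, a count-type log-based metric definition for the logs metrics endpoint (POST /api/v2/logs/config/metrics) can be assembled as below. This is a sketch under assumptions: the filter query and group-by tags are inferred from the comments in the metric list above, and `logs_metric_payload` is a hypothetical helper rather than part of the setup script.

```python
import json


def logs_metric_payload(metric_id: str, query: str, group_by: dict) -> dict:
    """Body for POST /api/v2/logs/config/metrics using count aggregation."""
    return {
        "data": {
            "type": "logs_metrics",
            "id": metric_id,  # the metric name as it appears in Metrics Explorer
            "attributes": {
                "compute": {"aggregation_type": "count"},
                "filter": {"query": query},
                "group_by": [{"path": p, "tag_name": t} for p, t in group_by.items()],
            },
        }
    }


payload = logs_metric_payload(
    "fsxn.audit.delete_count",
    "source:fsxn @event_type:4660",
    {"@user": "user", "@svm": "svm"},
)
print(json.dumps(payload, indent=2))
```

Each entry in `group_by` becomes a tag on the metric, which is what drives the cardinality discussed later in this article.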

Sensitive Data Scanner

Audit logs frequently contain PII in file paths (/hr/EMP-123456-performance-review.xlsx) and user contexts. Datadog's Sensitive Data Scanner catches these before they reach analysts.

| Rule | What It Catches | Why It Matters |
|---|---|---|
| Employee ID (EMP-\d{6}) | Employee records in paths | PII exposure prevention |
| JP Phone (0[789]0-\d{4}-\d{4}) | Mobile numbers in filenames | Privacy protection |
| Email in Path | Customer emails in file names | Data minimization |
| Credit Card | CC numbers in documents | PCI-DSS technical control |
| My Number | Japanese national ID | Sensitive PII detection |

Compliance note: Sensitive Data Scanner is a technical detection and redaction control. It does not constitute full APPI/GDPR/PCI-DSS compliance, which requires organizational measures (consent management, purpose limitation, data subject rights, etc.). Consult your legal/compliance team for regulatory assessment.


Matched patterns are partially redacted at ingest — the first 3 characters remain for identification, the rest is replaced with [REDACTED]. This preserves investigation capability while protecting the data subject.
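To make the partial-redaction behavior concrete, here is a local illustration using the Employee ID pattern from the table. It mimics the described outcome only. It is not how the Sensitive Data Scanner is implemented, and `partially_redact` is a hypothetical helper.

```python
import re

EMPLOYEE_ID = re.compile(r"EMP-\d{6}")  # pattern from the rules table


def partially_redact(text: str, pattern: re.Pattern, keep: int = 3) -> str:
    """Keep the first `keep` characters of each match and mask the rest."""
    return pattern.sub(lambda m: m.group(0)[:keep] + "[REDACTED]", text)


redacted = partially_redact("/hr/EMP-123456-performance-review.xlsx", EMPLOYEE_ID)
print(redacted)  # /hr/EMP[REDACTED]-performance-review.xlsx
```

An analyst can still see that an employee record was touched and in which share, without seeing whose record it was.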


Enhanced Dashboard


The SOC-focused dashboard provides at-a-glance visibility into:

  • Who is generating the most activity (top users, top IPs)
  • What operations dominate (sunburst by operation/SVM)
  • Where access is concentrated (hot paths)
  • When failures spike (timeline with per-user breakdown)
  • Trend from log-based metrics (delete rate over time)

Cloud SIEM Security Signals

Beyond log monitors (which notify via Slack/PagerDuty), Security Signals integrate with Datadog's SOC workflow — triage, investigation, and response all in one place.

Log Monitors vs Cloud SIEM Detection Rules

| Aspect | Log Monitors | Cloud SIEM Detection Rules |
|---|---|---|
| Output | Alert notification (Slack/PagerDuty/email) | Security Signal (appears in Security Signals panel) |
| Triage | Manual (click alert → investigate) | Built-in triage workflow (Open → In Progress → Archived) |
| MITRE mapping | Manual tags | Native MITRE ATT&CK framework integration |
| Correlation | Single query | Cross-log-source correlation possible |
| Case integration | None | Direct "Create Case" from signal |
| Best for | Operational alerts (DevOps/SRE) | Security investigation (SOC/IR teams) |

Recommendation: Use both. Log Monitors for immediate Slack notification to storage team. Cloud SIEM Detection Rules for SOC triage workflow and MITRE-mapped investigation.

Cloud SIEM Detection Rules

Three detection rules generate Security Signals when FSxN audit logs match threat patterns:

| Rule | MITRE ATT&CK | Trigger | Severity |
|---|---|---|---|
| Mass File Deletion | T1485 Data Destruction | >50 deletes/5min per user | Critical |
| Brute Force File Access | T1110 Brute Force | >20 failures/15min per user+IP | High |
| Permission Tampering | T1222 File Permissions Modification | >5 changes/10min per user | High |

# Creating a Security Signal rule via API
rule = {
    "name": "FSxN: Mass File Deletion",
    "isEnabled": True,  # expected by the rule-creation payload
    "type": "log_detection",
    "queries": [{
        "query": "source:fsxn @event_type:4660",
        "groupByFields": ["@user"],
        "aggregation": "count",
        "name": "delete_events",
        "dataSource": "logs"
    }],
    "cases": [
        {"condition": "delete_events > 50", "status": "critical", "name": "Critical"},
        {"condition": "delete_events > 20", "status": "high", "name": "High"}
    ],
    "options": {
        "evaluationWindow": 300,
        "keepAlive": 3600,
        "maxSignalDuration": 86400,
        "detectionMethod": "threshold"
    },
    "message": "User {{@user}} deleted {{value}} files in 5 min. T1485.",
    "tags": ["source:fsxn", "technique:T1485-data-destruction"]
}

Each signal includes investigation steps and response guidance directly in the signal panel — analysts don't need to leave the Security Signals view to understand what happened.
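The other two rules in the table follow the same shape, so a small factory keeps them consistent. This is a sketch under assumptions: the table gives only the triggers, so the queries, tag names, and the `threshold_rule` helper are illustrative guesses, and `isEnabled` is included on the understanding that the rule-creation endpoint expects it.

```python
def threshold_rule(name, query, group_by, threshold, window_s, severity, technique_tag):
    """Build a threshold-type log detection rule with a single case."""
    return {
        "name": name,
        "type": "log_detection",
        "isEnabled": True,
        "queries": [{
            "query": query,
            "groupByFields": group_by,
            "aggregation": "count",
            "name": "events",
        }],
        "cases": [{
            "condition": f"events > {threshold}",
            "status": severity,
            "name": severity.title(),
        }],
        "options": {
            "evaluationWindow": window_s,
            "keepAlive": 3600,
            "maxSignalDuration": 86400,
            "detectionMethod": "threshold",
        },
        "message": f"{name}: more than {threshold} events in {window_s // 60} min.",
        "tags": ["source:fsxn", technique_tag],
    }


# Queries and tag names are assumptions derived from the trigger column.
rules = [
    threshold_rule("FSxN: Brute Force File Access",
                   'source:fsxn @result:"Audit Failure"', ["@user", "@client_ip"],
                   20, 900, "high", "technique:T1110-brute-force"),
    threshold_rule("FSxN: Permission Tampering",
                   "source:fsxn @event_type:4670", ["@user"],
                   5, 600, "high", "technique:T1222-file-permissions-modification"),
]
print([r["name"] for r in rules])
```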

Cloud SIEM Setup Steps


  1. Navigate to Security → Cloud SIEM → Get Started
  2. Skip Content Packs (FSxN is a custom source)
  3. Select fsxn as a log source → Enable as Trial
  4. Configure Cloud SIEM Index (default: 450 days retention)
  5. Detection rules are automatically applied to incoming source:fsxn logs

Workflow Automation

When a critical monitor fires, the Workflow automatically triggers response actions:

Monitor Alert → @workflow-fsxn-security-alert-response → Slack notification + Investigation links

The workflow (fsxn-security-alert-response) is linked to monitors via the @workflow-<handle> mention syntax in the monitor message. No additional configuration needed — just add the mention to any monitor's notification body.
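When monitors are managed as code, the mention can be appended programmatically so it is never forgotten. A minimal sketch, where `with_workflow` is a hypothetical helper and the monitor is trimmed to two fields:

```python
def with_workflow(monitor: dict, handle: str) -> dict:
    """Return a copy of the monitor whose message mentions @workflow-<handle> once."""
    mention = f"@workflow-{handle}"
    message = monitor.get("message", "")
    if mention not in message:
        message = f"{message}\n\n{mention}".strip()
    return {**monitor, "message": message}


monitor = {"name": "[FSxN] Mass File Deletion Detected", "message": "## Mass File Deletion Alert"}
linked = with_workflow(monitor, "fsxn-security-alert-response")
print(linked["message"])
```

The helper is idempotent, so re-running a setup script does not stack duplicate mentions.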

Extensions (via Datadog Workflow builder):

  • Auto-create Jira ticket with investigation context
  • Query Active Directory for user's manager
  • Invoke fsxn-snapshot-remediation Lambda for evidence preservation (deployed — see Automated Snapshot Remediation)

Investigation Notebook

A pre-built 5-step investigation template guides analysts through alert triage:

| Step | Content | Widget |
|---|---|---|
| 1. Identify Scope | Event type distribution timeline | Timeseries (bars) |
| 2. User Timeline | Delete events by user over time | Timeseries (line) |
| 3. Affected Files | Top 20 deleted paths | Top List |
| 4. Client IP Analysis | Top 10 source IPs | Top List |
| 5. Conclusion | Root cause, impact, action template | Markdown |

Access at: Notebooks → "FSxN Audit Log Investigation Template"

Each step includes pre-configured queries — analysts just adjust the time window and user filter to match the alert context.


RBAC and Access Control

The FSxN Security Analyst role provides scoped access to audit logs:

| Role | Permissions | Use Case |
|---|---|---|
| FSxN Security Analyst | logs_read_data, logs_read_index_data | SOC analysts investigating file access |
| Datadog Standard Role | Full access | Storage admins managing pipeline |

Assign users to this role to grant log access without exposing infrastructure metrics or APM data. Combined with Saved Views, analysts see only the investigation tools relevant to their role.


OTel Collector Bridge

For teams already running OpenTelemetry Collector, an alternative delivery path routes FSxN logs through OTel with trace context injection:

# otel-bridge/collector-config.yaml (excerpt)
receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }

processors:
  resource:
    attributes:
      - key: service.name
        value: ontap-audit
        action: upsert
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["ddsource"], "fsxn")

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE:-ap1.datadoghq.com}

Benefits over direct Lambda→Datadog:

  • trace_id injection for distributed tracing correlation
  • Multi-backend fanout (Datadog + Grafana + S3 in parallel)
  • Attribute enrichment at collector level (team, environment tags)
  • Sampling for high-volume environments
  • Edge-side PII redaction — Mask sensitive fields before they leave your AWS account (stronger than relying solely on Datadog's Sensitive Data Scanner)

Edge redaction pattern: Use OTel's transform processor to hash user and client_ip and truncate path to directory level BEFORE export to Datadog. This ensures PII never crosses your network boundary — even if Datadog's SDS configuration is misconfigured or disabled, your data is already protected at the source.
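The same transformation can be expressed in Python, for example inside the Lambda handler when no Collector is deployed. This sketch swaps the OTel transform processor for plain code: it uses a keyed HMAC rather than a bare hash so values cannot be reversed by guessing usernames, and `redact_event`, `pseudonymize`, and the key are hypothetical.

```python
import hashlib
import hmac
import posixpath


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: stable for correlation, not reversible without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def redact_event(event: dict, key: bytes) -> dict:
    """Hash user and client_ip, and truncate path to its directory."""
    out = dict(event)
    out["user"] = pseudonymize(event["user"], key)
    out["client_ip"] = pseudonymize(event["client_ip"], key)
    out["path"] = posixpath.dirname(event["path"])
    return out


event = {
    "user": "CORP\\contractor-ext-03",
    "client_ip": "10.0.12.34",
    "path": "/share/engineering/designs/prototype-v3.dwg",
}
safe = redact_event(event, key=b"example-key-store-the-real-one-in-secrets-manager")
print(safe)
```

Because the hash is deterministic, per-user monitors and anomaly baselines still group correctly on the pseudonymized value.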

Status: Config syntax verified. Integration testing requires deploying OTel Collector on ECS (see Part 7 for the ECS-based Collector pattern used with Grafana and Honeycomb).


Log Archives and Compliance

For regulatory retention (adjust based on your organization's audit policy), a dedicated CloudFormation template deploys an S3 archive with Glacier lifecycle:

aws cloudformation deploy \
  --template-file integrations/datadog/template-log-archive.yaml \
  --stack-name fsxn-datadog-archive \
  --parameter-overrides \
    DatadogExternalId=<from-datadog-integration-page> \
    RetentionDays=30 \
    GlacierRetentionDays=2555 \
  --capabilities CAPABILITY_NAMED_IAM

This separates concerns:

  • Detection — Hot logs in Datadog (15-30 day index retention)
  • Compliance — Cold archive in S3 → Glacier (7+ years)
  • Investigation — Rehydrate specific time ranges on demand

Anomaly Detection (ML-based)

Beyond static thresholds, the anomaly monitor on fsxn.audit.delete_count uses Datadog's agile algorithm to learn each user's baseline:

# Anomaly monitor — no manual threshold needed
"avg(last_4h):anomalies(
    sum:fsxn.audit.delete_count{*} by {user}.as_count(),
    'agile', 3, direction='above'
) >= 1"

When a user who normally deletes 5 files/day deletes 200 spread across a day, the alert fires even though no single 5-minute window ever reaches the static threshold of 50. This catches slow, sustained deletion that threshold-based monitors miss.

Baseline period: The anomaly algorithm needs ~2 weeks of data to build confidence. Deploy early — it stays silent during the learning period.
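A quick worked example shows why the static monitor stays silent in the slow-deletion case. The 200 deletions and the 5 files/day baseline come from the text above. Spreading them evenly over an 8-hour working day is an assumption made for illustration.

```python
STATIC_THRESHOLD = 50   # deletes per 5-minute window (Monitor 1)
BASELINE_PER_DAY = 5    # the user's usual daily deletions

deletes_today = 200
active_hours = 8        # assumption: deletions spread evenly over a working day

windows = active_hours * 60 // 5                      # 96 five-minute windows
per_window = deletes_today / windows                  # about 2.08 deletes per window
static_fires = per_window > STATIC_THRESHOLD          # False: far below 50 per window
ratio_to_baseline = deletes_today / BASELINE_PER_DAY  # 40x the usual daily volume
print(windows, round(per_window, 2), static_fires, ratio_to_baseline)
```

About two deletions per window never trips the threshold, while a 40x jump over the learned baseline is the kind of deviation an anomaly monitor is designed to flag.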


Cardinality Management

⚠️ Log-based metrics with group_by: user create one time series per unique user. For organizations with 10,000+ users:

| Metric | Recommended group_by | Reason |
|---|---|---|
| fsxn.audit.event_count | svm only | Broad metric, low cardinality |
| fsxn.audit.delete_count | user, svm | Targeted, high signal |
| fsxn.audit.access_failure_count | user, svm, client_ip | Investigation-focused |
| fsxn.audit.unique_users | user | Intentionally per-user |

Monitor cardinality in Metrics Summary (fsxn.audit.*). Datadog bills per unique tag-value combination — keep per-user grouping only on metrics where individual user behavior matters.
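A rough upper bound on the number of time series is the product of distinct values across the group-by tags. The environment sizes below (4 SVMs, 2,000 client IPs) are hypothetical, and real cardinality is usually lower because not every combination occurs.

```python
from math import prod


def max_series(unique_values: dict) -> int:
    """Upper bound: product of distinct values across all group_by tags."""
    return prod(unique_values.values())


# Hypothetical environment: 10,000 users, 4 SVMs, 2,000 client IPs.
by_svm = max_series({"svm": 4})
by_user_svm = max_series({"user": 10_000, "svm": 4})
by_user_svm_ip = max_series({"user": 10_000, "svm": 4, "client_ip": 2_000})
print(by_svm, by_user_svm, by_user_svm_ip)
```

The jump from 4 to 40,000 to a theoretical 80,000,000 is why client_ip grouping belongs only on the failure metric, where it earns its cost during investigations.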


What's Different from Splunk/CrowdStrike?

| Aspect | Datadog | Splunk (Part 8) | CrowdStrike (Part 15) |
|---|---|---|---|
| Delivery protocol | Logs API v2 (JSON) | HEC (JSON) | HEC (JSON) |
| Pipeline config | API-driven (Python) | props.conf / UI | LogScale parser YAML |
| Monitor creation | API (POST /api/v1/monitor) | Saved searches + alerts | CQL + alert actions |
| EventID mapping | Category Processor | eval/lookup | LogScale parser |
| Identity correlation | usr.id attribute | user field | Falcon Identity |
| Free tier | ❌ (paid only for logs) | | |
| API management | ✅ Full (Pipeline + Monitor + Dashboard) | Partial | Limited |

The key Datadog advantage: everything is API-manageable. Pipeline, monitors, dashboards, and saved views can all be created, updated, and version-controlled through the Datadog API — making infrastructure-as-code for observability fully achievable.


Production Recommendations

  1. Use the API, not the UI — Version-control your Pipeline, Monitor, and Dashboard definitions in a setup script. When you need to adjust thresholds or add EventIDs, it's a code change with review.

  2. Separate APP key from API key — The API key ships logs; the APP key manages configuration. Store both in Secrets Manager with different IAM policies.

  3. Exclude service accounts: Add -@user:svc-* to monitor queries or create Datadog suppression rules for known batch accounts.

  4. Start with Warning, promote to Critical — Deploy monitors with Warning thresholds first, observe for 1-2 weeks, then tighten to Critical after confirming the baseline.

  5. Connect to incident workflow — Route Critical monitors to PagerDuty/Slack and include investigation links (Saved View URLs) in the alert message.

  6. Enable Log Archives for compliance — Configure S3 archive with Glacier lifecycle for FISC/SOC2 retention requirements. Rehydrate on demand for historical investigation.

  7. Use Workflow Automation for response — Configure Datadog Workflows to automatically create Jira tickets, trigger Slack quick-actions, or invoke Lambda remediation when critical monitors fire.

  8. Manage with Terraform for production — For enterprise/multi-org deployments, use the Datadog Terraform Provider to version-control Pipelines, Monitors, and Detection Rules. Separate API key (logs_write only) from APP key (admin-level, CI/CD pipeline only).

  9. Apply IAM Permissions Boundary — In large organizations, apply a Permissions Boundary to the log-shipping Lambda role to prevent privilege escalation. The boundary should allow only s3:GetObject, s3:ListBucket, secretsmanager:GetSecretValue, and logs:* (CloudWatch).

  10. Encrypt with customer-managed keys (CMK) — For regulated environments, enable Datadog Log Management Encryption with your KMS CMK, and configure the S3 archive bucket with SSE-KMS. This ensures audit logs are encrypted with keys you control at every stage.

  11. Tune detection rules (anti-alert-fatigue): Run all new Cloud SIEM rules in "Warning / dev" status for 2 weeks. Exclude known service accounts with -@usr.id:svc-* in queries. Review false positives weekly and adjust thresholds or add suppression rules before promoting to Critical.

  12. Data integrity, handle HTTP 429/5xx: Datadog Logs API v2 returns HTTP 202 when a payload is accepted (accepted does not mean indexed). On HTTP 429 (rate limit) or 5xx (transient), retry with exponential backoff (base 1s, max 5 retries). After the final failure, route the complete payload to the SQS DLQ for replay. Never drop logs silently.
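The retry policy in item 12 can be sketched as follows. The transport and the DLQ are injected as callables so the logic is testable without network access. `ship_with_retry`, `send`, and `to_dlq` are hypothetical names, and in the Lambda they would wrap the HTTP POST and an SQS send respectively.

```python
import time


def ship_with_retry(send, payload, to_dlq, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """send(payload) returns an HTTP status; to_dlq(payload) runs after the final failure."""
    for attempt in range(max_retries + 1):
        status = send(payload)
        if status == 202:
            return True
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            break  # e.g. 400 or 403: retrying will not help
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s
    to_dlq(payload)  # never drop the batch silently
    return False


# Simulated transport: rate limited, then a transient error, then accepted.
responses = iter([429, 503, 202])
delays, dead_letters = [], []
ok = ship_with_retry(lambda p: next(responses), b"[]", dead_letters.append, sleep=delays.append)
print(ok, delays, dead_letters)
```

With the default settings the worst case adds 31 seconds of sleep, so the Lambda timeout needs to leave room for it.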


Deployment

# 1. Deploy the CloudFormation stack (Lambda + Scheduler + DLQ + Alarms)
aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=arn:aws:s3:<region>:<account>:accesspoint/<name> \
    DatadogApiKeySecretArn=<secret-arn> \
    DatadogSite=ap1.datadoghq.com \
  --capabilities CAPABILITY_NAMED_IAM

# 2. Deploy full observability (Pipeline + Monitors + Metrics + Scanner)
export DD_API_KEY_SECRET_ID="fsxn-datadog-api-key"
export DD_APP_KEY_SECRET_ID="datadog/fsxn-app-key"
export DD_SITE="ap1.datadoghq.com"
bash integrations/datadog/scripts/setup-full-observability.sh

# 3. (Optional) Deploy Log Archive for compliance
aws cloudformation deploy \
  --template-file integrations/datadog/template-log-archive.yaml \
  --stack-name fsxn-datadog-archive \
  --parameter-overrides DatadogExternalId=<external-id> \
  --capabilities CAPABILITY_NAMED_IAM

# 4. Create Facets (guided manual step)
bash integrations/datadog/scripts/setup-facets.sh

# 5. Enable Cloud SIEM (UI: Security → Cloud SIEM → Get Started → select fsxn)

Time from zero to full detection capability: ~45 minutes.

Pre-Deployment Checklist

Before deploying to production, confirm:

  • [ ] FSx audit configuration confirmed (XML format, rotation schedule)
  • [ ] S3 Access Point created and Lambda role authorized
  • [ ] Datadog API Key + APP Key stored in Secrets Manager
  • [ ] Datadog site region confirmed (AP1/US1/EU1)
  • [ ] Data classification sign-off: audit logs contain user PII (usernames, IPs, file paths) — confirm external transmission is approved per organization policy
  • [ ] Retention requirements defined (Datadog index retention + S3 archive lifecycle)
  • [ ] On-call/escalation path defined for Critical signals
  • [ ] Service accounts identified for monitor exclusion (svc-*)
  • [ ] Cloud SIEM trial or paid plan confirmed (for Security Signals)
  • [ ] Network path validated (Lambda running outside a VPC, or a NAT Gateway for S3 AP access)

Lessons Learned

1. API-first beats UI-first

Creating the Pipeline via API took 30 seconds. Doing it through the UI takes 5 minutes of clicking through dropdowns. More importantly, the API approach is repeatable, reviewable, and version-controlled.

2. Facets are UX, not functionality

Every field is searchable with @field:value whether or not you create a Facet. Facets just add sidebar convenience. Don't block on Facet setup — your monitors work without them.

3. Category Processor is the key transformation

Without it, analysts need to memorize Windows EventID codes. With it, @operation_name:"Object Delete" is immediately understandable (the quotes are needed because the value contains a space). This single processor provides more investigative value than any other pipeline step.

4. Evaluation delay prevents false positives

The 60-second evaluation_delay on monitors accounts for ingestion latency. Without it, you get spurious alerts when log batches arrive slightly out of order.
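
As a rough sketch of where this setting lives, the snippet below builds a log monitor definition with `evaluation_delay` set to 60 seconds. The monitor name, message, threshold of 50, and 5-minute window are illustrative assumptions, not the values used in this project.

```python
import json


def build_deletion_monitor(threshold=50, window="5m", delay_seconds=60):
    """Return a log-monitor definition as a dict (illustrative values only)."""
    query = (
        'logs("source:fsxn @operation_name:\\"Object Delete\\"")'
        f'.index("*").rollup("count").last("{window}") > {threshold}'
    )
    return {
        "name": "FSxN: mass file deletion (example)",  # hypothetical name
        "type": "log alert",
        "query": query,
        "message": "Mass deletion suspected. Open the Saved View before acting.",
        "options": {
            # Wait this many seconds before evaluating, to absorb ingestion lag.
            "evaluation_delay": delay_seconds,
            "thresholds": {"critical": threshold},
        },
    }


print(json.dumps(build_deletion_monitor(), indent=2))
```

Keeping the definition in a function like this makes the delay a reviewable, version-controlled parameter rather than a UI setting.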

5. Saved Views are investigation shortcuts

When a monitor fires at 3 AM, the on-call engineer needs to investigate immediately. Pre-built Saved Views with the right columns and filters eliminate setup time during incidents.

6. Anomaly detection needs baseline data

The anomaly monitor requires at least 2 weeks of historical data to build a reliable baseline. Deploy it early — it won't generate false positives during the learning period (it simply stays silent until confident).

7. Cloud SIEM and Log Monitors are complementary

Log Monitors give you fast Slack/PagerDuty notifications for the on-call team. Cloud SIEM Detection Rules give SOC analysts a structured triage workflow with MITRE mapping. Deploy both — they serve different audiences.

8. GeoIP is free threat intelligence

The GeoIP Processor is zero-cost and instantly useful. Even in a pure on-premises environment, unusual geography flags VPN misconfiguration, compromised credentials, or travel you weren't expecting.

9. Forwarder deployment enables an ecosystem

Once the Datadog Forwarder Lambda is deployed, enabling CloudTrail, GuardDuty, Lambda logs, or any other AWS log source is a single S3 notification configuration — no new infrastructure needed.


CloudTrail Correlation (Cross-Service Detection)

A dedicated detection rule correlates FSxN audit events with AWS CloudTrail context. When the signal fires, the investigation message includes pre-built queries for cross-service correlation:

source:cloudtrail @userIdentity.arn:*<username>*     → IAM actions
source:guardduty                                      → GuardDuty findings
source:vpc @network.client.ip:<suspicious-ip>         → Network flows

The rule (FSxN: Suspicious Deletion After IAM Role Assumption) fires on >30 deletions in 10 minutes and embeds a correlation checklist:

  1. Was there a recent AssumeRole or ConsoleLogin from an unusual IP?
  2. Did the user's permissions change in the last 24 hours?
  3. Is the client IP in the corporate VPN range?
  4. Are there GuardDuty findings for this account?

Note: This rule activates fully once CloudTrail logs are flowing to Datadog via the AWS Integration. The FSxN detection works immediately; CloudTrail queries return results after integration setup.
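
The three query templates above could be filled in programmatically when a signal fires. The helper below is a hypothetical sketch: the function name and the choice to strip the AD domain prefix before matching the IAM ARN are assumptions.

```python
def correlation_queries(username, client_ip):
    """Fill the cross-service query templates for one user and client IP."""
    # Assumption: "CORP\\user" should match on the short name only.
    short_name = username.split("\\")[-1]
    return {
        "iam_actions": f"source:cloudtrail @userIdentity.arn:*{short_name}*",
        "guardduty_findings": "source:guardduty",
        "network_flows": f"source:vpc @network.client.ip:{client_ip}",
    }


for label, query in correlation_queries("CORP\\suspicious-user", "198.51.100.7").items():
    print(f"{label}: {query}")
```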


SOC Triage Runbook


A 7-step Notebook guides analysts from signal to resolution:

| Step | Action | Output |
|---|---|---|
| 1. Signal Context | Verify user, IP, time, volume | Timeseries widget |
| 2. User Verification | Check AD status, role changes, maintenance windows | Manual checklist |
| 3. Impact Assessment | Identify affected files and directories | Top List (paths) |
| 4. Cross-Service Correlation | CloudTrail, GuardDuty, VPC Flow Logs | Query templates |
| 5. Decision Matrix | Map conditions to response actions | Decision table |
| 6. Response Actions | Disable account, snapshot, notify, ticket | Checklist |
| 7. Post-Incident | Update thresholds, document, lessons learned | Checklist |

Accessible at: Notebooks → "FSxN Security Signal Triage Runbook"


Case Management


Cases provide structured investigation tracking across the SOC team:

  • Project: FSXN — All FSx for ONTAP security investigations
  • Case creation: Manual from Security Signals, or auto-create via Workflow
  • Priority levels: P1 (critical mass deletion) through P4 (informational)
  • Lifecycle: Open → In Progress → Resolved → Closed

When a critical signal fires, the workflow creates a case automatically with:

  • Signal link and context
  • Affected user and IP
  • Investigation notebook link
  • Response checklist

Threat Intelligence (GeoIP Enrichment)

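The Log Pipeline's GeoIP Processor automatically enriches every FSxN log with geographic data from the client_ip field. Only public addresses resolve to a location; private (RFC 1918) addresses such as 10.x.x.x carry no geographic data:

@client_ip: <public-ip> → @network.client.geoip: { country: "JP", city: "Tokyo", ... }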

A detection rule flags access from unexpected countries:

# Rule: "FSxN: File Access from Unusual Geography"
# Query: source:fsxn -@network.client.geoip.country:JP -@network.client.geoip.country:US
# Threshold: >5 events/15min from non-JP/US IPs

This catches compromised credentials being used from abroad without needing an external threat intelligence feed — Datadog's built-in GeoIP database handles the enrichment at ingest time.


GuardDuty Correlation

A detection rule correlates FSxN deletion events with GuardDuty findings from the same source IP:

FSxN: >10 deletions/30min from IP X
  + GuardDuty finding for IP X (UnauthorizedAccess, Recon, Trojan)
  = Critical Security Signal with full context
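
The correlation itself runs inside Cloud SIEM, but the logic is simple enough to sketch offline. The function below is an illustration only: the event and finding shapes (`client_ip`, `ip`, `type` keys) are assumed, and the input is assumed to be already limited to the 30-minute window.

```python
from collections import Counter


def correlate(deletion_events, guardduty_findings, threshold=10):
    """Return IPs that exceed the deletion threshold AND have a GuardDuty finding."""
    deletions_by_ip = Counter(event["client_ip"] for event in deletion_events)

    findings_by_ip = {}
    for finding in guardduty_findings:
        findings_by_ip.setdefault(finding["ip"], []).append(finding["type"])

    return {
        ip: {"deletions": count, "findings": findings_by_ip[ip]}
        for ip, count in deletions_by_ip.items()
        if count > threshold and ip in findings_by_ip
    }


events = [{"client_ip": "198.51.100.7"}] * 11 + [{"client_ip": "203.0.113.9"}] * 3
findings = [{"ip": "198.51.100.7", "type": "Recon:EC2/PortProbeUnprotectedPort"}]
print(correlate(events, findings))
```

An IP with many deletions but no finding, or a finding but few deletions, is left out, which is what keeps this rule quieter than either signal alone.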

The rule's investigation message includes pre-built GuardDuty queries:

source:guardduty @detail.resource.instanceDetails.networkInterfaces.privateIpAddress:{{@client_ip}}

GuardDuty findings flow to Datadog via the same Forwarder Lambda that handles CloudTrail — no additional setup needed.


Automated Snapshot Remediation

When mass deletion is confirmed (Critical signal reviewed by analyst), the Workflow invokes fsxn-snapshot-remediation Lambda:

Security Signal (Critical) → Analyst confirms → Workflow → Lambda → ONTAP REST API → Snapshot

The Lambda:

  1. Receives volume name, SVM, and reason from the Workflow
  2. Authenticates to ONTAP via Secrets Manager credentials
  3. Creates a timestamped snapshot: remediation_20260614_215530_mass_deletion
  4. Returns snapshot name and status for the Case record

# Lambda invocation payload (from Datadog Workflow)
{
    "volume_name": "finance_share",
    "svm_name": "ProductionSVM",
    "reason": "Mass deletion detected",
    "user": "CORP\\suspicious-user"
}
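
Steps 1 and 3 (validating the payload and deriving the timestamped name) might look like the sketch below. The function names and the slug rule are assumptions; the actual Lambda may build the suffix differently, as the shorter `mass_deletion` suffix in the example name suggests.

```python
import re
from datetime import datetime, timezone

REQUIRED_FIELDS = ("volume_name", "svm_name", "reason")


def snapshot_name(reason, now=None):
    """Build remediation_<YYYYMMDD>_<HHMMSS>_<slug> from a free-text reason."""
    now = now or datetime.now(timezone.utc)
    # Assumed slug rule: lowercase, non-alphanumerics collapsed to underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", reason.lower()).strip("_")
    return f"remediation_{now:%Y%m%d_%H%M%S}_{slug}"


def plan_snapshot(event, now=None):
    """Validate the Workflow payload and return what the Lambda would create."""
    missing = [key for key in REQUIRED_FIELDS if not event.get(key)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return {
        "volume_name": event["volume_name"],
        "svm_name": event["svm_name"],
        "snapshot_name": snapshot_name(event["reason"], now),
    }


print(plan_snapshot({"volume_name": "finance_share", "svm_name": "ProductionSVM",
                     "reason": "Mass deletion"}))
```

Rejecting incomplete payloads before any ONTAP call keeps a malformed Workflow step from creating a snapshot on the wrong volume.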

Configuration: Set ONTAP_MGMT_IP and ONTAP_CREDENTIALS_SECRET_ARN environment variables on the Lambda to point to your FSx for ONTAP management endpoint.

TLS note: The Lambda uses cert_reqs="CERT_NONE" for ONTAP REST API calls because FSx for ONTAP uses self-signed certificates by default, which means the server certificate is not verified. In production, package the ONTAP CA certificate in a Lambda layer (/opt/certs/ontap-ca.pem) and configure urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs="/opt/certs/ontap-ca.pem") to validate the connection.
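
One way to make the production setting hard to get wrong is to fail fast when the CA bundle is missing, rather than quietly falling back to an unverified connection. The helper below is a hypothetical sketch that only prepares the keyword arguments; `ontap_tls_kwargs` is not part of the project.

```python
import os

DEFAULT_CA_PATH = "/opt/certs/ontap-ca.pem"  # path taken from the TLS note


def ontap_tls_kwargs(ca_path=DEFAULT_CA_PATH):
    """Return PoolManager keyword arguments that enforce certificate validation."""
    if not os.path.isfile(ca_path):
        # Refuse to continue instead of silently disabling verification.
        raise FileNotFoundError(f"ONTAP CA bundle not found: {ca_path}")
    return {"cert_reqs": "CERT_REQUIRED", "ca_certs": ca_path}
```

In the handler this would be used as `http = urllib3.PoolManager(**ontap_tls_kwargs())`, so a missing layer surfaces as an explicit error at cold start.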

Snapshot Storm Prevention (Cooldown)

To prevent runaway Snapshot creation during sustained mass-deletion events:

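from urllib.parse import quote, urlencode

# Before creating a snapshot, look up the most recent remediation snapshot.
# ONTAP returns only key fields (name, uuid) unless "fields" is set, so
# create_time is requested explicitly; the space in order_by is encoded as %20.
query = urlencode(
    {"name": "remediation_*", "fields": "create_time",
     "order_by": "create_time desc", "max_records": 1},
    quote_via=quote, safe="*")
existing = http.request("GET",
    f"https://{mgmt_ip}/api/storage/volumes/{vol_uuid}/snapshots?{query}",
    headers=base_headers)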

if existing.status == 200:
    snaps = json.loads(existing.data).get("records", [])
    if snaps:
        last_snap_time = snaps[0].get("create_time", "")
        # Skip if a remediation snapshot was created in the last 15 minutes
        if is_within_cooldown(last_snap_time, minutes=15):
            return {"statusCode": 200, "body": "Skipped — cooldown active"}

ONTAP limits: Each volume supports up to 1023 Snapshots. The cooldown prevents hitting this limit during sustained attack scenarios.
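
The `is_within_cooldown` helper is not shown in the snippet above. A minimal sketch follows, assuming `create_time` is an ISO 8601 string such as `2026-06-14T21:55:30Z` or one with a numeric offset; treating timestamps slightly in the future as still cooling down is an added assumption to tolerate clock skew.

```python
from datetime import datetime, timedelta, timezone


def is_within_cooldown(create_time, minutes=15, now=None):
    """Return True if create_time falls inside the last `minutes` minutes."""
    if not create_time:
        return False  # no usable timestamp: do not block the snapshot
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    created = datetime.fromisoformat(create_time.replace("Z", "+00:00"))
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # assume UTC if naive
    now = now or datetime.now(timezone.utc)
    # A timestamp ahead of "now" (clock skew) also counts as within cooldown.
    return now - created < timedelta(minutes=minutes)
```

Returning False on an empty timestamp errs on the side of taking the snapshot, which is the safer failure mode during a suspected mass deletion.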

Remediation Audit Trail

The Lambda invocation itself must be auditable — proving that the Snapshot was created by an authorized pipeline, not a rogue actor:

  • CloudTrail: Lambda invocation recorded with InvokedBy: workflow.datadoghq.com
  • ONTAP audit log: Snapshot creation appears as an administrative event with the API user
  • Datadog Case: Snapshot name and status recorded in the Case timeline
  • Lambda CloudWatch Logs: Full request/response logged with correlation ID

What's Next

This article covered the complete detection-to-response lifecycle. Future enhancements:

  • Datadog Threat Intel feeds — When Datadog makes the Threat Intel Indicators API available on AP1, feed internal IP reputation lists for richer enrichment
  • Multi-SVM snapshot orchestration — Extend the remediation Lambda to snapshot across multiple SVMs in parallel
  • Automated account lockout — Chain the Workflow to invoke AD lockout via Systems Manager Run Command
