AI-Powered Root Cause: Correlating File Access with APM via Dynatrace

TL;DR

We built a serverless Lambda pipeline that ships FSx for ONTAP audit logs to Dynatrace via the Log Ingest API v2. The real value: Dynatrace's Davis AI can automatically correlate file access anomalies with application performance degradation — answering "why is the app slow?" with "because 500 users hit the same NFS share simultaneously."

FSx for ONTAP → S3 Access Point → EventBridge Scheduler → Lambda → Dynatrace Log Ingest API v2
                                                                         │
                                                                         ▼
                                                                    Davis AI
                                                              ┌───────────────────┐
                                                              │ Correlates:       │
                                                              │ • File access     │
                                                              │   anomalies       │
                                                              │ • APM metrics     │
                                                              │ • Infrastructure  │
                                                              │   health          │
                                                              │                   │
                                                              │ → Root cause      │
                                                              │   in seconds      │
                                                              └───────────────────┘

Verified on Dynatrace SaaS Trial (Tokyo-equivalent region). Logs visible in Logs Viewer within 1-2 minutes.

This is Part 11 of the Serverless Observability for FSx for ONTAP series.


Why Dynatrace for FSx for ONTAP?

Most observability tools treat storage logs as isolated data. Dynatrace is different — it builds a topology map of your entire stack and uses Davis AI to find causal relationships through time-window correlation and entity connectivity:

| Scenario | Without Dynatrace | With Dynatrace |
| --- | --- | --- |
| App latency spike | "Check the logs" | Davis AI detects temporal correlation: file access to /vol/data/ increased 10x within the same 5-minute window as app response time degradation, connected via topology (app → NFS mount → SVM) |
| Storage I/O anomaly | Manual investigation | Automatic correlation via shared topology entities; Davis identifies which services are affected based on entity relationships |
| User reports slow file access | Grep through audit logs | DQL query + topology view showing the full dependency path from user request to storage operation |

The key differentiator: Davis AI correlates events across entities that share topology connections within overlapping time windows — not just keyword matching or manual dashboard correlation.
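
To make "overlapping time windows" concrete, here is a toy sketch of the idea. It is explicitly not how Davis AI works internally, and it ignores topology entirely. It only buckets two event streams into 5-minute windows and reports the windows where both exceed a multiple of their baseline. The function names, the baselines, and the "slow requests" stream are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

WINDOW_SECONDS = 300  # 5-minute windows, matching the example in the table


def bucket_counts(timestamps: Iterable[float]) -> Dict[int, int]:
    """Count events per window, where window = epoch seconds // 300."""
    return dict(Counter(int(ts // WINDOW_SECONDS) for ts in timestamps))


def overlapping_spikes(
    file_ops: Dict[int, int],
    slow_requests: Dict[int, int],
    file_baseline: float,
    slow_baseline: float,
    factor: float = 10.0,
) -> List[int]:
    """Return windows where both series reach factor x their baseline."""
    shared = file_ops.keys() & slow_requests.keys()
    return sorted(
        w for w in shared
        if file_ops[w] >= factor * file_baseline
        and slow_requests[w] >= factor * slow_baseline
    )
```

A window returned here only says "both things spiked at the same time". The topology link (app → NFS mount → SVM) is what lets Dynatrace treat that coincidence as a candidate cause rather than noise.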

Architecture

┌─────────────────────────────────────────────────────────┐
│ Event Sources                                           │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  EventBridge Scheduler                                  │
│  rate(5 minutes) ──→ Lambda                             │
│                       │ lists new files via             │
│                       │ S3 Access Point                 │
│                       │ (checkpoint in SSM)             │
│                       ▼                                 │
│           Dynatrace Log Ingest API v2                   │
│           (Api-Token auth)                              │
│                       │                                 │
│  EMS Webhook          │                                 │
│  ──→ API GW ──→ Lambda ─────────────┤                   │
│     (ems_handler)                   │                   │
│                                     ▼                   │
│  FPolicy                       Dynatrace                │
│  ──→ ECS Fargate ──→ SQS      (Logs Viewer,             │
│  ──→ Bridge Lambda              Davis AI,               │
│  ──→ EventBridge                DQL,                    │
│  ──→ Lambda (fpolicy_handler)   Dashboards)             │
│  ──────────────────────────────────────────────────────┤│
└─────────────────────────────────────────────────────────┘
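
The "list new files, remember where you stopped" step in the poller can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: `select_new_keys` is a hypothetical helper, and it assumes object keys sort lexicographically in arrival order (as date-partitioned keys such as `audit/2026/01/15/audit-001.json` would). In the real Lambda the key list would come from the S3 Access Point and the checkpoint from SSM.

```python
from typing import Iterable, List, Optional, Tuple


def select_new_keys(
    keys: Iterable[str], checkpoint: Optional[str]
) -> Tuple[List[str], Optional[str]]:
    """Return (keys newer than the checkpoint, next checkpoint).

    Assumes keys sort lexicographically in arrival order
    (hypothetical layout: audit/YYYY/MM/DD/audit-NNN.json).
    """
    ordered = sorted(set(keys))
    if checkpoint is not None:
        ordered = [k for k in ordered if k > checkpoint]
    # Keep the old checkpoint when nothing new was listed.
    next_checkpoint = ordered[-1] if ordered else checkpoint
    return ordered, next_checkpoint
```

Persisting `next_checkpoint` only after Dynatrace has accepted the batch means a failed run is retried on the next 5-minute tick instead of silently skipping files.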

Davis AI: The Correlation Engine

When you ship FSx for ONTAP logs to Dynatrace alongside your APM data, Davis AI can detect patterns like:

  1. Storage contention → App slowdown: Spike in file operations correlates with increased response times
  2. Ransomware activity → Service impact: ARP (Autonomous Ransomware Protection) EMS events correlate with unusual file encryption patterns
  3. Quota exhaustion → Write failures: ONTAP quota warnings correlate with application write errors

This works once each FSx for ONTAP SVM is represented as a custom device entity in the Dynatrace topology and connected to the applications that access it (see the prerequisites under Davis AI Integration Pattern).

Quick Start (30 Minutes)

1. Create Dynatrace API Token

  1. Log in to your Dynatrace environment
  2. Go to Access Tokens (Settings → Integration → Access tokens)
  3. Create a token with scope: logs.ingest
  4. Token format: dt0c01.<TOKEN_ID>.<TOKEN_SECRET>

2. Store Credentials

aws secretsmanager create-secret \
  --name "dynatrace/fsxn-api-token" \
  --secret-string '{"api_token":"dt0c01.XXXXXXXX.YYYYYYYY"}' \
  --region ap-northeast-1
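
On the Lambda side, reading that secret and turning it into request headers might look like the sketch below. The helper names are hypothetical and the repository's handler may differ. The JSON key `api_token` and the secret name come from the command above, and the `Api-Token` scheme is the one named in the architecture diagram.

```python
import json
from typing import Dict


def parse_api_token(secret_string: str) -> str:
    """Extract the token from the SecretString JSON stored above."""
    token = json.loads(secret_string)["api_token"]
    if not token.startswith("dt0c01."):
        raise ValueError("unexpected Dynatrace token prefix")
    return token


def ingest_headers(token: str) -> Dict[str, str]:
    """Headers for the Log Ingest API v2 (Api-Token auth)."""
    return {
        "Authorization": f"Api-Token {token}",
        "Content-Type": "application/json; charset=utf-8",
    }


def load_api_token(
    secret_id: str = "dynatrace/fsxn-api-token",
    region: str = "ap-northeast-1",
) -> str:
    """Fetch the secret at runtime; boto3 is imported lazily."""
    import boto3  # bundled with the AWS Lambda Python runtime

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return parse_api_token(response["SecretString"])
```

Calling `load_api_token()` once per cold start and caching the result avoids a Secrets Manager round trip on every invocation.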

3. Deploy CloudFormation Stack

aws cloudformation deploy \
  --template-file integrations/dynatrace/template.yaml \
  --stack-name fsxn-dynatrace-integration \
  --parameter-overrides \
    S3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    DynatraceApiTokenSecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:dynatrace/fsxn-api-token-XXXXXX \
    DynatraceEnvUrl=https://abc12345.live.dynatrace.com \
    S3BucketName=my-fsxn-audit-bucket \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1

4. Verify in Dynatrace

Navigate to Logs → View logs → Run query:

fetch logs
| filter log.source == "fsxn-ontap"

Logs should appear within 1-2 minutes.

Log Entry Format

Each audit log event is shipped with structured attributes for DQL querying:

{
  "content": "{\"EventID\":\"4663\",\"UserName\":\"admin@corp.local\",...}",
  "log.source": "fsxn-ontap",
  "dt.source_entity": "CUSTOM_DEVICE-fsxn-svm-prod-01",
  "timestamp": "2026-01-15T12:00:00Z",
  "severity": "info",
  "fsxn.svm": "svm-prod-01",
  "fsxn.operation": "ReadData",
  "fsxn.user": "admin@corp.local",
  "fsxn.path": "/vol/data/file.txt",
  "fsxn.s3_key": "audit/2026/01/15/audit-001.json"
}

The dt.source_entity field links logs to a custom device in Dynatrace's topology, enabling Davis AI correlation.
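
As a rough sketch of how one raw audit event could be mapped to that entry: `build_log_entry` is a hypothetical helper, and the raw keys `Operation`, `ObjectName`, and `TimeCreated` are assumptions for illustration. Only `EventID` and `UserName` appear in the sample `content` above.

```python
import json
from typing import Any, Dict


def build_log_entry(event: Dict[str, Any], svm: str, s3_key: str) -> Dict[str, str]:
    """Map one raw audit event to the structured entry shown above.

    Raw keys "Operation", "ObjectName" and "TimeCreated" are assumed;
    adjust them to the actual audit log schema.
    """
    entry = {
        # Keep the original event as compact JSON in "content".
        "content": json.dumps(event, separators=(",", ":")),
        "log.source": "fsxn-ontap",
        "dt.source_entity": f"CUSTOM_DEVICE-fsxn-{svm}",
        "severity": "info",
        "fsxn.svm": svm,
        "fsxn.operation": str(event.get("Operation", "")),
        "fsxn.user": str(event.get("UserName", "")),
        "fsxn.path": str(event.get("ObjectName", "")),
        "fsxn.s3_key": s3_key,
    }
    # Only send a timestamp when the event actually carries one.
    if event.get("TimeCreated"):
        entry["timestamp"] = str(event["TimeCreated"])
    return entry
```

Promoting user, path, and operation to top-level attributes is what lets the DQL queries filter on `fsxn.*` fields instead of parsing `content` at query time.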

DQL Query Examples

The structured attributes above can be filtered and aggregated directly in Dynatrace Query Language (DQL):

Basic Investigation

// All failed file access attempts (assumes the shipper also sets fsxn.result,
// which is not shown in the sample entry above)
fetch logs
| filter log.source == "fsxn-ontap"
| filter fsxn.result == "Failure"
| summarize count(), by: {fsxn.user, fsxn.path}

// Top operations by volume
fetch logs
| filter log.source == "fsxn-ontap"
| summarize ops = count(), by: {fsxn.operation}
| sort ops desc

// Access timeline for a specific SVM
fetch logs
| filter fsxn.svm == "svm-prod-01"
| makeTimeseries count(), interval: 5m

APM Correlation Queries

// File access volume in 5-minute buckets. Place this tile next to a
// service response time tile in a dashboard to compare them side by side.
fetch logs
| filter log.source == "fsxn-ontap"
| makeTimeseries file_ops = count(), interval: 5m

// Find users causing the most I/O during a performance incident
fetch logs
| filter log.source == "fsxn-ontap"
| filter timestamp >= now() - 1h
| summarize ops = count(), by: {fsxn.user}
| sort ops desc
| limit 10

Security Queries

// Detect potential ransomware (mass file modifications):
// minutes with more than 100 write or delete operations
fetch logs
| filter log.source == "fsxn-ontap"
| filter fsxn.operation == "WriteData" OR fsxn.operation == "Delete"
| summarize write_ops = count(), by: {minute = bin(timestamp, 1m)}
| filter write_ops > 100

// After-hours access (hours are evaluated in UTC)
fetch logs
| filter log.source == "fsxn-ontap"
| filter getHour(timestamp) < 7 OR getHour(timestamp) > 19
| summarize count(), by: {fsxn.user, fsxn.path}

Deployment Options

| Deployment | URL Format | Data Location |
| --- | --- | --- |
| SaaS | https://<env-id>.live.dynatrace.com | Dynatrace-managed (region-specific) |
| Managed | https://<your-domain>/e/<env-id> | Your infrastructure |
| ActiveGate | https://<host>:9999/e/<env-id> | Your network (proxy) |

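For data sovereignty requirements, Dynatrace Managed keeps all data within your infrastructure. An ActiveGate keeps the ingest traffic inside your network, but it is a proxy: the logs are stored in whichever environment (SaaS or Managed) the ActiveGate forwards to.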

Cost Analysis

Dynatrace pricing is based on Davis Data Units (DDU):

| Monthly Log Volume | DDU/month (est.) | Monthly DDU Cost |
| --- | --- | --- |
| 1 GB | ~1 DDU | Minimal (within base allocation) |
| 10 GB | ~10 DDU | ~$25/month (at $2.50/DDU) |
| 100 GB | ~100 DDU | ~$250/month |

| Component | Monthly Cost (10 GB/month) |
| --- | --- |
| Lambda (5-min polling) | ~$3 |
| EventBridge Scheduler | ~$1 |
| Secrets Manager | ~$1 |
| Dynatrace DDU | ~$25 |
| Total | ~$30 |

DDU pricing varies by contract. The 14-day trial includes generous DDU allocation for validation. Check your license terms for production estimates.
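
For a quick what-if at other volumes, the arithmetic behind the tables can be written down directly. This sketch takes the figures above at face value (about 1 DDU per GB, $2.50 per DDU, roughly $5 of fixed AWS cost) and treats them as defaults to override with your own contract numbers. It does not model the base allocation that makes the 1 GB row "minimal".

```python
def monthly_cost_usd(
    log_gb: float,
    ddu_per_gb: float = 1.0,     # estimate implied by the table above
    usd_per_ddu: float = 2.50,   # example rate from the table; contract-specific
    aws_fixed_usd: float = 5.0,  # Lambda ~$3 + Scheduler ~$1 + Secrets Manager ~$1
) -> float:
    """Estimated monthly cost: DDU charge plus fixed AWS components."""
    return log_gb * ddu_per_gb * usd_per_ddu + aws_fixed_usd
```

With the defaults, `monthly_cost_usd(10)` reproduces the ~$30 total in the component table.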

Gotchas & Lessons Learned

| # | Discovery | Impact |
| --- | --- | --- |
| 1 | API returns HTTP 204 on success (not 200) | Lambda must treat 204 as success |
| 2 | Trial environment has 1-2 minute ingestion lag | Wait before checking Logs Viewer |
| 3 | logs.ingest scope is required; ReadConfig/WriteConfig won't work | Token creation must select correct scope |
| 4 | logs.read scope needed separately for API-based queries | Create a second token for automation |
| 5 | Log entries older than 24 hours may be rejected | Use current timestamps in test data |
| 6 | Max 1MB per request (smallest batch limit in this series) | Lambda splits large batches |
| 7 | Firehose delivery requires ActiveGate (not direct to SaaS) | Use Lambda direct for simplicity |

Davis AI Integration Pattern

To get the most from Davis AI correlation, all three prerequisites must be in place:

  1. Ship FSx for ONTAP logs (this integration) — with dt.source_entity field set
  2. Deploy OneAgent on application hosts that access FSx for ONTAP via NFS/SMB — this creates the application-side topology
  3. Create custom device for each SVM (dt.source_entity) — this creates the storage-side topology node. Use the Entity API (POST /api/v2/entities/custom) or Settings API to pre-create the device entity before first log ingestion

Prerequisites for correlation: Davis AI correlation only activates when all three components are connected in the topology. Without OneAgent on the application hosts, Davis AI cannot establish the causal link between file access patterns and application performance. The custom device entity must use a consistent naming convention (e.g., CUSTOM_DEVICE-fsxn-{svm-name}) across all log entries.
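
Because a single mismatched entity ID quietly breaks the topology link, a pre-flight check before shipping a batch can be worthwhile. The sketch below is one way to do it. Both helper names are hypothetical, and the naming pattern is the `CUSTOM_DEVICE-fsxn-{svm-name}` example from the paragraph above.

```python
from typing import Any, Dict, Iterable, List


def entity_id_for(svm_name: str) -> str:
    """Naming convention from the text: CUSTOM_DEVICE-fsxn-{svm-name}."""
    return f"CUSTOM_DEVICE-fsxn-{svm_name}"


def inconsistent_entries(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return entries whose dt.source_entity does not match their fsxn.svm."""
    return [
        e for e in entries
        if e.get("dt.source_entity") != entity_id_for(str(e.get("fsxn.svm", "")))
    ]
```

An empty result means every entry in the batch points at the entity its SVM name implies. Anything else is worth logging before the batch is sent.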

Application (OneAgent) ──→ NFS/SMB ──→ FSx for ONTAP (SVM)
       │                                      │
       │ APM metrics                          │ Audit logs
       ▼                                      ▼
                    Dynatrace Davis AI
                    (automatic correlation)

Production Readiness

This integration follows the project's Production Readiness Levels:

| Level | What You Get | Go/No-Go to Next |
| --- | --- | --- |
| Level 1 (this Quick Start) | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2 | + DQL dashboards + alerts | SLOs met 7 days, security review done |
| Level 3 | + DynamoDB ledger + Davis AI correlation | SLOs met 30 days, compliance pack |
| Level 4 | + OTel Collector + redaction + OneAgent | Multi-backend, PII redaction, full topology |

Data classification: Dynatrace receives fsxn.user and fsxn.path fields (PII/sensitive). Dynatrace SaaS environments are region-specific, so select a region matching your data residency requirements. For Managed deployments, data stays in your infrastructure; an ActiveGate only proxies traffic to the environment behind it. See Data Classification Guide.

Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook

CloudFormation Templates

| Template | Purpose | Key Parameters |
| --- | --- | --- |
| template.yaml | FSx audit log poller | S3AccessPointArn, DynatraceApiTokenSecretArn, DynatraceEnvUrl |
| template-ems.yaml | EMS webhook handler | DynatraceApiTokenSecretArn, DynatraceEnvUrl |
| template-fpolicy.yaml | FPolicy EventBridge handler | DynatraceApiTokenSecretArn, DynatraceEnvUrl, EventBusName |

Questions about the Dynatrace integration or Davis AI correlation? Drop a comment below.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
