Alerting on FSx for ONTAP Audit Logs — No Metric Filter Required, with CloudWatch Log Alarms

TL;DR

Once FSx for ONTAP audit logs land in CloudWatch Logs, a common next question is: "How do I get alerted when someone touches a specific file?" Until recently you had to build a metric filter first. As of July 1, 2026, CloudWatch Log Alarms let you create an alarm directly from a Logs Insights query — no metric filter step.

  • CloudWatch Log Alarm (GA 2026-07-01): alarm straight from a log query. No metric filter.
  • How it works: a Logs Insights query matches events, count(*) turns them into a number, and the alarm fires when the count crosses a threshold.
  • 5 detection presets: sensitive-path access / failed access / bulk delete / privileged-user activity / custom.
  • One-command deploy: bash shared/scripts/deploy-log-alarm.sh (working in ~15 minutes).
  • Cost: ~$6.6/month (100 MB/day), roughly twice the metric-filter approach at that volume and scaling with log volume, but you get log lines in the notification and retroactive queries.
  • E2E verified: CloudFormation deploy through state transition, in the Tokyo region.

GitHub: Yoshiki0705/fsxn-observability-integrations

This is Part 17 of the Serverless Observability for FSx for ONTAP series. It builds directly on the admin-audit pipeline from Part 14.


Two AWS Launches That Make This Work

This article sits on top of two recent CloudWatch launches. If you're arriving from outside this series, read these first — together they remove the EC2 syslog server and the metric-filter step that used to stand between ONTAP logs and an alert.

1. Managed syslog ingestion (June 2026) — how the logs get in

ONTAP emits its admin audit trail (and EMS events) as syslog. Historically, capturing that meant running an EC2 syslog server. In June 2026, CloudWatch Logs added managed syslog ingestion: you send syslog (RFC 5424 / RFC 3164 / Cisco FTD/ASA) over a VPC endpoint straight into a log group — no agent, no EC2.

For FSx for ONTAP, that means pointing ONTAP log-forwarding at the syslog VPC endpoint and having the admin audit log arrive in CloudWatch Logs directly.

This is the piece Part 14 (Syslog VPCE setup) already wired up. If you followed it, your admin audit log is in CloudWatch Logs and this article's alarm has something to query.

2. CloudWatch Log Alarms (July 2026) — how you alert on them

The second launch is the subject of this article: alarms defined by a Logs Insights query rather than a metric.

Put together: ONTAP → managed syslog ingestion → CloudWatch Logs → Log Alarm → SNS. No EC2, no metric filter, no forwarding Lambda.


What Changed

Alerting on log content is a basic monitoring need. On CloudWatch it used to take a detour of three steps: create a metric filter, generate a custom metric from it, then attach an alarm to that metric. Getting from "I want to detect this" to an alert actually firing took 15–30 minutes of setup.

Log Alarms collapse the middle steps. You give a Logs Insights query a threshold, and the alarm reads straight from the logs.

| Aspect | Before (metric filter) | Log Alarm (NEW) |
|---|---|---|
| Setup steps | 3 (filter → metric → alarm) | 1 (Log Alarm only) |
| Query flexibility | Filter pattern syntax only | Full Logs Insights syntax |
| Log lines in notification | No | Yes (up to 50) |
| Retroactive to existing logs | No | Yes |
| IaC | 2 resources | 1 resource (AWS::CloudWatch::LogAlarm) |

The part I appreciate most: the notification can include the matched log lines. When the alert lands, you already see who touched which file and what they did — you can form a first read before opening the console.


FSx for ONTAP Logs × Log Alarm

There are three FSx for ONTAP log/event families, and it's worth separating them because only some are a natural fit for Log Alarms.

| Log type | What it records | Delivery path | Log Alarm target |
|---|---|---|---|
| Admin audit log | ONTAP management ops (CLI / REST API) | Managed syslog ingestion → CloudWatch Logs | ✅ this article |
| File access audit log | NAS file/folder ops (NFS / SMB) | FSx for ONTAP S3 AP → Lambda | Separate pipeline |
| EMS events | ONTAP system events (capacity / HA / ARP) | Managed syslog ingestion → CloudWatch Logs, or EMS Webhook → Lambda | ✅ when syslog-delivered |

Admin audit log is the "who did what as an administrator" record — logins (success/failure), security login account and role changes, volume create/offline/delete, vserver config changes, and privileged operations like system node systemshell or set -privilege diagnostic.

Two ONTAP settings are easy to conflate. security audit decides what gets recorded: for example, security audit modify -cliget on -httpget on -ontapiget on captures read/GET operations, which are off by default, so without it your sensitive-file-access query on admin ops comes up empty. cluster log-forwarding decides where the log is sent. The exact commands are in the Syslog VPCE setup guide. Note that cluster log-forwarding supports multiple destinations, so you can add CloudWatch alongside an existing on-prem SIEM without cutting over.

File access audit log is the "which user did what to which file" record, enabled via vserver audit create and emitted in Windows Security Event format (EVTX / XML). It's the right source for sensitive-folder access and mass-delete detection — but to use it with a Log Alarm you'd need a separate pipeline to land EVTX/XML into CloudWatch Logs. In this project, that content flows via the FSx for ONTAP S3 Access Point → Lambda path to each vendor instead.

⚠️ Critical scoping (read before you trust a preset name): the presets in this article run against the admin audit log (/syslog/fsxn-admin-audit). That log contains management-plane operations — it does not see end-user file I/O over NFS/SMB. So on this log group:

  • bulk-delete-operations detects admin-plane destructive ops (Snapshot delete, volume offline/delete) — not ransomware encrypting user files over SMB.
  • sensitive-file-access matches a path only when an admin command references it — not when a user opens that file.

For user-file ransomware/mass-delete/sensitive-access detection, use ONTAP ARP (Part 3) and FPolicy / file-access audit (Part 4). The same preset works against a file-access-audit log group if you land that data in CloudWatch Logs — but on /syslog/fsxn-admin-audit it only sees the admin plane.

EMS events (system events)

The third source is EMS (Event Management System) — ONTAP's internal system-event notifications. Where audit logs say "who did it", EMS says "what happened to the system", across seven severities (emergency → debug). Representative events this project normalizes:

| Event | Severity | Meaning |
|---|---|---|
| arw.volume.state / arw.vserver.state | alert | ARP (Autonomous Ransomware Protection) state transition |
| monitor.volume.full / wafl.vol.full | alert | Volume space exhaustion |
| wafl.quota.hardlimit.exceeded | error | Quota hard-limit exceeded |
| cf.fsm.takeoverStarted | alert | HA takeover started |
| net.linkDown | alert | Network link down |

EMS has two routes. This project's EMS Webhook path (HTTPS → API Gateway → Lambda) normalizes and ships events to any vendor/OTLP. Alternatively, ONTAP 9.x can forward EMS over syslog (event notification destination create -syslog ...) to the same CloudWatch Logs syslog VPC endpoint you built for the admin audit log — no separate EC2 syslog server — and then EMS becomes a Log Alarm target directly. If your need is "alert immediately on volume-full or an ARP state change", syslog-to-CloudWatch + a Log Alarm is the shortest path.

Architecture

FSx for ONTAP (ONTAP log-forwarding)
    │ Syslog TCP (TLS 6514)
    ▼
CloudWatch Logs managed syslog ingestion (VPC endpoint)
    │  → /syslog/fsxn-admin-audit
    │ Scheduled Query (5 min)
    ▼
CloudWatch Log Alarm
    │ count(*) > threshold → ALARM
    ▼
SNS → Email / Slack / PagerDuty (with log lines)

The actual log format

Writing a Log Alarm query is easier if you know what the line looks like. A real admin-audit line in CloudWatch Logs:

<190>Jul  2 03:17:37 FsxId...-02: [kern_audit:info:6392]
...:: FsxId...:ssh :: <source-ip>:unknown ::
FsxId...:fsx-control-plane:admin ::
system node systemshell -node * -command "top -d 1 -s 1"
:: Success

It reads like an incantation, but everything is there: when, which protocol (ssh/http), from where (source IP), who (user), what (command), and the outcome (Success/Failure). A query usually just needs one of those.
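To make those fields concrete, here is a minimal Python sketch that splits the sample line on its "::" separators. It assumes every line follows the layout of the sample above; parse_admin_audit and the dictionary keys are hypothetical names of mine, not something the repo or ONTAP defines, and the "FsxId..." and "<source-ip>" placeholders are kept exactly as they appear in the sample.

```python
import re

# The sample line from above, joined onto one line. "FsxId..." and
# "<source-ip>" are the article's own placeholders, kept as-is.
SAMPLE = (
    '<190>Jul  2 03:17:37 FsxId...-02: [kern_audit:info:6392] '
    '...:: FsxId...:ssh :: <source-ip>:unknown :: '
    'FsxId...:fsx-control-plane:admin :: '
    'system node systemshell -node * -command "top -d 1 -s 1" '
    ':: Success'
)


def parse_admin_audit(line: str) -> dict:
    """Split one admin-audit line into its '::'-separated fields.

    Assumes the layout of the sample line. Lines with fewer separators
    raise ValueError, and IPv6 source addresses (which contain '::')
    would need more careful handling.
    """
    # Peel off the first four separators; the rest is "command :: outcome".
    head, proto, source, user, rest = re.split(r"\s*::\s*", line, maxsplit=4)
    # The outcome is whatever follows the last '::' in the remainder.
    command, _, outcome = rest.rpartition("::")
    return {
        "protocol": proto.rsplit(":", 1)[-1],  # e.g. ssh / http
        "source": source.rsplit(":", 1)[0],    # client address
        "user": user.rsplit(":", 1)[-1],       # last segment of the user field
        "command": command.strip(),
        "outcome": outcome.strip(),            # Success / Failure
    }


print(parse_admin_audit(SAMPLE))
```

In a Log Alarm you would not parse the line like this; the point is only to show which substring of the raw @message each filter ends up matching.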


Why "Alert on a String" Actually Means "Count Then Compare"

Worth recalling: a CloudWatch alarm compares a number against a threshold. It is not natively a "fire when this string appears" trigger.

So how does a Log Alarm alert on content? It turns the string into a number first. A Logs Insights query narrows to matching events, count(*) converts them to a count, and the alarm fires when that count crosses the threshold. Read it as string → count → threshold and it clicks.

For example, to detect access to a confidential folder:

fields @timestamp, @message
| filter @message like /\/vol\/data\/confidential/

Set the aggregation to count(*) and the threshold to > 0, and a single access within the 5-minute window flips the alarm to ALARM. "Tell me the moment anyone touches the confidential folder" — done, with just that.
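The string → count → threshold idea can be sketched in a few lines of Python. This is only a local simulation of one evaluation window, not how CloudWatch implements it; evaluate_window and the two log messages are hypothetical and exist purely for illustration.

```python
import re


def evaluate_window(messages, pattern, threshold=0):
    """Simulate one evaluation: filter by pattern, count, compare."""
    # Step 1 and 2: keep matching messages and turn them into a number.
    count = sum(1 for m in messages if re.search(pattern, m))
    # Step 3: the alarm only ever compares that number to the threshold.
    state = "ALARM" if count > threshold else "OK"
    return count, state


# Hypothetical, shortened messages for one 5-minute window.
window = [
    "... :: <some admin command> -vserver svm1 :: Success",
    "... :: <some admin command> -path /vol/data/confidential/plan.txt :: Success",
]

print(evaluate_window(window, r"/vol/data/confidential"))  # (1, 'ALARM')
```

With the threshold at 0, one match is enough; raising it to 10 turns the same query into a "more than ten in a window" detector without touching the filter.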


Deploy

The repo ships a deploy script, so a few environment variables bring up the whole set of resources.

Deploy script (recommended)

# Sensitive-file access detection (auto-creates the SNS topic)
DETECTION_TYPE=sensitive-file-access \
TARGET_PATTERN="/vol/data/confidential" \
CREATE_SNS_TOPIC=true \
SNS_TOPIC_NAME=fsxn-security-alerts \
  bash shared/scripts/deploy-log-alarm.sh

The script creates the SNS topic, deploys the CloudFormation stack, and prints the alarm name and console URL.

CloudFormation template

# shared/templates/cloudwatch-log-alarm.yaml (excerpt)
Resources:
  SensitiveFileAccessAlarm:
    Type: AWS::CloudWatch::LogAlarm
    Properties:
      AlarmName: fsxn-sensitive-file-access
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      QueryResultsToEvaluate: 3
      QueryResultsToAlarm: 1
      TreatMissingData: notBreaching
      ScheduledQueryConfiguration:
        QueryString: |
          fields @timestamp, @message
          | filter @message like /\/vol\/data\/confidential/
        LogGroupIdentifiers:
          - /syslog/fsxn-admin-audit
        ScheduledQueryRoleARN: !GetAtt ScheduledQueryRole.Arn
        AggregationExpression: "count(*)"
        ScheduleConfiguration:
          ScheduleExpression: "rate(5 minutes)"
          StartTimeOffset: 300
      AlarmActions:
        - !Ref AlarmSNSTopic
      ActionLogLineCount: 5
      ActionLogLineRoleArn: !GetAtt LogLineRole.Arn

On QueryResultsToEvaluate / QueryResultsToAlarm (M-of-N): this is the flapping control — 3 / 1 fires fast but can flap on a single spike; 3 / 2 smooths transient blips at ~one extra interval of latency. Tune this before the threshold. Per-use-case recommendations are in the setup guide.
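To see why 3 / 1 and 3 / 2 behave differently, here is a small Python model of M-of-N evaluation. It assumes Log Alarms look at the last N query results and alarm when at least M of them breach, the same way metric alarms treat datapoints; I have not confirmed the exact semantics beyond that, and m_of_n_state is a hypothetical helper.

```python
def m_of_n_state(counts, threshold, to_evaluate, to_alarm):
    """Return ALARM if at least `to_alarm` of the last `to_evaluate`
    query results are above `threshold`, else OK."""
    recent = counts[-to_evaluate:]
    breaching = sum(1 for c in recent if c > threshold)
    return "ALARM" if breaching >= to_alarm else "OK"


# Per-window counts from three consecutive scheduled queries (made-up data).
single_spike = [0, 0, 7]
sustained = [0, 7, 9]

print(m_of_n_state(single_spike, 5, to_evaluate=3, to_alarm=1))  # ALARM
print(m_of_n_state(single_spike, 5, to_evaluate=3, to_alarm=2))  # OK
print(m_of_n_state(sustained, 5, to_evaluate=3, to_alarm=2))     # ALARM
```

The single spike pages immediately under 3 / 1 and is ignored under 3 / 2, which only fires once a second window also breaches. That second window is the extra interval of latency.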

Detection presets

The template ships five presets for common patterns. Set DetectionType and the template picks a matching query and default threshold.

| DetectionType | Use case | Default threshold |
|---|---|---|
| sensitive-file-access | Access to a specific path | > 0 |
| failed-access-attempts | Authentication/authorization failures | > 10 |
| bulk-delete-operations | Mass deletion (admin-plane destructive ops on this log group; see the scoping note above) | > 50 |
| specific-user-activity | Privileged-user monitoring | > 0 |
| custom | Any Logs Insights query | your choice |

⚠️ Regulated environments — one thing to get right first: ActionLogLineCount puts the matched log lines (usernames, file paths, client IPs — potentially PHI) into the SNS notification, which leaves the CloudWatch boundary. For healthcare/finance/public-sector data, default to ActionLogLineCount: 0 and let responders pivot into Logs Insights for the detail. This is a detection mechanism, not a compliance attestation — classify your fields and confirm APPI/FISC/ISMAP/HIPAA scope with your compliance team first. Full guidance, the regulated default, the alert-audit-trail requirements, and multi-account (StackSets) rollout are in the setup guide.


Use Cases

1. Admin-plane destructive-op detection (defense in depth)

To be precise about scope: on the admin audit log this preset catches management-plane destructive operations — a burst of Snapshot deletes, volume offline, volume delete — the kind of action an attacker with stolen admin credentials (or a mistaken operator) would take to remove recovery points. It does not see user-file encryption over SMB; that's ONTAP ARP's job at the storage layer (Part 3), with FPolicy (Part 4) for per-file operations. Layer all three: ARP for encryption, FPolicy for file ops, and this Log Alarm for admin-plane tampering (e.g., someone deleting the Snapshots you'd restore from).

DETECTION_TYPE=bulk-delete-operations \
ALARM_THRESHOLD=50 \
QUERY_RESULTS_TO_ALARM=2 \
SNS_TOPIC_ARN=<YOUR_SNS_ARN> \
  bash shared/scripts/deploy-log-alarm.sh
| Detection layer | Catches | Method | Latency |
|---|---|---|---|
| Storage layer (ARP) | User-file encryption | ML-based entropy analysis | Real-time |
| File ops (FPolicy) | Per-file create/write/delete/rename over NFS/SMB | Protocol-level intercept | ~6 s (validated, Part 4) |
| Admin plane (Log Alarm) | Snapshot/volume destructive ops by admins | Count-based threshold on admin audit log | ~5 min |

Three different vantage points on the same attack: ARP sees the encryption, FPolicy sees the file operations, and this Log Alarm sees an admin deleting the Snapshots you'd recover from. Layering them means whatever slips past one is more likely caught by another.

In ATT&CK terms, that admin-plane view is T1490 Inhibit System Recovery — an attacker deleting your restore points so you can't roll back the encryption ARP detects (T1486). Two techniques, two controls: detect the Snapshot deletion here, and prevent it with SnapLock (WORM Snapshots that can't be deleted before expiry) so your recovery points survive the attempt. The full ATT&CK mapping, tamper-resistance guidance (who can delete the alarm and how to guard it), and a one-slide coverage map are in the setup guide.

Ops note (baseline first): Scheduled bulk operations — nightly backups, batch ETL, archive cleanups — can legitimately exceed a "50 deletes / 5 min" threshold and page on-call for nothing. Baseline your normal delete volume for a few days without an alarm action, then set the threshold above your routine peak (the exact baseline query is in the setup guide). This mirrors the ARP learning-period caveat from Part 3.
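Once you have a few days of per-window delete counts, picking the threshold is simple arithmetic. The sketch below is one way to do it, assuming you have already exported the counts; suggest_threshold, the 20% headroom, and the sample numbers are all hypothetical, and this is not the baseline query from the setup guide.

```python
import math


def suggest_threshold(baseline_counts, headroom_pct=20):
    """Suggest an alarm threshold a bit above the routine peak.

    baseline_counts: deletes per 5-minute window during normal operation.
    headroom_pct: safety margin above the observed peak (an assumption).
    """
    peak = max(baseline_counts)
    return math.ceil(peak * (100 + headroom_pct) / 100)


# Made-up baseline: quiet most of the time, with a nightly cleanup burst.
baseline = [3, 0, 41, 2, 58, 1]

print(suggest_threshold(baseline))  # 70
```

Here the routine peak of 58 would already trip the default threshold of 50, so the alarm would page on every nightly cleanup until the threshold moves above it.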

2. Compliance: notify on regulated-data access

For finance or healthcare, where "any touch of this data must be recorded and notified", a simple per-path detection works.

DETECTION_TYPE=sensitive-file-access \
TARGET_PATTERN="/vol/finance/" \
ALARM_THRESHOLD=0 \
SNS_TOPIC_ARN=<YOUR_SNS_ARN> \
  bash shared/scripts/deploy-log-alarm.sh

Alert-fatigue note: A > 0 threshold on an actively-used path pages on every access and quickly becomes noise. Reserve > 0 for genuinely restricted paths (break-glass directories, quarantined data). For paths with legitimate regular access, prefer a rate threshold (e.g., access from an unexpected principal, or volume above a baseline) and route to a ticket/Slack channel rather than a pager.

3. Privileged-user monitoring

Keeping a record of every admin-account action uses the same mechanism.

DETECTION_TYPE=specific-user-activity \
TARGET_PATTERN="fsxadmin" \
ALARM_THRESHOLD=0 \
SNS_TOPIC_ARN=<YOUR_SNS_ARN> \
  bash shared/scripts/deploy-log-alarm.sh

E2E Validation (Tokyo Region)

Because it's a brand-new feature, I was skeptical it would behave as documented. So I deployed the template in a real Tokyo-region environment and drove the alarm through a state transition.

| Item | Result |
|---|---|
| CloudFormation deploy | ✅ Success |
| IAM role auto-creation | ✅ ScheduledQueryRole + LogLineRole |
| Scheduled Query execution | ✅ INSUFFICIENT_DATA → OK transition confirmed |
| Console display | ✅ Shown as "Log alarm" type |
| Logs Insights query | ✅ Matched (12 hits / 3,482 records scanned for /volume/; 472 hits for ssh) |

In the console it shows up as a new "Log alarm" type, distinct from the metric alarms you already have. In the screenshot below, look at the Type column — the new alarm is labeled "Log alarm" rather than "Metric alarm".

[Screenshot: CloudWatch Alarms list; the Type column shows the new alarm as "Log alarm"]

Opening the alarm shows the Log Alarm detail page. Note the query configuration (the Logs Insights query string, the target log group /syslog/fsxn-admin-audit, and the 5-minute schedule) and the two IAM roles CloudFormation created automatically — ScheduledQueryRole (runs the query) and LogLineRole (attaches matched log lines to the notification).

[Screenshot: Log alarm detail page showing the Logs Insights query configuration, target log group, 5-minute schedule, and the auto-created ScheduledQueryRole and LogLineRole]

Running the query in Logs Insights hits real audit data. The bar chart at the top shows the match count per interval, and the results table below lists the matched log lines — here the /volume/ filter returned 12 matches across 3,482 records scanned.

[Screenshot: Logs Insights query result for the /volume/ filter: 12 matches over 3,482 records scanned, with the match-count bar chart above the results table]

The alarm itself stays OK while there's no matching access — the screenshot shows the state after the initial INSUFFICIENT_DATA → OK transition, with an empty history graph because nothing has crossed the threshold yet. With threshold > 0, it flips to ALARM the moment a single access to the sensitive path appears.

[Screenshot: Log alarm detail page in state OK, showing the INSUFFICIENT_DATA to OK transition and a flat history graph below the threshold]

Audit trail of the alert itself: for compliance you also need evidence that the alarm fired and who was notified. Alarm state history is retained 90 days (fixed); for multi-year evidence, route CloudWatch Alarm state-change events (EventBridge) to S3, and rely on CloudTrail for the "who configured this detection" record. Details and retention table: setup guide.

Gotchas

| Gotcha | Workaround |
|---|---|
| AWS CLI not yet supported (no put-log-alarm) | Use CloudFormation |
| cfn-lint E3006 (resource type not yet in the spec) | Suppress per-resource (not a blanket disable); exact Metadata snippet in the setup guide |
| First evaluation takes 5–10 min | Just wait |
| Notification includes log lines (PII/PHI risk) | Set ActionLogLineCount=0 in regulated environments (see the callout above) |
| Multi-node log streams (FsxId...-01/-02) | Query the whole log group, don't pin @logStream, or you miss half the traffic on HA takeover |

The missing put-log-alarm command is a just-after-GA reality: for now, CloudFormation and the console are the only creation paths. Platform/CoE note: drift detection won't cover a resource type the CLI can't yet describe, so treat CloudFormation as the single source of truth until the API surface completes.


Cost

Rough estimate for one alarm at a 5-minute cadence.

| Logs/day | Log Alarm | Metric filter |
|---|---|---|
| 100 MB | ~$6.6/month | ~$3/month |
| 500 MB | ~$33/month | ~$3/month |
| 1 GB | ~$66/month | ~$3/month |

How the cost scales: the table above is per alarm. Each alarm runs its own Scheduled Query over the same log group, so cost is alarms × cadence × scan size — ten alarms on one log group is ~10×, not a flat add-on. Narrow queries with filter/limit and consolidate related detections to keep it bounded. Full breakdown in the setup guide.

Honestly, Log Alarms cost more than the metric-filter approach — the Scheduled Query scans logs on each run. But the difference is recoverable elsewhere: log lines in the notification cut investigation time, it applies retroactively to existing logs, and setup is a single step.

When the premium is worth it (a line you can put in a proposal): choose Log Alarms when the alert content matters for triage (you want the matched lines in the page), when the detection query needs full Logs Insights syntax a metric filter can't express, or when you need to apply it retroactively to existing logs. Stick with metric filters for high-volume, well-understood, purely numeric signals where a few dollars × many alarms adds up. For most teams the crossover is engineer time: if the log-line context saves even one 15-minute console dig per incident, the premium pays for itself.

If volume makes cost a concern, dropping the cadence to 15 minutes cuts it to a third, and adding limit to the query bounds the scan.
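The alarms × cadence × scan size relationship is easy to play with in Python. Treat the numbers below as a rough model, not AWS pricing: the per-GB price and the idea that each run scans roughly one day's worth of logs are assumptions I chose because together they reproduce the table above. Check the current Logs Insights pricing for your region before putting a figure in a proposal.

```python
# ASSUMPTION: price per GB scanned, picked to match the table above.
# It is not quoted from the article or from an AWS price list.
PRICE_PER_GB_SCANNED = 0.0076


def monthly_cost(scan_gb_per_run, cadence_minutes=5, alarms=1, days=30):
    """Estimate monthly cost as alarms x runs x GB scanned per run x price.

    scan_gb_per_run is assumed to be about one day's log volume
    (0.1 for the 100 MB/day row), which is also an assumption.
    """
    runs = (24 * 60 // cadence_minutes) * days
    return alarms * runs * scan_gb_per_run * PRICE_PER_GB_SCANNED


print(f"{monthly_cost(0.1):.2f}")                      # one alarm, 5-minute cadence
print(f"{monthly_cost(0.1, cadence_minutes=15):.2f}")  # same alarm, 15-minute cadence
print(f"{monthly_cost(0.1, alarms=10):.2f}")           # ten alarms on one log group
```

Under these assumptions the first line lands near the ~$6.6 row, the 15-minute cadence is a third of it, and ten alarms cost ten times as much.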


When to Use Which

Log Alarms aren't meant to replace every kind of monitoring. Depending on the need, metric filters or the OTel Collector fit better. Among this project's three delivery paths, the Log Alarm is the simplest entry point.

| Path | When | Extra infra |
|---|---|---|
| Log Alarm (this article) | Simple threshold alerts | None (self-contained in CloudWatch) |
| Lambda → vendor | Dashboards, SIEM | Forwarding Lambda |
| OTel Collector | Multi-backend, PII redaction | Forwarding Lambda + Collector (ECS Fargate) |

Concretely, the OTel Collector path's "extra infra" is two layers: three forwarding Lambdas (audit-log shipper, EMS handler, FPolicy handler) sending OTLP/HTTP, plus a resident otel/opentelemetry-collector-contrib task on ECS Fargate (with NAT Gateway for egress, Cloud Map for task IP resolution, ALB/autoscaling under load). Fan-out to Grafana/Honeycomb/Datadog and PII redaction are then a single change in the Collector's config YAML. The Log Alarm, by contrast, is the "everything stays inside CloudWatch" minimal option.

A Log Alarm is a first alert, not a full investigation tool. Let it handle "notice it first", and hand off the deep dive to vendor tooling like Datadog or Splunk. That division of labor is the realistic one.

For ONTAP operators, the setup guide's ONTAP operational notes cover the storage-side specifics: what security audit captures vs what log-forwarding sends, keeping an existing SIEM alongside CloudWatch (multiple destinations), ONTAP EMS native email/SNMP as an alternative to pushing to AWS, a dead-man's-switch heartbeat alarm for when syslog delivery stops, and DR-region deployment.


Cleanup

# Delete all Log Alarm stacks
bash shared/scripts/cleanup-log-alarm.sh --all -y

What's Next

CloudWatch Log Alarms aren't a flashy feature. But "turn what you noticed in the logs straight into an alert" lowers the bar for setting up monitoring — you're done with one query before you'd have finished building a metric filter, staring at a metric, and wiring an alarm. It pairs well with FSx for ONTAP audit logs, answering the "I just want to notice it first" needs of storage security without extra infrastructure. There's some just-after-GA roughness (the AWS CLI hasn't caught up), but CloudFormation works today, verified in a real environment.

Upcoming in the project:

  • Phase 4: Terraform module equivalents
  • Phase 4: CDK construct library
  • PagerDuty escalation for CloudWatch alarms — see pagerduty-escalation-guide

See the full ROADMAP.


If you deploy this, I'd love to hear how it went — drop a comment or open a GitHub issue.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
