TL;DR
Once FSx for ONTAP audit logs land in CloudWatch Logs, a common next question is: "How do I get alerted when someone touches a specific file?" Until recently you had to build a metric filter first. As of July 1, 2026, CloudWatch Log Alarms let you create an alarm directly from a Logs Insights query — no metric filter step.
- CloudWatch Log Alarm (GA 2026-07-01): alarm straight from a log query. No metric filter.
- How it works: a Logs Insights query matches events, count(*) turns them into a number, and the alarm fires when the count crosses a threshold.
- 5 detection presets: sensitive-path access / failed access / bulk delete / privileged-user activity / custom.
- One-command deploy: bash shared/scripts/deploy-log-alarm.sh (working in ~15 minutes).
- Cost: ~$6.6/month (100 MB/day). Slightly higher than the metric-filter approach, but you get log lines in the notification and retroactive queries.
- E2E verified: CloudFormation deploy through state transition, in the Tokyo region.
This is Part 17 of the Serverless Observability for FSx for ONTAP series. It builds directly on the admin-audit pipeline from Part 14.
Two AWS Launches That Make This Work
This article sits on top of two recent CloudWatch launches. If you're arriving from outside this series, read these first — together they remove the EC2 syslog server and the metric-filter step that used to stand between ONTAP logs and an alert.
1. Managed syslog ingestion (June 2026) — how the logs get in
ONTAP emits its admin audit trail (and EMS events) as syslog. Historically, capturing that meant running an EC2 syslog server. In June 2026, CloudWatch Logs added managed syslog ingestion: you send syslog (RFC 5424 / RFC 3164 / Cisco FTD/ASA) over a VPC endpoint straight into a log group — no agent, no EC2.
For FSx for ONTAP, that means pointing ONTAP log-forwarding at the syslog VPC endpoint and having the admin audit log arrive in CloudWatch Logs directly.
- What's New: Amazon CloudWatch Logs supports managed syslog ingestion
- Setup guide: Setting up syslog ingestion — VPC endpoint, log group, resource policy, and the three transport options (TCP+TLS 6514 encrypted, TCP 1514 plaintext, UDP 514 best-effort)
This is the piece the previous article (Part 14 / Syslog VPCE setup) already wired up. If you followed it, your admin audit log is in CloudWatch Logs and this article's alarm has something to query.
2. CloudWatch Log Alarms (July 2026) — how you alert on them
The second launch is the subject of this article: alarms defined by a Logs Insights query rather than a metric.
- What's New: Amazon CloudWatch supports creating alarms from log queries
- Docs: Alarming on logs
Put together: ONTAP → managed syslog ingestion → CloudWatch Logs → Log Alarm → SNS. No EC2, no metric filter, no forwarding Lambda.
What Changed
Alerting on log content is a basic monitoring need. On CloudWatch it used to take a detour: create a metric filter, generate a custom metric from it, then attach an alarm to that metric — three steps. From "I want to detect this" to an alert actually firing was 15–30 minutes of setup.
Log Alarms collapse the middle steps. You give a Logs Insights query a threshold, and the alarm reads straight from the logs.
| Aspect | Before (metric filter) | Log Alarm (NEW) |
|---|---|---|
| Setup steps | 3 (filter → metric → alarm) | 1 (Log Alarm only) |
| Query flexibility | Filter pattern syntax only | Full Logs Insights syntax |
| Log lines in notification | No | Yes (up to 50) |
| Retroactive to existing logs | No | Yes |
| IaC | 2 resources | 1 resource (AWS::CloudWatch::LogAlarm) |
The part I appreciate most: the notification can include the matched log lines. When the alert lands, you already see who touched which file and what they did — you can form a first read before opening the console.
FSx for ONTAP Logs × Log Alarm
There are three FSx for ONTAP log/event families, and it's worth separating them because only some are a natural fit for Log Alarms.
| Log type | What it records | Delivery path | Log Alarm target |
|---|---|---|---|
| Admin audit log | ONTAP management ops (CLI / REST API) | Managed syslog ingestion → CloudWatch Logs | ✅ this article |
| File access audit log | NAS file/folder ops (NFS / SMB) | FSx for ONTAP S3 AP → Lambda | Separate pipeline |
| EMS events | ONTAP system events (capacity / HA / ARP) | Managed syslog ingestion → CloudWatch Logs, or EMS Webhook → Lambda | ✅ when syslog-delivered |
Admin audit log is the "who did what as an administrator" record — logins (success/failure), security login account and role changes, volume create/offline/delete, vserver config changes, and privileged operations like system node systemshell or set -privilege diagnostic.
Two ONTAP settings, don't conflate them: security audit decides what gets recorded (e.g., security audit modify -cliget on -httpget on -ontapiget on to capture read/GET operations; these are off by default, so without them your sensitive-file-access query on admin ops comes up empty), while cluster log-forwarding decides where it's sent. The exact commands are in the Syslog VPCE setup guide. Note that cluster log-forwarding supports multiple destinations, so you can add CloudWatch alongside an existing on-prem SIEM without cutting over.
File access audit log is the "which user did what to which file" record, enabled via vserver audit create and emitted in Windows Security Event format (EVTX / XML). It's the right source for sensitive-folder access and mass-delete detection — but to use it with a Log Alarm you'd need a separate pipeline to land EVTX/XML into CloudWatch Logs. In this project, that content flows via the FSx for ONTAP S3 Access Point → Lambda path to each vendor instead.
⚠️ Critical scoping (read before you trust a preset name): the presets in this article run against the admin audit log (/syslog/fsxn-admin-audit). That log contains management-plane operations; it does not see end-user file I/O over NFS/SMB. So on this log group:
- bulk-delete-operations detects admin-plane destructive ops (Snapshot delete, volume offline/delete), not ransomware encrypting user files over SMB.
- sensitive-file-access matches a path only when an admin command references it, not when a user opens that file.
For user-file ransomware/mass-delete/sensitive-access detection, use ONTAP ARP (Part 3) and FPolicy / file-access audit (Part 4). The same preset works against a file-access-audit log group if you land that data in CloudWatch Logs, but on /syslog/fsxn-admin-audit it only sees the admin plane.
EMS events (system events)
The third source is EMS (Event Management System) — ONTAP's internal system-event notifications. Where audit logs say "who did it", EMS says "what happened to the system", across seven severities (emergency → debug). Representative events this project normalizes:
| Event | Severity | Meaning |
|---|---|---|
| arw.volume.state / arw.vserver.state | alert | ARP (Autonomous Ransomware Protection) state transition |
| monitor.volume.full / wafl.vol.full | alert | Volume space exhaustion |
| wafl.quota.hardlimit.exceeded | error | Quota hard-limit exceeded |
| cf.fsm.takeoverStarted | alert | HA takeover started |
| net.linkDown | alert | Network link down |
EMS has two routes. This project's EMS Webhook path (HTTPS → API Gateway → Lambda) normalizes and ships events to any vendor/OTLP. Alternatively, ONTAP 9.x can forward EMS over syslog (event notification destination create -syslog ...) to the same CloudWatch Logs syslog VPC endpoint you built for the admin audit log — no separate EC2 syslog server — and then EMS becomes a Log Alarm target directly. If your need is "alert immediately on volume-full or an ARP state change", syslog-to-CloudWatch + a Log Alarm is the shortest path.
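If EMS is forwarded over syslog, one way to target the events in the table above is a single regex filter. The Python sketch below only builds the Logs Insights query string. It assumes the EMS event name appears verbatim in the syslog message (worth confirming against a real line in your log group), and build_ems_query is a hypothetical helper, not something shipped in the repo.

```python
import re


def build_ems_query(event_names):
    """Return a Logs Insights query matching any of the given EMS event names."""
    if not event_names:
        raise ValueError("at least one EMS event name is required")
    # re.escape protects the dots so "arw.volume.state" is matched literally.
    alternatives = "|".join(re.escape(name) for name in event_names)
    return "fields @timestamp, @message\n| filter @message like /" + alternatives + "/"


# Event names taken from the table above.
query = build_ems_query(["arw.volume.state", "monitor.volume.full", "wafl.vol.full"])
print(query)
```

The resulting string could be used as the query of the custom preset; pairing it with count(*) and a threshold of > 0 would follow the same pattern as the path-based alarm.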
Architecture
FSx for ONTAP (ONTAP log-forwarding)
│ Syslog TCP (TLS 6514)
▼
CloudWatch Logs managed syslog ingestion (VPC endpoint)
│ → /syslog/fsxn-admin-audit
│ Scheduled Query (5 min)
▼
CloudWatch Log Alarm
│ count(*) > threshold → ALARM
▼
SNS → Email / Slack / PagerDuty (with log lines)
The actual log format
Writing a Log Alarm query is easier if you know what the line looks like. A real admin-audit line in CloudWatch Logs:
<190>Jul 2 03:17:37 FsxId...-02: [kern_audit:info:6392]
...:: FsxId...:ssh :: <source-ip>:unknown ::
FsxId...:fsx-control-plane:admin ::
system node systemshell -node * -command "top -d 1 -s 1"
:: Success
It reads like an incantation, but everything is there: when, which protocol (ssh/http), from where (source IP), who (user), what (command), and the outcome (Success/Failure). A query usually just needs one of those.
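To make the field layout concrete, here is a minimal Python sketch that splits such a line on its "::" separators. It is an illustration only: the field order is inferred from the single sample above, the sample values (FsxIdEXAMPLE, 192.0.2.10, the 0000 token) are placeholders, and parse_admin_audit is a hypothetical helper rather than part of the project.

```python
import re
from typing import NamedTuple, Optional


class AdminAuditEvent(NamedTuple):
    protocol: str
    source: str
    user: str
    command: str
    outcome: str


# Fields are separated by "::" followed by whitespace. Requiring the trailing
# whitespace avoids splitting inside an IPv6 address such as fe80::1.
_SEPARATOR = re.compile(r"\s*::\s+")


def parse_admin_audit(line: str) -> Optional[AdminAuditEvent]:
    # Collapse wrapped lines and repeated spaces before splitting.
    parts = _SEPARATOR.split(" ".join(line.split()))
    if len(parts) < 6:
        return None  # not an admin-audit line in the assumed layout
    _header, application, source, user, *command, outcome = parts
    return AdminAuditEvent(
        protocol=application.rsplit(":", 1)[-1],  # e.g. ssh / http
        source=source.rsplit(":", 1)[0],          # drop the trailing ":unknown"
        user=user,                                # kept raw; sub-fields not interpreted
        command=" :: ".join(command),
        outcome=outcome,
    )


# Placeholder values, shaped like the sample line above.
SAMPLE = (
    '<190>Jul 2 03:17:37 FsxIdEXAMPLE-02: [kern_audit:info:6392] 0000:: '
    'FsxIdEXAMPLE:ssh :: 192.0.2.10:unknown :: '
    'FsxIdEXAMPLE:fsx-control-plane:admin :: '
    'system node systemshell -node * -command "top -d 1 -s 1" :: Success'
)
event = parse_admin_audit(SAMPLE)
print(event)
```

A Log Alarm query does not need this parsing, since a regex on @message is enough; the sketch is mainly useful when deciding which field a filter should anchor on.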
Why "Alert on a String" Actually Means "Count Then Compare"
Worth recalling: a CloudWatch alarm compares a number against a threshold. It is not natively a "fire when this string appears" trigger.
So how does a Log Alarm alert on content? It turns the string into a number first. A Logs Insights query narrows to matching events, count(*) converts them to a count, and the alarm fires when that count crosses the threshold. Read it as string → count → threshold and it clicks.
For example, to detect access to a confidential folder:
fields @timestamp, @message
| filter @message like /\/vol\/data\/confidential/
Set the aggregation to count(*) and the threshold to > 0, and a single access within the 5-minute window flips the alarm to ALARM. "Tell me the moment anyone touches the confidential folder" — done, with just that.
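The string → count → threshold idea can be simulated locally in a few lines of Python. This is a sketch of the concept, not of CloudWatch's internals: evaluate_window is a hypothetical function, and the two messages are made-up strings standing in for one 5-minute window of log events.

```python
import re


def evaluate_window(messages, pattern, threshold=0):
    """Mimic one evaluation: filter -> count(*) -> compare against the threshold."""
    regex = re.compile(pattern)
    count = sum(1 for message in messages if regex.search(message))
    state = "ALARM" if count > threshold else "OK"
    return count, state


# Made-up messages for one evaluation window.
window = [
    "example admin command that touches /vol/data/public/readme.txt",
    "example admin command that touches /vol/data/confidential/plan.xlsx",
]
print(evaluate_window(window, r"/vol/data/confidential"))  # (1, 'ALARM')
```

With the threshold at 0 a single match is enough, which is exactly why a > 0 alarm on a busy path turns noisy.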
Deploy
The repo ships a deploy script, so a few environment variables bring up the whole set of resources.
Deploy script (recommended)
# Sensitive-file access detection (auto-creates the SNS topic)
DETECTION_TYPE=sensitive-file-access \
TARGET_PATTERN="/vol/data/confidential" \
CREATE_SNS_TOPIC=true \
SNS_TOPIC_NAME=fsxn-security-alerts \
bash shared/scripts/deploy-log-alarm.sh
The script creates the SNS topic, deploys the CloudFormation stack, and prints the alarm name and console URL.
CloudFormation template
# shared/templates/cloudwatch-log-alarm.yaml (excerpt)
Resources:
  SensitiveFileAccessAlarm:
    Type: AWS::CloudWatch::LogAlarm
    Properties:
      AlarmName: fsxn-sensitive-file-access
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      QueryResultsToEvaluate: 3
      QueryResultsToAlarm: 1
      TreatMissingData: notBreaching
      ScheduledQueryConfiguration:
        QueryString: |
          fields @timestamp, @message
          | filter @message like /\/vol\/data\/confidential/
        LogGroupIdentifiers:
          - /syslog/fsxn-admin-audit
        ScheduledQueryRoleARN: !GetAtt ScheduledQueryRole.Arn
        AggregationExpression: "count(*)"
        ScheduleConfiguration:
          ScheduleExpression: "rate(5 minutes)"
          StartTimeOffset: 300
      AlarmActions:
        - !Ref AlarmSNSTopic
      ActionLogLineCount: 5
      ActionLogLineRoleArn: !GetAtt LogLineRole.Arn
On QueryResultsToEvaluate / QueryResultsToAlarm (M-of-N): this is the flapping control — 3 / 1 fires fast but can flap on a single spike; 3 / 2 smooths transient blips at ~one extra interval of latency. Tune this before the threshold. Per-use-case recommendations are in the setup guide.
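A small Python sketch shows how the M-of-N setting changes the outcome for the same data. It assumes Log Alarms evaluate "at least M breaching results among the last N", the way CloudWatch metric alarms do; alarm_state is a hypothetical function, and None stands for a missing result treated as not breaching (mirroring TreatMissingData: notBreaching).

```python
def alarm_state(results, threshold, evaluate=3, to_alarm=1):
    """results: newest-last list of query counts; None marks a missing result."""
    recent = results[-evaluate:]
    breaching = sum(1 for count in recent if count is not None and count > threshold)
    return "ALARM" if breaching >= to_alarm else "OK"


single_spike = [0, 7, 0]   # one noisy interval
sustained = [0, 7, 4]      # two breaching intervals in a row

print(alarm_state(single_spike, threshold=0, to_alarm=1))  # ALARM
print(alarm_state(single_spike, threshold=0, to_alarm=2))  # OK
print(alarm_state(sustained, threshold=0, to_alarm=2))     # ALARM
```

The 3 / 2 setting ignores the single spike but still fires once a second interval breaches, which is the extra interval of latency mentioned above.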
Detection presets
The template ships five presets for common patterns. Switch DetectionType and each gets an appropriate query and threshold.
| DetectionType | Use case | Default threshold |
|---|---|---|
| sensitive-file-access | Access to a specific path | > 0 |
| failed-access-attempts | Authentication/authorization failures | > 10 |
| bulk-delete-operations | Mass deletion (ransomware signal) | > 50 |
| specific-user-activity | Privileged-user monitoring | > 0 |
| custom | Any Logs Insights query | your choice |
⚠️ Regulated environments, one thing to get right first: ActionLogLineCount puts the matched log lines (usernames, file paths, client IPs, potentially PHI) into the SNS notification, which leaves the CloudWatch boundary. For healthcare/finance/public-sector data, default to ActionLogLineCount: 0 and let responders pivot into Logs Insights for the detail. This is a detection mechanism, not a compliance attestation: classify your fields and confirm APPI/FISC/ISMAP/HIPAA scope with your compliance team first. Full guidance, the regulated default, the alert-audit-trail requirements, and multi-account (StackSets) rollout are in the setup guide.
Use Cases
1. Admin-plane destructive-op detection (defense in depth)
To be precise about scope: on the admin audit log this preset catches management-plane destructive operations — a burst of Snapshot deletes, volume offline, volume delete — the kind of action an attacker with stolen admin credentials (or a mistaken operator) would take to remove recovery points. It does not see user-file encryption over SMB; that's ONTAP ARP's job at the storage layer (Part 3), with FPolicy (Part 4) for per-file operations. Layer all three: ARP for encryption, FPolicy for file ops, and this Log Alarm for admin-plane tampering (e.g., someone deleting the Snapshots you'd restore from).
DETECTION_TYPE=bulk-delete-operations \
ALARM_THRESHOLD=50 \
QUERY_RESULTS_TO_ALARM=2 \
SNS_TOPIC_ARN=<YOUR_SNS_ARN> \
bash shared/scripts/deploy-log-alarm.sh
| Detection layer | Catches | Method | Latency |
|---|---|---|---|
| Storage layer (ARP) | User-file encryption | ML-based entropy analysis | Real-time |
| File ops (FPolicy) | Per-file create/write/delete/rename over NFS/SMB | Protocol-level intercept | ~6 s (validated, Part 4) |
| Admin plane (Log Alarm) | Snapshot/volume destructive ops by admins | Count-based threshold on admin audit log | ~5 min |
Three different vantage points on the same attack: ARP sees the encryption, FPolicy sees the file operations, and this Log Alarm sees an admin deleting the Snapshots you'd recover from. Layering them means whatever slips past one is more likely caught by another.
In ATT&CK terms, that admin-plane view is T1490 Inhibit System Recovery — an attacker deleting your restore points so you can't roll back the encryption ARP detects (T1486). Two techniques, two controls: detect the Snapshot deletion here, and prevent it with SnapLock (WORM Snapshots that can't be deleted before expiry) so your recovery points survive the attempt. The full ATT&CK mapping, tamper-resistance guidance (who can delete the alarm and how to guard it), and a one-slide coverage map are in the setup guide.
Ops note (baseline first): Scheduled bulk operations — nightly backups, batch ETL, archive cleanups — can legitimately exceed a "50 deletes / 5 min" threshold and page on-call for nothing. Baseline your normal delete volume for a few days without an alarm action, then set the threshold above your routine peak (the exact baseline query is in the setup guide). This mirrors the ARP learning-period caveat from Part 3.
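Once the baseline counts are collected, picking the threshold is simple arithmetic. The Python sketch below is one possible rule, not the setup guide's method: suggest_threshold is a hypothetical helper, the 20% headroom is an arbitrary assumption, and the counts are invented per-window delete totals.

```python
import math


def suggest_threshold(baseline_counts, headroom=1.2):
    """Suggest a threshold above the routine peak of per-window delete counts."""
    if not baseline_counts:
        raise ValueError("collect baseline counts before choosing a threshold")
    return math.ceil(max(baseline_counts) * headroom)


# Invented 5-minute delete counts observed during the baseline period.
observed = [3, 12, 48, 5]
print(suggest_threshold(observed))  # 58
```

Here the nightly peak of 48 sits just under the default of 50, so the default would page on an ordinary busy night; the suggested value leaves some room above it.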
2. Compliance: notify on regulated-data access
For finance or healthcare, where "any touch of this data must be recorded and notified", a simple per-path detection works.
DETECTION_TYPE=sensitive-file-access \
TARGET_PATTERN="/vol/finance/" \
ALARM_THRESHOLD=0 \
SNS_TOPIC_ARN=<YOUR_SNS_ARN> \
bash shared/scripts/deploy-log-alarm.sh
Alert-fatigue note: A > 0 threshold on an actively used path pages on every access and quickly becomes noise. Reserve > 0 for genuinely restricted paths (break-glass directories, quarantined data). For paths with legitimate regular access, prefer a rate threshold (e.g., access from an unexpected principal, or volume above a baseline) and route to a ticket/Slack channel rather than a pager.
3. Privileged-user monitoring
Keeping a record of every admin-account action uses the same mechanism.
DETECTION_TYPE=specific-user-activity \
TARGET_PATTERN="fsxadmin" \
ALARM_THRESHOLD=0 \
SNS_TOPIC_ARN=<YOUR_SNS_ARN> \
bash shared/scripts/deploy-log-alarm.sh
E2E Validation (Tokyo Region)
Because it's a brand-new feature, I was skeptical it would behave as documented. So I deployed the template in a real Tokyo-region environment and drove the alarm through a state transition.
| Item | Result |
|---|---|
| CloudFormation deploy | ✅ Success |
| IAM role auto-creation | ✅ ScheduledQueryRole + LogLineRole |
| Scheduled Query execution | ✅ INSUFFICIENT_DATA → OK transition confirmed |
| Console display | ✅ Shown as "Log alarm" type |
| Logs Insights query | ✅ Matched (12 hits / 3,482 records scanned for /volume/; 472 hits for ssh) |
In the console it shows up as a new "Log alarm" type, distinct from the metric alarms you already have. In the screenshot below, look at the Type column — the new alarm is labeled "Log alarm" rather than "Metric alarm".
Opening the alarm shows the Log Alarm detail page. Note the query configuration (the Logs Insights query string, the target log group /syslog/fsxn-admin-audit, and the 5-minute schedule) and the two IAM roles CloudFormation created automatically — ScheduledQueryRole (runs the query) and LogLineRole (attaches matched log lines to the notification).
Running the query in Logs Insights hits real audit data. The bar chart at the top shows the match count per interval, and the results table below lists the matched log lines — here the /volume/ filter returned 12 matches across 3,482 records scanned.
The alarm itself stays OK while there's no matching access — the screenshot shows the state after the initial INSUFFICIENT_DATA → OK transition, with an empty history graph because nothing has crossed the threshold yet. With threshold > 0, it flips to ALARM the moment a single access to the sensitive path appears.
Audit trail of the alert itself: for compliance you also need evidence that the alarm fired and who was notified. Alarm state history is retained 90 days (fixed); for multi-year evidence, route CloudWatch Alarm state-change events (EventBridge) to S3, and rely on CloudTrail for the "who configured this detection" record. Details and retention table: setup guide.
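For the EventBridge route, the rule needs an event pattern that selects alarm state changes. The Python sketch below builds one as plain JSON. The source and detail-type values are the ones CloudWatch uses for metric-alarm state changes; that Log Alarms emit the same event shape is an assumption to verify, and alarm_state_change_pattern is a hypothetical helper.

```python
import json


def alarm_state_change_pattern(alarm_names):
    """Build an EventBridge event pattern for state changes of the named alarms."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        # Assumption: Log Alarms report their name in detail.alarmName,
        # as metric alarms do.
        "detail": {"alarmName": list(alarm_names)},
    }


pattern = alarm_state_change_pattern(["fsxn-sensitive-file-access"])
print(json.dumps(pattern, indent=2))
```

The printed JSON can be pasted into an EventBridge rule whose target (for example a Firehose stream writing to S3) keeps the state-change history beyond the 90-day window.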
Gotchas
| Gotcha | Workaround |
|---|---|
| AWS CLI not yet supported (no put-log-alarm) | Use CloudFormation |
| cfn-lint E3006 (resource type not yet in the spec) | Suppress per-resource (not a blanket disable); exact Metadata snippet in the setup guide |
| First evaluation takes 5–10 min | Just wait |
| Notification includes log lines (PII/PHI risk) | Set ActionLogLineCount=0 in regulated environments (see the callout above) |
| Multi-node log streams (FsxId...-01/-02) | Query the whole log group, don't pin @logStream, or you miss half the traffic on HA takeover |
The missing put-log-alarm command in the AWS CLI is a just-after-GA reality: for now, CloudFormation and the console are the only creation paths. Platform/CoE note: drift detection won't cover a resource type the CLI can't yet describe, so treat CloudFormation as the single source of truth until the API surface completes.
Cost
Rough estimate for one alarm at a 5-minute cadence.
| Logs/day | Log Alarm | Metric filter |
|---|---|---|
| 100 MB | ~$6.6/month | ~$3/month |
| 500 MB | ~$33/month | ~$3/month |
| 1 GB | ~$66/month | ~$3/month |
How the cost scales: the table above is per alarm. Each alarm runs its own Scheduled Query over the same log group, so cost is alarms × cadence × scan size; ten alarms on one log group is ~10×, not a flat add-on. Narrow queries with filter / limit and consolidate related detections to keep it bounded. Full breakdown in the setup guide.
Honestly, Log Alarms cost more than the metric-filter approach — the Scheduled Query scans logs on each run. But the difference is recoverable elsewhere: log lines in the notification cut investigation time, it applies retroactively to existing logs, and setup is a single step.
When the premium is worth it (a line you can put in a proposal): choose Log Alarms when the alert content matters for triage (you want the matched lines in the page), when the detection query needs full Logs Insights syntax a metric filter can't express, or when you need to apply it retroactively to existing logs. Stick with metric filters for high-volume, well-understood, purely numeric signals where a few dollars × many alarms adds up. For most teams the crossover is engineer time: if the log-line context saves even one 15-minute console dig per incident, the premium pays for itself.
If volume makes cost a concern, dropping the cadence to 15 minutes cuts it to a third, and adding limit to the query bounds the scan.
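The scaling rule can be checked with a few lines of Python. The sketch rests on two assumptions of mine, chosen because together they reproduce the table's figures rather than because they come from the setup guide: a Logs Insights scan price of $0.0076 per GB, and each scheduled run scanning roughly the daily log volume shown in the first column. monthly_cost_usd is a hypothetical helper.

```python
def monthly_cost_usd(scan_mb_per_run, cadence_minutes=5, alarms=1,
                     price_per_gb=0.0076, days=30):
    """Estimate the monthly scan cost: alarms x runs x GB scanned per run x price."""
    runs_per_month = days * 24 * 60 / cadence_minutes
    gb_per_run = scan_mb_per_run / 1000
    return alarms * runs_per_month * gb_per_run * price_per_gb


for volume_mb in (100, 500, 1000):
    print(volume_mb, "MB ->", round(monthly_cost_usd(volume_mb), 1), "USD/month")

# Same alarm at a 15-minute cadence, and ten alarms at the default cadence.
print(round(monthly_cost_usd(100, cadence_minutes=15), 2))
print(round(monthly_cost_usd(100, alarms=10), 1))
```

Plugging in your own region's price and a measured scan size per run (visible in the Logs Insights query statistics) gives a more reliable number than the rough table.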
When to Use Which
Log Alarms aren't meant to replace every kind of monitoring. Depending on the need, metric filters or the OTel Collector fit better. Among this project's three delivery paths, the Log Alarm is the simplest entry point.
| Path | When | Extra infra |
|---|---|---|
| Log Alarm (this article) | Simple threshold alerts | None (self-contained in CloudWatch) |
| Lambda → vendor | Dashboards, SIEM | Forwarding Lambda |
| OTel Collector | Multi-backend, PII redaction | Forwarding Lambda + Collector (ECS Fargate) |
Concretely, the OTel Collector path's "extra infra" is two layers: three forwarding Lambdas (audit-log shipper, EMS handler, FPolicy handler) sending OTLP/HTTP, plus a resident otel/opentelemetry-collector-contrib task on ECS Fargate (with NAT Gateway for egress, Cloud Map for task IP resolution, ALB/autoscaling under load). Fan-out to Grafana/Honeycomb/Datadog and PII redaction are then a single change in the Collector's config YAML. The Log Alarm, by contrast, is the "everything stays inside CloudWatch" minimal option.
A Log Alarm is a first alert, not a full investigation tool. Let it handle "notice it first", and hand off the deep dive to vendor tooling like Datadog or Splunk. That division of labor is the realistic one.
For ONTAP operators, the setup guide covers the storage-side specifics: what security audit captures vs what log-forwarding sends, keeping an existing SIEM alongside CloudWatch (multiple destinations), ONTAP EMS native email/SNMP as an alternative to pushing to AWS, a dead-man's-switch heartbeat alarm for when syslog delivery stops, and DR-region deployment (ONTAP operational notes).
Cleanup
# Delete all Log Alarm stacks
bash shared/scripts/cleanup-log-alarm.sh --all -y
What's Next
CloudWatch Log Alarms aren't a flashy feature. But "turn what you noticed in the logs straight into an alert" lowers the bar for setting up monitoring — you're done with one query before you'd have finished building a metric filter, staring at a metric, and wiring an alarm. It pairs well with FSx for ONTAP audit logs, answering the "I just want to notice it first" needs of storage security without extra infrastructure. There's some just-after-GA roughness (the AWS CLI hasn't caught up), but CloudFormation works today, verified in a real environment.
Upcoming in the project:
- Phase 4: Terraform module equivalents
- Phase 4: CDK construct library
- PagerDuty escalation for CloudWatch alarms — see pagerduty-escalation-guide
See the full ROADMAP.
Resources
- GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
- Template: cloudwatch-log-alarm.yaml
- Deploy script: deploy-log-alarm.sh
- Setup guide (EN): cloudwatch-log-alarm.md
- Runbook: log-alarm-triggered.md
- Syslog VPCE setup: syslog-vpce-setup-guide.md
AWS references
- What's New — CloudWatch supports creating alarms from log queries (2026-07)
- What's New — CloudWatch Logs supports managed syslog ingestion (2026-06)
- Docs — Alarming on logs
- Docs — Setting up syslog ingestion
Series Navigation
- Part 1: Why Your FSx for ONTAP Logs Deserve Better
- Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
- Part 5: Escape Vendor Lock-in with OTel Collector
- Part 13: 9 Vendors, One Architecture
- Part 14: Can You Use System Manager with FSx for ONTAP?
- Part 17: Alerting on Audit Logs with CloudWatch Log Alarms (this post)
If you deploy this, I'd love to hear how it went — drop a comment or open a GitHub issue.
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations



