TL;DR
The existing AWS Blog approach ships FSx for ONTAP audit logs to Splunk via two EC2 instances (syslog-ng + Universal Forwarder). We replaced it with a single Lambda function — same Splunk index, same SPL queries, 90% AWS infrastructure cost reduction.
[Before] FSx for ONTAP → syslog-ng (EC2) → Splunk UF (EC2) → Splunk
Monthly AWS infra cost: ~$66 (2× t3.medium + EBS)
Ops burden: OS patching, agent updates, scaling
[After] FSx for ONTAP → S3 Access Point → Lambda → Splunk HEC
Monthly AWS infra cost: ~$6 (Lambda + EventBridge Scheduler + Secrets Manager)
Ops burden: Zero (managed services only)
Important: The 90% cost reduction refers to AWS infrastructure costs only (EC2/Lambda/EventBridge). Splunk platform licensing costs remain unchanged regardless of the delivery method.
This is Part 8 of the Serverless Observability for FSx for ONTAP series.
The Problem with EC2-Based Splunk Integration
The AWS Blog's architecture works, but it comes with operational overhead:
| Concern | EC2-Based | Serverless |
|---|---|---|
| Monthly cost | ~$66 fixed | ~$6 pay-per-use |
| OS patching | Monthly | None |
| Agent updates | Manual (UF + syslog-ng) | None |
| Scaling | Manual instance resize | Automatic (Lambda concurrency) |
| Availability | Single AZ (unless you add redundancy) | Multi-AZ by default |
| Time to deploy | Hours (provision + configure) | 30 minutes (CloudFormation) |
If you're already running this EC2 pattern and want to modernize, this article shows you how — with a parallel deployment strategy that ensures zero data loss during cutover.
Architecture
┌──────────────────────────────────────────────────────────┐
│ FSx for ONTAP │
│ │
│ Audit Volume ──→ S3 Access Point │
│ │ │
│ ▼ │
│ EventBridge Scheduler (rate: 5 min) │
│ │ │
│ ▼ │
│ Lambda (Python 3.12) │
│ • Reads audit logs via S3 AP │
│ • Parses JSON/EVTX │
│ • Formats as Splunk HEC events │
│ • Sends with Authorization: Splunk <token> │
│ • Checkpoints in SSM Parameter Store │
│ │ │
│ ▼ │
│ Splunk HEC │
│ https://<splunk>:8088/services/collector/event │
│ Response: {"text":"Success","code":0} │
│ │
│ SPL: index=fsxn_audit sourcetype=fsxn:ontap:audit │
└──────────────────────────────────────────────────────────┘
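The Lambda's send step in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual handler: the helper names `build_hec_request` and `is_hec_success` are hypothetical, and the sketch only builds the request and interprets the response body shown in the diagram.

```python
import json
import urllib.request


def build_hec_request(endpoint: str, token: str, events: list[dict]) -> urllib.request.Request:
    """Build a POST request for the HEC event endpoint (hypothetical helper)."""
    # HEC accepts several event objects in one request body, one JSON object after another.
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/services/collector/event",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )


def is_hec_success(response_body: bytes) -> bool:
    """Return True only for a payload like {"text":"Success","code":0}."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("code") == 0
```

In a real handler the request would be passed to `urllib.request.urlopen`, and the checkpoint would only be advanced when `is_hec_success` returns True.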
High-Volume Alternative: Firehose Path
For sustained loads above 1,000 events/sec, use Amazon Data Firehose (formerly Kinesis Data Firehose) with its built-in Splunk destination:
FSx for ONTAP → S3 AP → Lambda (transform) → Kinesis Data Firehose → Splunk HEC
A separate template-firehose.yaml is provided for this path.
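For the transform Lambda on this path, events have to be grouped before they are handed to Firehose, because a single `PutRecordBatch` call accepts at most 500 records and 4 MiB. The sketch below shows one way to do that grouping. The function name `batch_for_firehose` is hypothetical and this is not taken from template-firehose.yaml.

```python
import json
from typing import Iterator

MAX_RECORDS_PER_BATCH = 500              # PutRecordBatch record-count limit
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024    # PutRecordBatch request-size limit


def batch_for_firehose(events: list[dict]) -> Iterator[list[dict]]:
    """Yield lists of {"Data": bytes} records that fit one PutRecordBatch call."""
    batch: list[dict] = []
    batch_bytes = 0
    for event in events:
        # Newline-delimited so records stay separable downstream.
        data = (json.dumps(event) + "\n").encode("utf-8")
        too_many = len(batch) >= MAX_RECORDS_PER_BATCH
        too_big = batch_bytes + len(data) > MAX_BYTES_PER_BATCH
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"Data": data})
        batch_bytes += len(data)
    if batch:
        yield batch
```

Each yielded list is shaped like the `Records` argument of boto3's `put_record_batch`. Retrying the failed records reported in that call's response is left out of the sketch.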
Migration Strategy (Zero Data Loss)
Phase 1: Parallel Deployment (Day 1-3)
Deploy the serverless stack alongside the existing EC2 pipeline. Use a separate Splunk index for validation:
aws cloudformation deploy \
--template-file integrations/splunk-serverless/template.yaml \
--stack-name fsxn-splunk-integration \
--parameter-overrides \
S3AccessPointArn=<S3_AP_ARN> \
SplunkHecTokenSecretArn=<SECRET_ARN> \
SplunkHecEndpoint=https://splunk.example.com:8088 \
S3BucketName=<BUCKET> \
SplunkIndex=fsxn_audit_serverless \
--capabilities CAPABILITY_IAM
Compare events between old and new pipelines for 48 hours:
index IN ("fsxn_audit", "fsxn_audit_serverless") earliest=-48h
| stats count by index
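Once the per-index counts are in hand, "event parity" needs a concrete pass/fail rule. The sketch below is one possible rule, assuming a 1% tolerance. That threshold and the function name `parity_report` are illustrative, not project defaults. Some gap is expected: the serverless index can trail by up to one polling interval at the query boundary, and checkpoint retries can add a few duplicates.

```python
def parity_report(ec2_count: int, serverless_count: int, tolerance: float = 0.01) -> dict:
    """Compare per-index event counts from the two pipelines.

    `tolerance` (1% here) is an illustrative threshold, not a project default.
    """
    if ec2_count == 0:
        # No baseline to compare against; parity only holds if both are empty.
        return {
            "difference": serverless_count,
            "relative_gap": None,
            "within_tolerance": serverless_count == 0,
        }
    relative_gap = abs(serverless_count - ec2_count) / ec2_count
    return {
        "difference": serverless_count - ec2_count,
        "relative_gap": relative_gap,
        "within_tolerance": relative_gap <= tolerance,
    }
```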
Phase 2: Cutover (Day 4-5)
Once event parity is confirmed:
- Update the stack to use the production index (`fsxn_audit`)
- Stop the syslog-ng and UF services on EC2 (don't terminate yet)
- Monitor for 24 hours
Phase 3: Cleanup (Day 7+)
# Terminate EC2 instances
# Remove security groups, IAM roles, EBS volumes
# Delete old CloudFormation/Terraform resources
What Changes for Splunk Users
Unchanged ✅
- Index name and sourcetype (configurable)
- SPL queries — same field names
- Dashboards and saved searches
- Alert rules
Changed ⚠️
- `host` field: EC2 hostname → SVM name
- `source` field: syslog path → `fsxn-observability`
- Delivery latency: near-real-time (syslog) → polling interval (default 5 min)
HEC Event Format
{
"time": 1716508800,
"host": "svm-prod-01",
"source": "fsxn-observability",
"sourcetype": "fsxn:ontap:audit",
"index": "fsxn_audit",
"event": {
"event_type": "4663",
"user": "admin@corp.local",
"operation": "ReadData",
"path": "/vol/data/report.pdf",
"result": "Success",
"client_ip": "10.0.1.50"
}
}
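A parsed audit record could be wrapped into this envelope roughly as follows. This is a sketch under assumptions: the input record shape (a flat dict with an ISO 8601 `timestamp` key) and the function name `to_hec_event` are hypothetical, and the real parser output may differ. Only the envelope fields come from the format above.

```python
from datetime import datetime, timezone


def to_hec_event(record: dict, svm_name: str, index: str = "fsxn_audit") -> dict:
    """Wrap one parsed audit record in the HEC envelope shown above."""
    ts = datetime.fromisoformat(record["timestamp"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return {
        "time": int(ts.timestamp()),      # epoch seconds
        "host": svm_name,                 # SVM name replaces the EC2 hostname
        "source": "fsxn-observability",
        "sourcetype": "fsxn:ontap:audit",
        "index": index,
        # Everything except the timestamp becomes the searchable event body.
        "event": {k: v for k, v in record.items() if k != "timestamp"},
    }
```

Keeping the field names inside `event` unchanged is what lets existing SPL queries keep working after cutover.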
SPL Query Examples
# Failed access attempts
index=fsxn_audit sourcetype=fsxn:ontap:audit result=Failure
| stats count by user, path
| sort -count
# Operations timeline
index=fsxn_audit sourcetype=fsxn:ontap:audit
| timechart span=5m count by operation
# Top users
index=fsxn_audit sourcetype=fsxn:ontap:audit
| stats count by user
| sort -count
| head 20
# Specific user investigation
index=fsxn_audit sourcetype=fsxn:ontap:audit user="admin@corp.local"
| table _time, operation, path, result, client_ip
Cost Comparison
| Component | EC2-Based (monthly) | Serverless (monthly) | Savings |
|---|---|---|---|
| EC2 instances (2× t3.medium) | $60 | $0 | 100% |
| EBS volumes (2× 20GB) | $6 | $0 | 100% |
| Lambda | $0 | ~$5 | — |
| EventBridge Scheduler | $0 | ~$0.01 | — |
| Secrets Manager | $0 | ~$0.40 | — |
| Total | $66 | ~$6 | ~91% |
Note: EC2 cost assumes 2× t3.medium (as per the AWS Blog reference architecture). Actual EC2 costs vary by instance type and region. Splunk Cloud licensing costs are contract-dependent and may differ significantly from list pricing.
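The arithmetic behind the totals can be checked with a few lines. The figures are the ones from the table. The only additions are the dictionary names and the `savings_percent` helper, which are illustrative.

```python
# Monthly figures copied from the cost table (USD).
ec2_based = {"EC2 instances (2x t3.medium)": 60.00, "EBS volumes (2x 20GB)": 6.00}
serverless = {"Lambda": 5.00, "EventBridge Scheduler": 0.01, "Secrets Manager": 0.40}


def savings_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


ec2_total = sum(ec2_based.values())
serverless_total = sum(serverless.values())

print(f"EC2-based total:  ${ec2_total:.2f}")
print(f"Serverless total: ${serverless_total:.2f}")
print(f"Savings (itemized):     {savings_percent(ec2_total, serverless_total):.1f}%")
print(f"Savings (rounded to $6): {savings_percent(ec2_total, 6.00):.1f}%")
```

The itemized serverless total is $5.41, which the table rounds up to about $6. The saving is roughly 91 to 92% depending on which figure is used, consistent with the headline claim of about 90%.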
Network Considerations
| Splunk Deployment | Lambda Config | Notes |
|---|---|---|
| Splunk Cloud (public HEC) | Lambda outside VPC | Simplest |
| Splunk Enterprise (private VPC) | Lambda in VPC + NAT | Same VPC as Splunk |
| Splunk Cloud (PrivateLink) | Lambda in VPC + VPC Endpoint | Most secure |
⚠️ VerifySSL: Set to `true` in production. Only use `false` for self-signed certs in dev environments.
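In Python, a flag like this typically maps to the TLS context used for the HEC request. The sketch below shows one way such a flag could be honored with the standard library. The function name `make_ssl_context` and the `verify_ssl` argument are hypothetical, and the actual Lambda may wire the VerifySSL parameter differently.

```python
import ssl


def make_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    """Return a TLS context; verification stays on unless explicitly disabled."""
    ctx = ssl.create_default_context()  # verifies certificates and hostnames by default
    if not verify_ssl:
        # Dev only (self-signed certs). check_hostname must be turned off
        # before verify_mode can be set to CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The resulting context can be passed as `context=` to `urllib.request.urlopen`.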
Rollback Plan
If issues are discovered after cutover (and before the Phase 3 cleanup, while the EC2 instances still exist):
- Restart the stopped syslog-ng and UF services on the EC2 instances
- Verify syslog-ng is receiving events
- Delete the serverless CloudFormation stack
- Investigate and resolve before re-attempting
The serverless Lambda uses checkpointing — no events are lost during the overlap period (brief duplicates are possible).
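The reason checkpointing gives "no loss, possible duplicates" is the ordering: the checkpoint only moves forward after a successful send. The sketch below illustrates that ordering in plain Python. It is not the repository's implementation: the function name and the `(last_modified, key)` tuple shape are assumptions, and persisting the checkpoint (SSM Parameter Store in the architecture above) is left out.

```python
from typing import Callable


def process_since_checkpoint(
    objects: list[tuple[str, str]],
    checkpoint: str,
    send: Callable[[str], bool],
) -> str:
    """Send objects newer than `checkpoint`; return the new checkpoint.

    `objects` holds (last_modified, key) pairs whose timestamps share one
    ISO 8601 format, so plain string comparison orders them correctly.
    """
    for last_modified, key in sorted(objects):
        if last_modified <= checkpoint:
            continue  # already delivered in an earlier run
        if not send(key):
            break  # keep the checkpoint at the last success; retry next run
        checkpoint = last_modified  # advance only after a confirmed send
    return checkpoint
```

If a run fails after a send succeeds but before the new checkpoint is stored, the next run sends that object again. That is where the brief duplicates come from. The sketch also simplifies one edge case: objects sharing the exact timestamp of the checkpoint are skipped.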
What's Next
- Firehose path: For high-volume logs (>1000 events/sec), use `template-firehose.yaml`
- HEC Acknowledgment (`useACK`): For Level 2+, enable HEC indexer acknowledgment to guarantee at-least-once delivery. Lambda waits for the ack before advancing the checkpoint
- CIM compliance: Map fields to Splunk's Common Information Model (`Authentication` or `Change` data model) for compatibility with Splunk Enterprise Security correlation searches
- Index pre-creation: The `fsxn_audit` index must be created before first ingestion (Splunk Cloud: Admin Console; Enterprise: `indexes.conf`)
- EMS webhooks: Real-time ARP ransomware detection alerts
- FPolicy: Sub-second file operation streaming
- Production Readiness: Progress from Level 1 (this Quick Start) to Level 4 (Enterprise) — see the Pipeline SLO Definitions
Production Readiness
This integration follows the project's Production Readiness Levels:
| Level | What You Get | Go/No-Go to Next |
|---|---|---|
| Level 1 (this Quick Start) | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2 | + Splunk dashboards + alerts | SLOs met 7 days, security review done |
| Level 3 | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4 | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |
Data classification: Splunk receives `user` and `path` fields (PII/sensitive). For Splunk Cloud, data is processed in the vendor's infrastructure. For self-hosted Splunk Enterprise, data stays in your VPC. See Data Classification Guide for field-by-field PII classification and handling patterns.
Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook
Resources
- GitHub: fsxn-observability-integrations/integrations/splunk-serverless
- Migration Guide (detailed)
- AWS Blog: EC2-based approach
- Splunk HEC Documentation
- Pipeline SLO Definitions
- Data Classification Guide
Series Navigation
- Part 1: Why Your FSx for ONTAP Logs Deserve Better
- Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
- Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
- Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
- Part 5: Escape Vendor Lock-in with OTel Collector
- Part 6: Direct-to-Grafana: Shipping Logs via OTLP Gateway
- Part 7: Ship FSx for ONTAP Audit Logs to New Relic via Serverless Lambda Pipeline
- Part 8: EC2 to Serverless: Modernizing Splunk Integration (this post)
- Part 9: Data Sovereignty with Elastic
- Part 10: High-Cardinality Analysis with Honeycomb
- Part 11: AI-Powered Root Cause with Dynatrace
- Part 12: JP Region with Sumo Logic
- Part 13: 9 Vendors, One Architecture: Lessons Learned
Questions about the Splunk migration or serverless HEC delivery? Drop a comment below.
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations