EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration

TL;DR

The existing AWS Blog approach ships FSx for ONTAP audit logs to Splunk via two EC2 instances (syslog-ng + Universal Forwarder). We replaced it with a single Lambda function that writes to the same Splunk index and supports the same SPL queries, while cutting AWS infrastructure cost by roughly 90%.

[Before] FSx for ONTAP → syslog-ng (EC2) → Splunk UF (EC2) → Splunk
         Monthly AWS infra cost: ~$66 (2× t3.medium + EBS)
         Ops burden: OS patching, agent updates, scaling

[After]  FSx for ONTAP → S3 Access Point → Lambda → Splunk HEC
         Monthly AWS infra cost: ~$6 (Lambda + EventBridge + Secrets Manager)
         Ops burden: Zero (managed services only)

Important: The 90% cost reduction refers to AWS infrastructure costs only (EC2/Lambda/EventBridge). Splunk platform licensing costs remain unchanged regardless of the delivery method.

This is Part 8 of the Serverless Observability for FSx for ONTAP series.


The Problem with EC2-Based Splunk Integration

The AWS Blog's architecture works, but it comes with operational overhead:

Concern         | EC2-Based                              | Serverless
Monthly cost    | ~$66 fixed                             | ~$6 pay-per-use
OS patching     | Monthly                                | None
Agent updates   | Manual (UF + syslog-ng)                | None
Scaling         | Manual instance resize                 | Automatic (Lambda concurrency)
Availability    | Single AZ (unless you add redundancy)  | Multi-AZ by default
Time to deploy  | Hours (provision + configure)          | 30 minutes (CloudFormation)

If you're already running this EC2 pattern and want to modernize, this article shows you how, using a parallel deployment strategy designed so that no audit events are lost during cutover.

Architecture

┌──────────────────────────────────────────────────────────┐
│ FSx for ONTAP                                            │
│                                                          │
│  Audit Volume ──→ S3 Access Point                        │
│                        │                                 │
│                        ▼                                 │
│  EventBridge Scheduler (rate: 5 min)                     │
│                        │                                 │
│                        ▼                                 │
│  Lambda (Python 3.12)                                    │
│    • Reads audit logs via S3 AP                          │
│    • Parses JSON/EVTX                                    │
│    • Formats as Splunk HEC events                        │
│    • Sends with Authorization: Splunk <token>            │
│    • Checkpoints in SSM Parameter Store                  │
│                        │                                 │
│                        ▼                                 │
│  Splunk HEC                                              │
│  https://<splunk>:8088/services/collector/event          │
│  Response: {"text":"Success","code":0}                   │
│                                                          │
│  SPL: index=fsxn_audit sourcetype=fsxn:ontap:audit       │
└──────────────────────────────────────────────────────────┘
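The HEC delivery step in the diagram can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual Lambda code; the helper names `build_hec_request` and `hec_accepted` are hypothetical, and the endpoint and token values are placeholders.

```python
import json
import urllib.request


def build_hec_request(endpoint: str, token: str, events: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST to the HEC event endpoint."""
    # HEC accepts several event objects concatenated in one request body.
    body = "\n".join(json.dumps(e, separators=(",", ":")) for e in events).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/services/collector/event",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Splunk {token}",  # header format shown in the diagram
            "Content-Type": "application/json",
        },
    )


def hec_accepted(response_body: bytes) -> bool:
    """True when HEC reports code 0, as in {"text":"Success","code":0}."""
    return json.loads(response_body).get("code") == 0
```

To actually deliver, the request would be passed to `urllib.request.urlopen(req, timeout=10)` and the response body checked with `hec_accepted` before the checkpoint is advanced.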

High-Volume Alternative: Firehose Path

For sustained >1000 events/sec, use Amazon Data Firehose (formerly Kinesis Data Firehose) with its built-in Splunk destination:

FSx for ONTAP → S3 AP → Lambda (transform) → Kinesis Data Firehose → Splunk HEC

A separate template-firehose.yaml is provided for this path.
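On this path the transform Lambda has to respect the Firehose batch limits when forwarding events. The sketch below assumes the documented PutRecordBatch limits of 500 records and 4 MiB per call; `chunk_for_firehose` is a hypothetical helper and is not taken from template-firehose.yaml.

```python
from typing import Iterator

MAX_RECORDS = 500            # PutRecordBatch record-count limit per call
MAX_BYTES = 4 * 1024 * 1024  # PutRecordBatch payload limit per call


def chunk_for_firehose(records: list[bytes]) -> Iterator[list[bytes]]:
    """Yield batches that stay within both the count and the size limit."""
    batch: list[bytes] = []
    size = 0
    for rec in records:
        # Flush the current batch if adding this record would exceed a limit.
        if batch and (len(batch) >= MAX_RECORDS or size + len(rec) > MAX_BYTES):
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += len(rec)
    if batch:
        yield batch
```

Each yielded batch would then be handed to a single `put_record_batch` call, with any failed records from the response retried.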

Migration Strategy (Zero Data Loss)

Phase 1: Parallel Deployment (Day 1-3)

Deploy the serverless stack alongside the existing EC2 pipeline. Use a separate Splunk index for validation:

aws cloudformation deploy \
  --template-file integrations/splunk-serverless/template.yaml \
  --stack-name fsxn-splunk-integration \
  --parameter-overrides \
    S3AccessPointArn=<S3_AP_ARN> \
    SplunkHecTokenSecretArn=<SECRET_ARN> \
    SplunkHecEndpoint=https://splunk.example.com:8088 \
    S3BucketName=<BUCKET> \
    SplunkIndex=fsxn_audit_serverless \
  --capabilities CAPABILITY_IAM

Compare events between old and new pipelines for 48 hours:

index IN ("fsxn_audit", "fsxn_audit_serverless")
| stats count by index
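Because the serverless path delivers on a polling interval, raw totals can differ slightly at any given moment. One way to judge parity is to compare counts per time bucket, skipping the most recent bucket, and flag only the buckets that differ by more than a tolerance. The sketch below assumes the per-bucket counts have already been exported from Splunk into dictionaries; `parity_gaps` and the 1% default tolerance are illustrative choices, not part of the project.

```python
def parity_gaps(
    old: dict[str, int],
    new: dict[str, int],
    tolerance: float = 0.01,
) -> dict[str, tuple[int, int]]:
    """Return buckets whose counts differ by more than `tolerance` of the old count."""
    gaps: dict[str, tuple[int, int]] = {}
    for bucket in sorted(old.keys() | new.keys()):
        a, b = old.get(bucket, 0), new.get(bucket, 0)
        # max(a, 1) keeps the threshold meaningful when the old count is zero.
        if abs(a - b) > tolerance * max(a, 1):
            gaps[bucket] = (a, b)
    return gaps
```

An empty result over the full 48-hour window is a reasonable signal that event parity holds.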

Phase 2: Cutover (Day 4-5)

Once event parity is confirmed:

  1. Update the stack to use the production index (fsxn_audit)
  2. Stop the syslog-ng and UF services on EC2 (don't terminate yet)
  3. Monitor for 24 hours

Phase 3: Cleanup (Day 7+)

# Terminate EC2 instances
# Remove security groups, IAM roles, EBS volumes
# Delete old CloudFormation/Terraform resources

What Changes for Splunk Users

Unchanged ✅

  • Index name and sourcetype (configurable)
  • SPL queries — same field names
  • Dashboards and saved searches
  • Alert rules

Changed ⚠️

  • host field: EC2 hostname → SVM name
  • source field: syslog path → fsxn-observability
  • Delivery latency: near-real-time (syslog) → polling interval (default 5 min)

HEC Event Format

{
  "time": 1716508800,
  "host": "svm-prod-01",
  "source": "fsxn-observability",
  "sourcetype": "fsxn:ontap:audit",
  "index": "fsxn_audit",
  "event": {
    "event_type": "4663",
    "user": "admin@corp.local",
    "operation": "ReadData",
    "path": "/vol/data/report.pdf",
    "result": "Success",
    "client_ip": "10.0.1.50"
  }
}
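As a rough sketch, the envelope above could be produced from a parsed audit record like this. The input record shape (in particular the `timestamp` key holding an ISO 8601 string) and the function name `to_hec_event` are assumptions made for illustration; the real parser output may differ.

```python
from datetime import datetime, timezone


def to_hec_event(record: dict, svm_name: str, index: str = "fsxn_audit") -> dict:
    """Wrap one parsed audit record in the HEC envelope shown above."""
    ts = datetime.fromisoformat(record["timestamp"])  # hypothetical field name
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    payload = {k: v for k, v in record.items() if k != "timestamp"}
    return {
        "time": int(ts.timestamp()),  # epoch seconds
        "host": svm_name,             # SVM name replaces the EC2 hostname
        "source": "fsxn-observability",
        "sourcetype": "fsxn:ontap:audit",
        "index": index,
        "event": payload,
    }
```

Keeping `index` as a parameter matches the migration flow, where the same code writes to fsxn_audit_serverless during validation and to fsxn_audit after cutover.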

SPL Query Examples

# Failed access attempts
index=fsxn_audit sourcetype=fsxn:ontap:audit result=Failure
| stats count by user, path
| sort -count

# Operations timeline
index=fsxn_audit sourcetype=fsxn:ontap:audit
| timechart span=5m count by operation

# Top users
index=fsxn_audit sourcetype=fsxn:ontap:audit
| stats count by user
| sort -count
| head 20

# Specific user investigation
index=fsxn_audit sourcetype=fsxn:ontap:audit user="admin@corp.local"
| table _time, operation, path, result, client_ip

Cost Comparison

Component                     | EC2-Based (monthly) | Serverless (monthly) | Savings
EC2 instances (2× t3.medium)  | $60                 | $0                   | 100%
EBS volumes (2× 20GB)         | $6                  | $0                   | 100%
Lambda                        | $0                  | ~$5                  | -
EventBridge Scheduler         | $0                  | ~$0.01               | -
Secrets Manager               | $0                  | ~$0.40               | -
Total                         | $66                 | $6                   | 91%

Note: EC2 cost assumes 2× t3.medium (as per the AWS Blog reference architecture). Actual EC2 costs vary by instance type and region. Splunk Cloud licensing costs are contract-dependent and may differ significantly from list pricing.
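The totals can be reproduced from the line items in the table. The short calculation below uses only the figures stated above; the serverless total is rounded up to $6 in the table, which is why the headline saving is quoted as about 90 to 91%.

```python
ec2_based = {"EC2 instances (2x t3.medium)": 60.00, "EBS volumes (2x 20GB)": 6.00}
serverless = {"Lambda": 5.00, "EventBridge Scheduler": 0.01, "Secrets Manager": 0.40}

before = sum(ec2_based.values())
after = sum(serverless.values())

# Savings using the exact line-item sum, and using the rounded-up $6 total.
savings_exact = (before - after) / before * 100
savings_rounded = (before - 6.00) / before * 100

print(f"before=${before:.2f} after=${after:.2f}")
print(f"savings: {savings_exact:.1f}% exact, {savings_rounded:.1f}% with the $6 total")
```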

Network Considerations

Splunk Deployment                | Lambda Config                 | Notes
Splunk Cloud (public HEC)        | Lambda outside VPC            | Simplest
Splunk Enterprise (private VPC)  | Lambda in VPC + NAT           | Same VPC as Splunk
Splunk Cloud (PrivateLink)       | Lambda in VPC + VPC Endpoint  | Most secure

⚠️ VerifySSL: Set to true in production. Only use false for self-signed certs in dev environments.
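How a VerifySSL-style flag might map onto the TLS settings in Python is sketched below. The function name `hec_ssl_context` is hypothetical, and whether the project's Lambda handles the parameter this way is an assumption.

```python
import ssl


def hec_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    """Return a TLS context for HEC calls; verification is on unless explicitly disabled."""
    ctx = ssl.create_default_context()  # verifies the certificate chain and hostname
    if not verify_ssl:
        # Dev only: accept self-signed certificates.
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The resulting context can be passed as `context=` to `urllib.request.urlopen`. For a private CA, loading the CA bundle into the default context is usually preferable to disabling verification.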

Rollback Plan

If issues are discovered after cutover:

  1. Start the stopped EC2 instances (syslog-ng + UF)
  2. Verify syslog-ng is receiving events
  3. Delete the serverless CloudFormation stack
  4. Investigate and resolve before re-attempting

The Lambda function checkpoints its progress, so no events are lost during the overlap period, although brief duplicates are possible.
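The reason checkpointing gives duplicates rather than loss can be seen in a small sketch: the checkpoint moves forward only after a successful send. This is an illustration under assumptions, not the project's code. It assumes object keys sort in time order, and `send` and `save_checkpoint` are hypothetical callables standing in for HEC delivery and the SSM Parameter Store write.

```python
from typing import Callable, Iterable


def process_batch(
    keys: Iterable[str],
    checkpoint: str,
    send: Callable[[str], bool],
    save_checkpoint: Callable[[str], None],
) -> str:
    """Deliver keys newer than `checkpoint`, advancing it only after each success."""
    for key in sorted(keys):
        if key <= checkpoint:
            continue  # already delivered in an earlier run
        if not send(key):
            break     # stop here; this key is retried on the next invocation
        checkpoint = key
        save_checkpoint(checkpoint)
    return checkpoint
```

If a send succeeds but the checkpoint write fails, the same key is sent again on the next run, which produces a duplicate event instead of a gap.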

What's Next

  • Firehose path: For high-volume logs (>1000 events/sec), use template-firehose.yaml
  • HEC Acknowledgment (useACK): For Level 2+, enable HEC indexer acknowledgment to guarantee at-least-once delivery. The Lambda function waits for the ack before advancing the checkpoint.
  • CIM compliance: Map fields to Splunk's Common Information Model (Authentication or Change data model) for compatibility with Splunk Enterprise Security correlation searches
  • Index pre-creation: The fsxn_audit index must be created before first ingestion (Splunk Cloud: Admin Console; Enterprise: indexes.conf)
  • EMS webhooks: Real-time alerts from ONTAP Autonomous Ransomware Protection (ARP)
  • FPolicy: Sub-second file operation streaming
  • Production Readiness: Progress from Level 1 (this Quick Start) to Level 4 (Enterprise); see the Pipeline SLO Definitions
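For the HEC Acknowledgment item above, a rough sketch of the ack check is shown here. It assumes the standard HEC ack flow: events are posted with an `X-Splunk-Request-Channel` header, each response carries an `ackId`, and the IDs are later queried at `/services/collector/ack`. The helper names are hypothetical and the project's planned implementation may differ.

```python
import json


def build_ack_query(ack_ids: list[int]) -> bytes:
    """Request body for POST /services/collector/ack."""
    return json.dumps({"acks": ack_ids}).encode("utf-8")


def all_acked(response_body: bytes, ack_ids: list[int]) -> bool:
    """True only when every queried ackId is reported as indexed."""
    # The ack endpoint answers with a map such as {"acks": {"0": true, "1": false}}.
    statuses = json.loads(response_body).get("acks", {})
    return all(statuses.get(str(i)) is True for i in ack_ids)
```

The checkpoint would be advanced only once `all_acked` returns True for the batch; otherwise the batch is polled again or resent.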

Production Readiness

This integration follows the project's Production Readiness Levels:

Level                       | What You Get                     | Go/No-Go to Next
Level 1 (this Quick Start)  | Audit poller + DLQ               | Logs arrive, checkpoint advances, DLQ empty 24h
Level 2                     | + Splunk dashboards + alerts     | SLOs met 7 days, security review done
Level 3                     | + DynamoDB ledger + poison-pill  | SLOs met 30 days, compliance pack
Level 4                     | + OTel Collector + redaction     | Multi-backend, PII redaction, DR tested

Data classification: Splunk receives user and path fields (PII/sensitive). For Splunk Cloud, data is processed in the vendor's infrastructure. For self-hosted Splunk Enterprise, data stays in your VPC. See Data Classification Guide for field-by-field PII classification and handling patterns.

Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook

Questions about the Splunk migration or serverless HEC delivery? Drop a comment below.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
