Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 30

EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration

#aws #splunk #observability #amazonfsxfornetappontap

TL;DR

The existing AWS Blog approach ships FSx for ONTAP audit logs to Splunk via two EC2 instances (syslog-ng + Universal Forwarder). We replaced it with a single Lambda function: same Splunk index, same SPL queries, and roughly 90% lower AWS infrastructure cost.

[Before] FSx for ONTAP → syslog-ng (EC2) → Splunk UF (EC2) → Splunk
         Monthly AWS infra cost: ~$66 (2× t3.medium + EBS)
         Ops burden: OS patching, agent updates, scaling

[After]  FSx for ONTAP → S3 Access Point → Lambda → Splunk HEC
         Monthly AWS infra cost: ~$6 (Lambda + EventBridge + Secrets Manager)
         Ops burden: No servers to patch or scale (managed services only)

Important: The 90% cost reduction refers to AWS infrastructure costs only (EC2/Lambda/EventBridge). Splunk platform licensing costs remain unchanged regardless of the delivery method.

This is Part 8 of the Serverless Observability for FSx for ONTAP series.

The Problem with EC2-Based Splunk Integration

The AWS Blog's architecture works, but it comes with operational overhead:

Concern	EC2-Based	Serverless
Monthly cost	~$66 fixed	~$6 pay-per-use
OS patching	Monthly	None
Agent updates	Manual (UF + syslog-ng)	None
Scaling	Manual instance resize	Automatic (Lambda concurrency)
Availability	Single AZ (unless you add redundancy)	Multi-AZ by default
Time to deploy	Hours (provision + configure)	30 minutes (CloudFormation)

If you're already running this EC2 pattern and want to modernize, this article shows you how — with a parallel deployment strategy that ensures zero data loss during cutover.

Architecture

┌──────────────────────────────────────────────────────────┐
│ FSx for ONTAP                                            │
│                                                          │
│  Audit Volume ──→ S3 Access Point                        │
│                        │                                 │
│                        ▼                                 │
│  EventBridge Scheduler (rate: 5 min)                     │
│                        │                                 │
│                        ▼                                 │
│  Lambda (Python 3.12)                                    │
│    • Reads audit logs via S3 AP                          │
│    • Parses JSON/EVTX                                    │
│    • Formats as Splunk HEC events                        │
│    • Sends with Authorization: Splunk <token>            │
│    • Checkpoints in SSM Parameter Store                  │
│                        │                                 │
│                        ▼                                 │
│  Splunk HEC                                              │
│  https://<splunk>:8088/services/collector/event          │
│  Response: {"text":"Success","code":0}                   │
│                                                          │
│  SPL: index=fsxn_audit sourcetype=fsxn:ontap:audit       │
└──────────────────────────────────────────────────────────┘
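The poll-and-checkpoint loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual Lambda code: `InMemoryCheckpoint` stands in for SSM Parameter Store, `list_keys` and `send` are hypothetical callables for the S3 Access Point listing and the HEC POST, and the sketch assumes audit log object keys sort chronologically.

```python
from typing import Callable, Iterable, Optional


class InMemoryCheckpoint:
    """Stand-in for SSM Parameter Store (assumption: one string value)."""

    def __init__(self) -> None:
        self.value: Optional[str] = None

    def get(self) -> Optional[str]:
        return self.value

    def put(self, key: str) -> None:
        self.value = key


def poll_once(
    list_keys: Callable[[], Iterable[str]],
    send: Callable[[str], bool],
    checkpoint: InMemoryCheckpoint,
) -> int:
    """Process keys newer than the checkpoint; return how many were sent.

    The checkpoint advances only after a successful send, so a failed
    delivery is retried on the next scheduled run (at-least-once).
    """
    last = checkpoint.get()
    sent = 0
    # Assumption: keys sort lexicographically in chronological order.
    for key in sorted(list_keys()):
        if last is not None and key <= last:
            continue  # already delivered in an earlier run
        if not send(key):
            break  # stop here; this key is retried next time
        checkpoint.put(key)
        sent += 1
    return sent
```

Advancing the checkpoint per object rather than per run keeps the replay window small if the function times out partway through a batch.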

High-Volume Alternative: Firehose Path

For sustained loads above 1,000 events/sec, use Amazon Data Firehose (formerly Kinesis Data Firehose) with its built-in Splunk destination:

FSx for ONTAP → S3 AP → Lambda (transform) → Kinesis Data Firehose → Splunk HEC

A separate template-firehose.yaml is provided for this path.
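On this path the Lambda hands events to Firehose in batches rather than posting to HEC directly. Firehose's PutRecordBatch call accepts at most 500 records and 4 MiB per request, so the transform step needs some form of chunking. The sketch below is one way to do that; `chunk_for_firehose` is a hypothetical helper, and the code shipped with template-firehose.yaml may be organized differently.

```python
import json
from typing import Iterator

MAX_RECORDS = 500                   # PutRecordBatch record limit per call
MAX_BATCH_BYTES = 4 * 1024 * 1024   # PutRecordBatch size limit per call


def chunk_for_firehose(events: list[dict]) -> Iterator[list[dict]]:
    """Yield lists of {"Data": bytes} records that fit one PutRecordBatch call.

    Note: the per-record size limit is not enforced in this sketch.
    """
    batch: list[dict] = []
    size = 0
    for event in events:
        # One compact JSON document per record, newline-terminated.
        data = (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")
        if batch and (len(batch) == MAX_RECORDS or size + len(data) > MAX_BATCH_BYTES):
            yield batch
            batch, size = [], 0
        batch.append({"Data": data})
        size += len(data)
    if batch:
        yield batch
```

Each yielded list has the shape boto3's `put_record_batch` expects for its `Records` argument. A real caller would also inspect `FailedPutCount` in the response and resend the rejected records.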

Migration Strategy (Zero Data Loss)

Phase 1: Parallel Deployment (Day 1-3)

Deploy the serverless stack alongside the existing EC2 pipeline. Use a separate Splunk index for validation:

aws cloudformation deploy \
  --template-file integrations/splunk-serverless/template.yaml \
  --stack-name fsxn-splunk-integration \
  --parameter-overrides \
    S3AccessPointArn=<S3_AP_ARN> \
    SplunkHecTokenSecretArn=<SECRET_ARN> \
    SplunkHecEndpoint=https://splunk.example.com:8088 \
    S3BucketName=<BUCKET> \
    SplunkIndex=fsxn_audit_serverless \
  --capabilities CAPABILITY_IAM

Compare events between old and new pipelines for 48 hours:

index IN (fsxn_audit, fsxn_audit_serverless) earliest=-48h
| stats count by index
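"Event parity" is easier to sign off on with an explicit threshold. The helper below is a small sketch of such a check on the two per-index counts. The 1% default tolerance is an assumption for illustration, not a value from this project, and `parity_ok` is a hypothetical name.

```python
def parity_ok(ec2_count: int, serverless_count: int, tolerance: float = 0.01) -> bool:
    """Return True when the two pipelines' counts differ by at most `tolerance`.

    The difference is measured relative to the EC2 (reference) pipeline.
    """
    if ec2_count == 0:
        return serverless_count == 0
    return abs(serverless_count - ec2_count) / ec2_count <= tolerance
```

For example, `parity_ok(10000, 10050)` passes at the default tolerance while `parity_ok(10000, 9800)` does not. Because the serverless path delivers at least once, a small surplus on the serverless side is more likely to be duplicates than missing data on the EC2 side.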

Phase 2: Cutover (Day 4-5)

Once event parity is confirmed:

Update the stack to use the production index (fsxn_audit)
Stop the syslog-ng and UF services on EC2 (don't terminate yet)
Monitor for 24 hours

Phase 3: Cleanup (Day 7+)

# Terminate EC2 instances
# Remove security groups, IAM roles, EBS volumes
# Delete old CloudFormation/Terraform resources

What Changes for Splunk Users

Unchanged ✅

Index name and sourcetype (configurable)
SPL queries — same field names
Dashboards and saved searches
Alert rules

Changed ⚠️

host field: EC2 hostname → SVM name
source field: syslog path → fsxn-observability
Delivery latency: near-real-time (syslog) → polling interval (default 5 min)

HEC Event Format

{
  "time": 1716508800,
  "host": "svm-prod-01",
  "source": "fsxn-observability",
  "sourcetype": "fsxn:ontap:audit",
  "index": "fsxn_audit",
  "event": {
    "event_type": "4663",
    "user": "admin@corp.local",
    "operation": "ReadData",
    "path": "/vol/data/report.pdf",
    "result": "Success",
    "client_ip": "10.0.1.50"
  }
}
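A minimal sketch of how a parsed audit record could be wrapped in this envelope and prepared for HEC, using only the standard library. The function names are hypothetical, and the request is built but not sent, so the snippet runs without a Splunk instance. It relies on HEC accepting several event objects concatenated in one request body.

```python
import json
import urllib.request


def to_hec_event(record: dict, epoch: int, host: str, index: str = "fsxn_audit") -> dict:
    """Wrap one parsed audit record in the HEC envelope shown above."""
    return {
        "time": epoch,
        "host": host,
        "source": "fsxn-observability",
        "sourcetype": "fsxn:ontap:audit",
        "index": index,
        "event": record,
    }


def build_hec_request(endpoint: str, token: str, events: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST carrying a batch of HEC events."""
    # Batched events are individual JSON objects placed back to back.
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/services/collector/event",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request is then a single `urllib.request.urlopen(request)` call. A successful response carries the `{"text":"Success","code":0}` body shown in the architecture diagram.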

SPL Query Examples

# Failed access attempts
index=fsxn_audit sourcetype=fsxn:ontap:audit result=Failure
| stats count by user, path
| sort -count

# Operations timeline
index=fsxn_audit sourcetype=fsxn:ontap:audit
| timechart span=5m count by operation

# Top users
index=fsxn_audit sourcetype=fsxn:ontap:audit
| stats count by user
| sort -count
| head 20

# Specific user investigation
index=fsxn_audit sourcetype=fsxn:ontap:audit user="admin@corp.local"
| table _time, operation, path, result, client_ip

Cost Comparison

Component	EC2-Based (monthly)	Serverless (monthly)	Savings
EC2 instances (2× t3.medium)	$60	$0	100%
EBS volumes (2× 20GB)	$6	$0	100%
Lambda	$0	~$5	—
EventBridge Scheduler	$0	~$0.01	—
Secrets Manager	$0	~$0.40	—
Total	$66	~$6	~91%

Note: EC2 cost assumes 2× t3.medium (as per the AWS Blog reference architecture). Actual EC2 costs vary by instance type and region. Splunk Cloud licensing costs are contract-dependent and may differ significantly from list pricing.
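The totals can be reproduced from the line items. The arithmetic below uses only the figures in the table. The itemized serverless sum comes to $5.41, which the table rounds up to about $6, so the savings land between roughly 90.9% (using $6) and 91.8% (using $5.41).

```python
ec2_monthly = {"EC2 instances (2x t3.medium)": 60.00, "EBS volumes (2x 20GB)": 6.00}
serverless_monthly = {"Lambda": 5.00, "EventBridge Scheduler": 0.01, "Secrets Manager": 0.40}

ec2_total = sum(ec2_monthly.values())
serverless_total = sum(serverless_monthly.values())
savings_pct = (ec2_total - serverless_total) / ec2_total * 100

print(f"EC2-based:  ${ec2_total:.2f}")
print(f"Serverless: ${serverless_total:.2f}")
print(f"Savings:    {savings_pct:.1f}%")
```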

Network Considerations

Splunk Deployment	Lambda Config	Notes
Splunk Cloud (public HEC)	Lambda outside VPC	Simplest
Splunk Enterprise (private VPC)	Lambda in VPC + NAT	Same VPC as Splunk; NAT (or VPC endpoints) is for AWS API calls
Splunk Cloud (PrivateLink)	Lambda in VPC + VPC Endpoint	Most secure

⚠️ VerifySSL: Set to true in production. Only use false for self-signed certs in dev environments.
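As an illustration of what a VerifySSL switch typically controls, here is a sketch using Python's standard `ssl` module. It assumes the Lambda talks to HEC through a client that accepts an `SSLContext`; the project's actual handler may use a different HTTP library, and `hec_ssl_context` is a hypothetical name.

```python
import ssl


def hec_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    """Return an SSL context for HEC calls.

    verify_ssl=False disables certificate and hostname checks; dev only.
    """
    context = ssl.create_default_context()
    if not verify_ssl:
        # Order matters: check_hostname must be off before CERT_NONE is set.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

The returned context can be passed as the `context=` argument of `urllib.request.urlopen`. For a self-signed certificate, loading the issuing CA with `context.load_verify_locations(...)` keeps verification on and is usually a better option than turning it off.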

Rollback Plan

If issues are discovered after cutover:

Start the stopped EC2 instances (syslog-ng + UF)
Verify syslog-ng is receiving events
Delete the serverless CloudFormation stack
Investigate and resolve before re-attempting

The serverless Lambda uses checkpointing — no events are lost during the overlap period (brief duplicates are possible).

What's Next

Firehose path: For high-volume logs (>1000 events/sec), use template-firehose.yaml
HEC Acknowledgment (useACK): For Level 2+, enable HEC indexer acknowledgment to guarantee at-least-once delivery. The Lambda waits for the ack before advancing the checkpoint.
CIM compliance: Map fields to Splunk's Common Information Model (Authentication or Change data model) for compatibility with Splunk Enterprise Security correlation searches
Index pre-creation: The fsxn_audit index must be created before first ingestion (Splunk Cloud: Admin Console; Enterprise: indexes.conf)
EMS webhooks: Real-time ARP ransomware detection alerts
FPolicy: Sub-second file operation streaming
Production Readiness: Progress from Level 1 (this Quick Start) to Level 4 (Enterprise) — see the Pipeline SLO Definitions
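For the HEC acknowledgment item, the checkpoint gate can be sketched as a small pure function. With indexer acknowledgment enabled, HEC returns an `ackId` for each request, and the client later queries the ack endpoint with those IDs and receives a map of ID to true/false. The helper name `all_acked` is hypothetical, and how the Lambda retries or backs off while waiting is left out.

```python
import json


def all_acked(ack_response_body: str, ack_ids: list[int]) -> bool:
    """Return True only if every ackId we sent is reported as indexed.

    Expects the JSON body of an ack status query, e.g. {"acks": {"0": true}}.
    """
    acks = json.loads(ack_response_body).get("acks", {})
    # JSON object keys are strings, so look IDs up by their string form.
    return all(acks.get(str(ack_id)) is True for ack_id in ack_ids)
```

The checkpoint would advance only when this returns True. An ID that is missing or still false leaves the checkpoint in place, so the same objects are re-read on the next run.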

Production Readiness

This integration follows the project's Production Readiness Levels:

Level	What You Get	Go/No-Go to Next
Level 1 (this Quick Start)	Audit poller + DLQ	Logs arrive, checkpoint advances, DLQ empty 24h
Level 2	+ Splunk dashboards + alerts	SLOs met 7 days, security review done
Level 3	+ DynamoDB ledger + poison-pill	SLOs met 30 days, compliance pack
Level 4	+ OTel Collector + redaction	Multi-backend, PII redaction, DR tested

Data classification: Splunk receives user and path fields (PII/sensitive). For Splunk Cloud, data is processed in the vendor's infrastructure. For self-hosted Splunk Enterprise, data stays in your VPC. See Data Classification Guide for field-by-field PII classification and handling patterns.

Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook