Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 31

9 Services, One Architecture: What We Learned Shipping FSx for ONTAP Logs to Every Major Observability Platform

#aws #serverless #observability #amazonfsxfornetappontap

TL;DR

We built and E2E-verified serverless integrations shipping FSx for ONTAP audit logs to 9 observability platforms — all from the same architecture:

For decision makers: roughly 90% lower cost than EC2-based collectors ($66/month → $5-8/month), 9 vendor choices instead of 1, a 30-minute deploy instead of hours, and no servers to patch or scale. Four vendors offer permanent free tiers that cover most FSx for ONTAP deployments (New Relic 100 GB, Grafana Cloud 50 GB, Honeycomb 20M events, Sumo Logic 500 MB/day).

                    ┌─────────────────────────────────────────────┐
                    │         One Architecture, 9 Backends        │
                    ├─────────────────────────────────────────────┤
                    │                                             │
                    │  FSx for ONTAP ──→ S3 Access Point          │
                    │       │                                     │
                    │       ▼                                     │
                    │  EventBridge Scheduler (5 min)              │
                    │       │                                     │
                    │       ▼                                     │
                    │  Lambda (vendor-specific handler)           │
                    │       │                                     │
                    │       ├──→ Datadog (Logs API v2)            │
                    │       ├──→ New Relic (Log API v1)           │
                    │       ├──→ Splunk (HEC)                     │
                    │       ├──→ Grafana Cloud (OTLP Gateway)     │
                    │       ├──→ Elastic (Bulk API)               │
                    │       ├──→ Dynatrace (Log Ingest v2)        │
                    │       ├──→ Sumo Logic (HTTP Source)         │
                    │       ├──→ Honeycomb (Events Batch API)     │
                    │       └──→ OTel Collector (OTLP/HTTP)       │
                    │                                             │
                    └─────────────────────────────────────────────┘

12 articles, 9 vendors, 3 event sources (audit logs, EMS webhooks, FPolicy), all CloudFormation-templated, all tested with real FSx for ONTAP data. This post distills what we learned.

This is Part 13 — the series finale — of Serverless Observability for FSx for ONTAP.

The Architecture That Survived 9 Integrations

After implementing 9 vendor integrations, the core pattern remained unchanged:

def lambda_handler(event, context):
    # 1. Get cached credentials (Secrets Manager + TTL, default 5 min)
    creds = auth.get()

    # 2. List new files since checkpoint (S3 AP + SSM)
    new_keys = list_new_keys(s3_ap_arn, prefix, checkpoint)

    # Simplified: the actual implementation batches events across files
    # and respects vendor-specific batch size limits.
    for key in new_keys:
        # 3. Read, parse, and format each file (vendor-specific)
        logs = read_and_parse(key)
        payload = format_for_vendor(logs)  # Only this changes per vendor

        # 4. Ship with retry (vendor API); raises on failure
        ship_to_vendor(payload, creds)

        # 5. Advance checkpoint (only after confirmed delivery)
        update_checkpoint(key)

What changes per vendor: only the formatting and HTTP call (~50-100 lines). Everything else — S3 AP access, checkpoint management, DLQ handling, credential caching, retry logic — is shared.
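
To make the split concrete, here is a minimal sketch of what two vendor formatters could look like. It is not the repository's code: the function names, the `FORMATTERS` registry, and the `ddsource`, `service`, and `sourcetype` values are assumptions chosen for illustration. The payload shapes follow the public Datadog Logs API v2 (a JSON array of log objects) and Splunk HEC (event objects concatenated in one body).

```python
import json
from typing import Any, Callable, Dict, List

# One parsed audit record. The keys inside it are illustrative assumptions.
AuditEvent = Dict[str, Any]


def format_for_datadog(events: List[AuditEvent]) -> bytes:
    """Datadog Logs API v2 accepts a JSON array of log objects."""
    return json.dumps(
        [
            {
                "ddsource": "fsx-ontap",   # assumed source tag
                "service": "fsx-audit",    # assumed service name
                "message": json.dumps(event),
            }
            for event in events
        ]
    ).encode("utf-8")


def format_for_splunk_hec(events: List[AuditEvent]) -> bytes:
    """Splunk HEC accepts event objects concatenated in one request body."""
    return "\n".join(
        json.dumps({"sourcetype": "fsx:ontap:audit", "event": event})  # assumed sourcetype
        for event in events
    ).encode("utf-8")


# The shared handler looks up a formatter by name; nothing else is vendor-aware.
FORMATTERS: Dict[str, Callable[[List[AuditEvent]], bytes]] = {
    "datadog": format_for_datadog,
    "splunk": format_for_splunk_hec,
}
```

With a registry like this, the handler would call `FORMATTERS[vendor](logs)` and leave listing, checkpointing, and retries untouched.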

Cross-Vendor Comparison: The Numbers

API Characteristics

Vendor	Endpoint	Auth Model	Max Batch	Success Code	Firehose
Datadog	Logs API v2	Header (`DD-API-KEY`)	5 MB / 1000 items	200	Yes
New Relic	Log API v1	Header (`Api-Key`)	1 MB	202	Yes
Splunk	HEC	Header (`Splunk <token>`)	No hard limit	200	Yes (built-in)
Grafana	OTLP Gateway	Basic Auth (base64)	~4 MB	200	No
Elastic	Bulk API	Header (`ApiKey <b64>`)	~10 MB	200	No
Dynatrace	Log Ingest v2	Header (`Api-Token`)	1 MB	204	Via ActiveGate
Sumo Logic	HTTP Source	URL-embedded token	1 MB	200	No
Honeycomb	Events Batch	Header (`x-honeycomb-team`)	5 MB (impl: 100/batch)	200	No
OTel Collector	OTLP/HTTP	Configurable	Configurable	200	No

Cost at 10 GB/month

Vendor	Vendor Cost	AWS Infra	Total	Free Tier
Sumo Logic	$0	~$5	~$5	500 MB/day
Honeycomb	$0	~$5	~$5	20M events/month
New Relic	$0	~$5	~$5	100 GB/month
Grafana Cloud	$0	~$5	~$5	50 GB logs/month
Datadog	~$15	~$5	~$20	Logs: 14-day trial only
Dynatrace	~$25	~$5	~$30	14-day trial
Elastic Cloud	~$95	~$5	~$100	14-day trial
Splunk Cloud	~$150+	~$5	~$155+	N/A

AWS infrastructure cost is consistent across all vendors (~$5/month for Lambda + EventBridge + Secrets Manager). The vendor platform cost is the differentiator.

Data Residency

Vendor	Tokyo (JP)	US	EU	Self-Hosted
Sumo Logic	Yes	Yes	Yes	No
Elastic	Yes	Yes	Yes	Yes
Dynatrace	Yes (region-specific)	Yes	Yes	Yes (Managed)
Datadog	No	Yes	Yes	No
New Relic	No (July 2026 planned)	Yes	Yes	No
Grafana Cloud	Dedicated only	Yes	Yes	No (Alloy self-hosted)
Splunk	No	Yes	Yes	Yes
Honeycomb	No	Yes	No	No

Governance note: This table provides technical awareness for vendor selection. Grafana Cloud offers Tokyo region on Dedicated tier (not Free/Pro). Data residency alone does not constitute regulatory compliance. Evaluate your specific requirements (APPI, GDPR, FISC, ISMAP) with your compliance team. See the Retention Policy Matrix for regulation-to-vendor mapping.

Unique Strengths

Vendor	Best For
Datadog	Full-stack APM correlation, broadest feature set
New Relic	Generous free tier (100 GB), NRQL power
Splunk	Existing Splunk shops, SPL expertise, Firehose native
Grafana Cloud	OTLP-native, LogQL, open-source ecosystem
Elastic	Data sovereignty (self-hosted), ECS/SIEM, Kibana
Dynatrace	Davis AI root cause analysis, APM correlation
Sumo Logic	JP region data residency, generous free tier
Honeycomb	High-cardinality analysis (BubbleUp, Heatmaps)
OTel Collector	Multi-backend, vendor portability, redaction

Note on Grafana ecosystem: Grafana Alloy (formerly Grafana Agent) provides a Grafana-native alternative to the OpenTelemetry Collector with the same OTLP compatibility. Grafana Cloud's OTLP Gateway is available on all tiers including Free (US/EU regions only). For Tokyo data residency, Grafana Cloud Dedicated is required.

7 Patterns That Survived All 9 Integrations

1. Polling > Event-Driven (for FSx for ONTAP S3 AP)

FSx for ONTAP S3 Access Points don't support S3 Event Notifications. We evaluated CloudTrail data events as an alternative — however, CloudTrail data events for FSx for ONTAP S3 AP access are not consistently available across all configurations. The 5-minute EventBridge Scheduler poll is simpler, cheaper, and sufficient for audit log use cases where near-real-time (not real-time) delivery is acceptable.
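
A rough sketch of the listing step behind each poll follows. The signature is an assumption (the series' own `list_new_keys` may differ), and so are two behaviors it relies on: that audit log keys sort lexicographically in chronological order, and that the access point honors `StartAfter` the way standard S3 does. The client is passed in so the function can be exercised without AWS; in Lambda it would typically be `boto3.client("s3")` with the access point ARN as `Bucket`.

```python
from typing import Any, Dict, List, Optional


def list_new_keys(
    s3_client: Any,
    access_point_arn: str,
    prefix: str,
    checkpoint: Optional[str],
    max_keys: int = 500,  # assumed cap, not a value from the series
) -> List[str]:
    """Return up to max_keys object keys that sort after the checkpoint key."""
    keys: List[str] = []
    kwargs: Dict[str, Any] = {"Bucket": access_point_arn, "Prefix": prefix}
    if checkpoint:
        # StartAfter tells S3 to begin listing after this key (lexicographic order).
        kwargs["StartAfter"] = checkpoint

    while len(keys) < max_keys:
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]

    return keys[:max_keys]
```

Because the checkpoint is just the last delivered key, each poll costs one or a few list calls even when nothing new has arrived.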

2. Checkpoint-After-Delivery

Never advance the checkpoint before confirming vendor delivery. This single rule prevents data loss across all failure modes:

# CORRECT: checkpoint after confirmed delivery
ship_to_vendor(payload)  # Raises on failure
update_checkpoint(key)   # Only reached on success

# WRONG: checkpoint before delivery
update_checkpoint(key)   # What if ship_to_vendor fails next?
ship_to_vendor(payload)  # Data loss if this fails
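
The effect of this ordering under failure can be shown with a small in-memory simulation. Everything here is hypothetical (`process_in_order`, `flaky_ship`, the key names, and the dict standing in for the SSM checkpoint); it only illustrates that a failed delivery leaves the checkpoint at the last confirmed key.

```python
from typing import Callable, Dict, List, Optional


def process_in_order(
    keys: List[str],
    ship: Callable[[str], None],
    state: Dict[str, Optional[str]],
) -> None:
    """Ship each key, advancing the checkpoint only after ship() returns."""
    for key in keys:
        ship(key)                  # raises on failure, so the next line is skipped
        state["checkpoint"] = key  # reached only after confirmed delivery


def flaky_ship(key: str) -> None:
    """Stand-in for a vendor call that fails on the second file."""
    if key == "audit/002.log":
        raise RuntimeError("vendor returned 503")


state: Dict[str, Optional[str]] = {"checkpoint": None}
try:
    process_in_order(["audit/001.log", "audit/002.log", "audit/003.log"], flaky_ship, state)
except RuntimeError:
    pass  # in Lambda the exception would propagate to the retry and DLQ handling

print(state["checkpoint"])  # audit/001.log
```

The next poll starts after `audit/001.log`, so the failed file and everything behind it are retried rather than skipped.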

3. Credential Caching with Reload-on-401

Every vendor integration uses the same SecretBackedAuth pattern: cache credentials at cold start, reload on TTL expiry or 401/403. This handles credential rotation without Lambda redeployment.
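
A minimal sketch of such a cache is shown below. The class name and `get()` come from the text above; the constructor arguments, `invalidate()`, and the `send_with_reload` helper are assumptions for illustration, and the secret fetcher is injected rather than calling Secrets Manager directly so the logic can run anywhere.

```python
import time
from typing import Callable, Dict, Optional


class SecretBackedAuth:
    """Cache a secret in memory; refetch on TTL expiry or after invalidate()."""

    def __init__(
        self,
        fetch_secret: Callable[[], Dict[str, str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_secret = fetch_secret  # e.g. a wrapper around Secrets Manager
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[Dict[str, str]] = None
        self._fetched_at = 0.0

    def get(self) -> Dict[str, str]:
        expired = self._clock() - self._fetched_at >= self._ttl
        if self._cached is None or expired:
            self._cached = self._fetch_secret()
            self._fetched_at = self._clock()
        return self._cached

    def invalidate(self) -> None:
        """Drop the cached value so the next get() refetches."""
        self._cached = None


def send_with_reload(auth: SecretBackedAuth, send: Callable[[Dict[str, str]], int]) -> int:
    """Call send(creds); on 401/403, reload the secret and retry exactly once."""
    status = send(auth.get())
    if status in (401, 403):
        auth.invalidate()
        status = send(auth.get())
    return status
```

Retrying only once keeps a genuinely revoked key from looping: a second 401/403 is returned to the caller, which can fail the invocation.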

4. Reserved Concurrency = 1

The audit poller must not run concurrently: two overlapping invocations would read the same checkpoint and race to update it. ReservedConcurrentExecutions: 1 is the simplest guard. For higher throughput, move to DynamoDB-based per-object locking.

5. DLQ for Every Async Path

Every template includes a KMS-encrypted DLQ. In 9 integrations, the DLQ caught: vendor outages, credential expiry, malformed files, and Lambda timeouts. Without it, these failures would be silent data loss.

6. Vendor-Specific Batch Limits Matter

The biggest implementation difference across vendors is batch size handling:

Vendor	Limit	Lambda Behavior
Honeycomb	100 events	Split into chunks of 100
Dynatrace / Sumo Logic	1 MB	Measure payload size, split at boundary
Datadog	5 MB / 1000 items	Dual limit check
Elastic	~10 MB	Rarely hit with audit logs
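
One way to handle both kinds of limit with a single helper is sketched here. `chunk_events` is a hypothetical name, and the size accounting assumes the batch is sent as a compact JSON array; vendors that expect newline-delimited bodies would need a slightly different byte count.

```python
import json
from typing import Any, Dict, Iterator, List


def chunk_events(
    events: List[Dict[str, Any]],
    max_bytes: int,
    max_items: int,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches whose compact JSON-array encoding respects both limits."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 2  # the enclosing "[" and "]"

    for event in events:
        size = len(json.dumps(event, separators=(",", ":")).encode("utf-8"))
        if size + 2 > max_bytes:
            raise ValueError("single event exceeds max_bytes")
        added = size + (1 if batch else 0)  # +1 for the separating comma
        if batch and (len(batch) >= max_items or batch_bytes + added > max_bytes):
            yield batch
            batch, batch_bytes, added = [], 2, size
        batch.append(event)
        batch_bytes += added

    if batch:
        yield batch
```

Under these assumptions, a Datadog-style dual limit would be `chunk_events(events, max_bytes=5_000_000, max_items=1000)`, while a count-only limit such as Honeycomb's 100 events can pass a very large `max_bytes`.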

7. OTLP as the Universal Format

If you're unsure which vendor you'll use long-term, start with OTLP. The OTel Collector integration (Part 5) proved that a single Lambda producing OTLP can feed Datadog, Grafana, and Honeycomb simultaneously — with zero code changes when adding or removing backends.

Beyond multi-backend delivery, the OTel Collector provides:

Enrichment: Resource detection, Kubernetes attributes, custom metadata injection
Sampling: Tail-based sampling for high-volume environments
Redaction: PII field removal/masking before data leaves your account (see PII Redaction Cookbook)
Format conversion: OTLP ↔ vendor-native format translation

Verified version: All OTel Collector configurations in this series were tested with OpenTelemetry Collector Contrib v0.152.0. OTel Collector has frequent releases with potential breaking changes — pin your version in production and test before upgrading.
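
For readers who have not seen OTLP on the wire, the sketch below wraps parsed audit events in the OTLP/HTTP JSON logs envelope. The function name and the `service.name` value are assumptions, and serializing the whole event into the body is one choice among several (a real pipeline might map fields to log record attributes and set `timeUnixNano` from the event's own timestamp).

```python
import json
import time
from typing import Any, Dict, List


def to_otlp_logs(
    events: List[Dict[str, Any]],
    service_name: str = "fsx-ontap-audit",  # assumed resource name
) -> Dict[str, Any]:
    """Wrap parsed audit events in the OTLP/HTTP JSON logs envelope."""
    observed_ns = str(time.time_ns())  # OTLP JSON encodes 64-bit integers as strings
    return {
        "resourceLogs": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": service_name}}
                    ]
                },
                "scopeLogs": [
                    {
                        "logRecords": [
                            {
                                "observedTimeUnixNano": observed_ns,
                                "severityText": "INFO",
                                "body": {"stringValue": json.dumps(event)},
                            }
                            for event in events
                        ]
                    }
                ],
            }
        ]
    }
```

The resulting dict would be serialized with `json.dumps` and POSTed to a collector's `/v1/logs` path with `Content-Type: application/json`; which backends receive it is then purely collector configuration.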

What We'd Do Differently

Start with OTel Collector for Multi-Vendor Evaluation

If evaluating multiple vendors, deploy the OTel Collector path first. It lets you send the same data to 2-3 vendors simultaneously for comparison, without deploying separate Lambda stacks per vendor.

Define SLOs Before Building

We defined Pipeline SLOs after building all 9 integrations. In hindsight, defining "< 10 min delivery latency" and "< 0.01% data loss" upfront would have guided design decisions earlier (e.g., checkpoint granularity, retry policy).

Data Classification First

Audit logs contain PII (usernames, file paths). We documented this in the Data Classification Guide after implementation. For regulated environments, classify fields before choosing a vendor — it may eliminate options that don't support your data residency requirements.

Production Readiness Framework

After 9 integrations, we formalized a 4-level production readiness model:

Level	What	Go/No-Go to Next
Level 1: Quickstart	Audit poller + DLQ	Logs arrive, checkpoint advances, DLQ empty 24h
Level 2: Operational PoC	+ Dashboard + alerts	SLOs met 7 days, security review done
Level 3: Production	+ DynamoDB ledger + poison-pill	SLOs met 30 days, compliance pack
Level 4: Enterprise	+ OTel Collector + redaction	Multi-backend, PII redaction, DR tested

Most PoCs should target Level 2. Production deployments need Level 3. Enterprise pipelines with compliance requirements need Level 4.

Recommended transition timeline:

Level 1 → Level 2: ~1 week (add dashboards, define SLOs, validate 7-day stability)
Level 2 → Level 3: ~2-4 weeks (deploy DynamoDB ledger, implement poison-pill handling, complete security review)
Level 3 → Level 4: ~1-2 months (deploy OTel Collector, implement PII redaction, test DR failover, complete compliance evidence pack)

Full criteria: Pipeline SLO Definitions

Vendor Selection Decision Tree

Start
  |
  +-- Need JP data residency?
  |   +-- Yes -> Sumo Logic (JP) or Elastic (self-hosted in Tokyo VPC)
  |   +-- No  |
  |           v
  +-- Need self-hosted (air-gapped)?
  |   +-- Yes -> Elastic or Splunk
  |   +-- No  |
  |           v
  +-- Already have an observability platform?
  |   +-- Yes -> Use that vendor (all 9 are supported)
  |   +-- No  |
  |           v
  +-- Budget constraint (free tier needed)?
  |   +-- Yes -> Sumo Logic (500 MB/day) or Honeycomb (20M events) or New Relic (100 GB) or Grafana Cloud (50 GB)
  |   +-- No  |
  |           v
  +-- Need AI-powered root cause analysis?
  |   +-- Yes -> Dynatrace (Davis AI)
  |   +-- No  |
  |           v
  +-- Need high-cardinality analysis?
  |   +-- Yes -> Honeycomb (BubbleUp)
  |   +-- No  |
  |           v
  +-- Need multi-backend / vendor portability?
  |   +-- Yes -> OTel Collector
  |   +-- No  |
  |           v
  +-- Default -> Datadog (broadest) or Grafana (OTLP-native, open ecosystem)

The FSx for ONTAP S3 AP Constraint That Shaped Everything

The single most impactful technical constraint: FSx for ONTAP S3 Access Points do not support S3 Event Notifications.

This one fact drove:

EventBridge Scheduler polling pattern (not event-driven)
SSM Parameter Store checkpointing (track what's been processed)
Reserved concurrency = 1 (prevent checkpoint races)
Safety threshold (stop before Lambda timeout)
MAX_KEYS_PER_RUN (bound processing per invocation)
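
The last two guards can be sketched together. `MAX_KEYS_PER_RUN` is named above, but its value here, the `SAFETY_THRESHOLD_MS` constant, and the `process_bounded` helper are assumptions; `get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object.

```python
from typing import Any, Callable, List

MAX_KEYS_PER_RUN = 100        # assumed default, tune per workload
SAFETY_THRESHOLD_MS = 30_000  # assumed: stop with 30 s of runtime left


def process_bounded(
    keys: List[str],
    context: Any,
    process_one: Callable[[str], None],
) -> List[str]:
    """Process keys until the per-run cap or the time budget is reached."""
    done: List[str] = []
    for key in keys[:MAX_KEYS_PER_RUN]:
        # Check the budget before starting a file, never in the middle of one.
        if context.get_remaining_time_in_millis() < SAFETY_THRESHOLD_MS:
            break  # leave the rest for the next scheduled poll
        process_one(key)
        done.append(key)
    return done
```

Stopping early is safe because the checkpoint only covers delivered keys, so whatever is left over is picked up by the next 5-minute run.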

If FSx for ONTAP S3 APs add event notification support in the future, the architecture could simplify significantly. As of May 2026, this feature is not supported, and the polling pattern is battle-tested across 9 vendors.

Cost Reality: EC2 vs Serverless

The original motivation: replace the EC2-based Splunk pattern (2x EC2 instances) with serverless.

Metric	EC2 Pattern	Serverless Pattern
Monthly AWS cost	~$66	~$5-8
OS patching	Required	None
Scaling	Manual	Automatic
Vendor support	Splunk only	9 vendors
Deploy time	Hours	30 minutes
Recovery from failure	Manual restart	Automatic (DLQ + retry)

Roughly 90% cost reduction with zero operational burden. The serverless pattern wins on every dimension except one: real-time latency (EC2 syslog can be sub-second; our poller runs at 5-minute intervals). For audit logs, 5 minutes is acceptable. For real-time needs, use the FPolicy path (< 30 seconds).
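
As a quick check on the headline figure, the arithmetic from the table's two monthly costs works out as follows (the helper name is only for illustration):

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage saved when monthly cost drops from before to after."""
    return (before - after) / before * 100


low = reduction_pct(66, 8)   # upper end of the serverless estimate
high = reduction_pct(66, 5)  # lower end of the serverless estimate
print(f"{low:.1f}% to {high:.1f}%")  # 87.9% to 92.4%
```

So "90%" is the midpoint of a range that depends on where in the $5-8 band a given deployment lands.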

What's Next

This series covered the foundation. The project continues with:

Phase 3 (delivered): Multi-account deployment (AWS Organizations + StackSets)
Phase 3 (delivered): DynamoDB object ledger for per-object processing state
Phase 3 (delivered): SQS buffering pattern for backpressure handling
Phase 3 (delivered): Cross-region DR with Active-Passive failover
Phase 3 (delivered): OTel Collector PII redaction cookbook (7 recipes for APPI/GDPR)
Phase 4: Terraform module equivalents
Phase 4: CDK construct library

See the full ROADMAP.

Resources

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
Pipeline SLO: docs/en/pipeline-slo.md
Data Classification: docs/en/data-classification.md
S3 AP Throughput Benchmark: docs/en/s3ap-throughput-benchmark.md
Vendor Comparison: docs/en/vendor-comparison.md
Partner FAQ: docs/en/partner-faq.md
Workshop Guide: docs/en/workshop-hands-on-half-day.md
Compliance Evidence Pack: docs/en/compliance-evidence-pack.md

Series Navigation

Part 1: Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP
Part 6: Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway
Part 7: Ship FSx for ONTAP Audit Logs to New Relic via Serverless Lambda Pipeline
Part 8: EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration
Part 9: Data Sovereignty: FSx for ONTAP Logs in Your VPC with Elastic
Part 10: High-Cardinality File Access Analysis with Honeycomb + OTel
Part 11: AI-Powered Root Cause: Correlating File Access with APM via Dynatrace
Part 12: FSx for ONTAP Audit Logs with Data Residency in your region with Sumo Logic
Part 13: 9 Vendors, One Architecture (this post)

Thank you for following this series. If you've deployed any of these integrations, I'd love to hear about your experience — drop a comment or open a GitHub issue.
