9 Services, One Architecture: What We Learned Shipping FSx for ONTAP Logs to Every Major Observability Platform

TL;DR

We built and end-to-end (E2E) verified serverless integrations that ship FSx for ONTAP audit logs to 9 observability platforms, all from the same architecture.

For decision makers: roughly 90% lower AWS cost than EC2-based collectors (~$66/month → ~$5-8/month), 9 vendor choices instead of 1, a 30-minute deploy instead of hours, and no servers to patch or scale. Four vendors offer permanent free tiers that cover most FSx for ONTAP deployments (New Relic 100 GB/month, Grafana Cloud 50 GB/month, Honeycomb 20M events/month, Sumo Logic 500 MB/day).

                    ┌─────────────────────────────────────────────┐
                    │         One Architecture, 9 Backends        │
                    ├─────────────────────────────────────────────┤
                    │                                             │
                    │  FSx for ONTAP ──→ S3 Access Point          │
                    │       │                                     │
                    │       ▼                                     │
                    │  EventBridge Scheduler (5 min)              │
                    │       │                                     │
                    │       ▼                                     │
                    │  Lambda (vendor-specific handler)           │
                    │       │                                     │
                    │       ├──→ Datadog (Logs API v2)            │
                    │       ├──→ New Relic (Log API v1)           │
                    │       ├──→ Splunk (HEC)                     │
                    │       ├──→ Grafana Cloud (OTLP Gateway)     │
                    │       ├──→ Elastic (Bulk API)               │
                    │       ├──→ Dynatrace (Log Ingest v2)        │
                    │       ├──→ Sumo Logic (HTTP Source)         │
                    │       ├──→ Honeycomb (Events Batch API)     │
                    │       └──→ OTel Collector (OTLP/HTTP)       │
                    │                                             │
                    └─────────────────────────────────────────────┘

12 articles, 9 vendors, 3 event sources (audit logs, EMS webhooks, FPolicy), all CloudFormation-templated, all tested with real FSx for ONTAP data. This post distills what we learned.

This is Part 13 — the series finale — of Serverless Observability for FSx for ONTAP.


The Architecture That Survived 9 Integrations

After implementing 9 vendor integrations, the core pattern remained unchanged:

def lambda_handler(event, context):
    # 1. Get cached credentials (Secrets Manager + TTL, default 5 min)
    creds = auth.get()

    # 2. List new files since checkpoint (S3 AP + SSM)
    new_keys = list_new_keys(s3_ap_arn, prefix, checkpoint)

    # Simplified: the actual implementation batches events across files
    # and respects vendor-specific batch size limits.
    for key in new_keys:
        # 3. Read, parse, and format one file (only the formatting is vendor-specific)
        logs = read_and_parse(key)
        payload = format_for_vendor(logs)

        # 4. Ship with retry (vendor API); raises on failure
        ship_to_vendor(payload, creds)

        # 5. Advance checkpoint (only after confirmed delivery)
        update_checkpoint(key)

What changes per vendor: only the formatting and HTTP call (~50-100 lines). Everything else — S3 AP access, checkpoint management, DLQ handling, credential caching, retry logic — is shared.
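To make the "only the formatting changes" point concrete, here is a minimal sketch of two formatters that take the same parsed records. The record fields (`user`, `operation`, `path`) and the `ddsource`, `service`, and `sourcetype` values are placeholders for illustration, not the names used in the repository. The envelope shapes follow the public Datadog Logs API v2 and Splunk HEC documentation.

```python
import json
from typing import Any

# Hypothetical parsed audit record; real field names depend on the parser.
Record = dict[str, Any]


def format_for_datadog(records: list[Record]) -> str:
    """Datadog Logs API v2 accepts a JSON array of log objects."""
    return json.dumps([
        {
            "message": json.dumps(r, sort_keys=True),
            "ddsource": "fsxn-audit",  # placeholder source name
            "service": "fsx-ontap",    # placeholder service name
        }
        for r in records
    ])


def format_for_splunk_hec(records: list[Record]) -> str:
    """Splunk HEC accepts a sequence of JSON event objects in one request body."""
    return "\n".join(
        json.dumps({"event": r, "sourcetype": "fsxn:audit"})  # placeholder sourcetype
        for r in records
    )


# The shared handler only needs to look up the right formatter.
FORMATTERS = {"datadog": format_for_datadog, "splunk": format_for_splunk_hec}

records = [{"user": "alice", "operation": "read", "path": "/vol1/a.txt"}]
datadog_body = FORMATTERS["datadog"](records)
splunk_body = FORMATTERS["splunk"](records)
```

Everything upstream of the formatter (listing, reading, parsing) and downstream of it (retry, checkpoint) can stay identical across vendors in this arrangement.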

Cross-Vendor Comparison: The Numbers

API Characteristics

| Vendor | Endpoint | Auth Model | Max Batch | Success Code | Firehose |
|---|---|---|---|---|---|
| Datadog | Logs API v2 | Header (DD-API-KEY) | 5 MB / 1000 items | 200 | Yes |
| New Relic | Log API v1 | Header (Api-Key) | 1 MB | 202 | Yes |
| Splunk | HEC | Header (Splunk <token>) | No hard limit | 200 | Yes (built-in) |
| Grafana | OTLP Gateway | Basic Auth (base64) | ~4 MB | 200 | No |
| Elastic | Bulk API | Header (ApiKey <b64>) | ~10 MB | 200 | No |
| Dynatrace | Log Ingest v2 | Header (Api-Token) | 1 MB | 204 | Via ActiveGate |
| Sumo Logic | HTTP Source | URL-embedded token | 1 MB | 200 | No |
| Honeycomb | Events Batch | Header (x-honeycomb-team) | 5 MB (impl: 100/batch) | 200 | No |
| OTel Collector | OTLP/HTTP | Configurable | Configurable | 200 | No |
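The auth and success-code columns translate into a small lookup layer. The sketch below is one way to express it, not the code from the repository: the function names are made up for illustration, and the `user` argument for Grafana is assumed to be the Grafana Cloud instance ID. The success codes are copied from the table above.

```python
import base64

# Success codes as listed in the API characteristics table.
SUCCESS_CODES = {
    "datadog": 200, "newrelic": 202, "splunk": 200, "grafana": 200,
    "elastic": 200, "dynatrace": 204, "sumologic": 200, "honeycomb": 200,
}


def auth_headers(vendor: str, secret: str, user: str = "") -> dict[str, str]:
    """Build the auth header for a vendor; `secret` is the API key or token."""
    if vendor == "datadog":
        return {"DD-API-KEY": secret}
    if vendor == "newrelic":
        return {"Api-Key": secret}
    if vendor == "splunk":
        return {"Authorization": f"Splunk {secret}"}
    if vendor == "grafana":
        # Basic auth over "<user>:<token>"; user assumed to be the instance ID.
        token = base64.b64encode(f"{user}:{secret}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    if vendor == "elastic":
        return {"Authorization": f"ApiKey {secret}"}  # secret is already base64
    if vendor == "dynatrace":
        return {"Authorization": f"Api-Token {secret}"}
    if vendor == "honeycomb":
        return {"x-honeycomb-team": secret}
    if vendor == "sumologic":
        return {}  # the token is embedded in the HTTP Source URL
    raise ValueError(f"unknown vendor: {vendor}")


def is_delivered(vendor: str, status_code: int) -> bool:
    """Treat only the documented success code as confirmed delivery."""
    return status_code == SUCCESS_CODES[vendor]
```

Checking for the exact code rather than "any 2xx" is a deliberately strict choice here. Some APIs (Elastic Bulk, for example) can also report per-item errors in a 200 response body, which this sketch does not inspect.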

Cost at 10 GB/month

| Vendor | Vendor Cost | AWS Infra | Total | Free Tier |
|---|---|---|---|---|
| Sumo Logic | $0 | ~$5 | ~$5 | 500 MB/day |
| Honeycomb | $0 | ~$5 | ~$5 | 20M events/month |
| New Relic | $0 | ~$5 | ~$5 | 100 GB/month |
| Grafana Cloud | $0 | ~$5 | ~$5 | 50 GB logs/month |
| Datadog | ~$15 | ~$5 | ~$20 | Logs: 14-day trial only |
| Dynatrace | ~$25 | ~$5 | ~$30 | 14-day trial |
| Elastic Cloud | ~$95 | ~$5 | ~$100 | 14-day trial |
| Splunk Cloud | ~$150+ | ~$5 | ~$155+ | N/A |

AWS infrastructure cost is consistent across all vendors (~$5/month for Lambda + EventBridge + Secrets Manager). The vendor platform cost is the differentiator.

Data Residency

| Vendor | Tokyo (JP) | US | EU | Self-Hosted |
|---|---|---|---|---|
| Sumo Logic | Yes | Yes | Yes | No |
| Elastic | Yes | Yes | Yes | Yes |
| Dynatrace | Yes (region-specific) | Yes | Yes | Yes (Managed) |
| Datadog | No | Yes | Yes | No |
| New Relic | No (July 2026 planned) | Yes | Yes | No |
| Grafana Cloud | Dedicated only | Yes | Yes | No (Alloy self-hosted) |
| Splunk | No | Yes | Yes | Yes |
| Honeycomb | No | Yes | No | No |

Governance note: This table provides technical awareness for vendor selection. Grafana Cloud offers Tokyo region on Dedicated tier (not Free/Pro). Data residency alone does not constitute regulatory compliance. Evaluate your specific requirements (APPI, GDPR, FISC, ISMAP) with your compliance team. See the Retention Policy Matrix for regulation-to-vendor mapping.

Unique Strengths

| Vendor | Best For |
|---|---|
| Datadog | Full-stack APM correlation, broadest feature set |
| New Relic | Generous free tier (100 GB), NRQL power |
| Splunk | Existing Splunk shops, SPL expertise, Firehose native |
| Grafana Cloud | OTLP-native, LogQL, open-source ecosystem |
| Elastic | Data sovereignty (self-hosted), ECS/SIEM, Kibana |
| Dynatrace | Davis AI root cause analysis, APM correlation |
| Sumo Logic | JP region data residency, generous free tier |
| Honeycomb | High-cardinality analysis (BubbleUp, Heatmaps) |
| OTel Collector | Multi-backend, vendor portability, redaction |

Note on Grafana ecosystem: Grafana Alloy (formerly Grafana Agent) provides a Grafana-native alternative to the OpenTelemetry Collector with the same OTLP compatibility. Grafana Cloud's OTLP Gateway is available on all tiers including Free (US/EU regions only). For Tokyo data residency, Grafana Cloud Dedicated is required.

7 Patterns That Survived All 9 Integrations

1. Polling > Event-Driven (for FSx for ONTAP S3 AP)

FSx for ONTAP S3 Access Points don't support S3 Event Notifications. We evaluated CloudTrail data events as an alternative — however, CloudTrail data events for FSx for ONTAP S3 AP access are not consistently available across all configurations. The 5-minute EventBridge Scheduler poll is simpler, cheaper, and sufficient for audit log use cases where near-real-time (not real-time) delivery is acceptable.
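A sketch of the polling step, under one loud assumption: object keys sort lexicographically in the order they were written (for example because the key embeds a timestamp), so "new since checkpoint" means "sorts after the checkpoint key". The `list_page` callable stands in for an S3 `ListObjectsV2` call with `StartAfter` set to the checkpoint. It is injected here so the logic runs without AWS, and the key names are invented.

```python
from typing import Callable, Iterable


def list_new_keys(
    list_page: Callable[[str], Iterable[str]],
    checkpoint: str,
    max_keys: int,
) -> list[str]:
    """Return up to max_keys keys that sort strictly after the checkpoint."""
    new_keys: list[str] = []
    for key in list_page(checkpoint):
        if key <= checkpoint:
            continue  # defensive: never reprocess at or before the checkpoint
        new_keys.append(key)
        if len(new_keys) >= max_keys:
            break
    return new_keys


# Fake listing that mimics S3's lexicographic key ordering.
ALL_KEYS = [
    "audit/2026-05-01T00-00.json",
    "audit/2026-05-01T00-05.json",
    "audit/2026-05-01T00-10.json",
]


def fake_list_page(start_after: str) -> list[str]:
    return [k for k in sorted(ALL_KEYS) if k > start_after]


pending = list_new_keys(fake_list_page, "audit/2026-05-01T00-00.json", max_keys=10)
```

If your keys do not sort chronologically, a key-based checkpoint like this one is not sufficient and you would need to track processed objects individually.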

2. Checkpoint-After-Delivery

Never advance the checkpoint before confirming vendor delivery. This single rule prevents data loss across all failure modes:

# CORRECT: checkpoint after confirmed delivery
ship_to_vendor(payload)  # Raises on failure
update_checkpoint(key)   # Only reached on success

# WRONG: checkpoint before delivery
update_checkpoint(key)   # What if ship_to_vendor fails next?
ship_to_vendor(payload)  # Data loss if this fails
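To see the rule hold under failure, here is a small runnable simulation. The in-memory `checkpoint_store` dict stands in for the SSM parameter, and `flaky_ship` is an invented stub that fails on the second file. Neither comes from the repository.

```python
class DeliveryError(Exception):
    """Raised when the vendor does not confirm delivery."""


def process_keys(keys, ship, checkpoint_store):
    """Ship each key in order; stop at the first failure without advancing."""
    for key in keys:
        try:
            ship(key)
        except DeliveryError:
            break  # checkpoint stays at the last confirmed key; retry next poll
        checkpoint_store["last_key"] = key
    return checkpoint_store


def flaky_ship(key):
    # Simulated vendor outage on the second file.
    if key == "file-2":
        raise DeliveryError(key)


store = process_keys(["file-1", "file-2", "file-3"], flaky_ship, {"last_key": None})
```

After this run the checkpoint points at `file-1`, so the next poll starts again at `file-2`. The trade-off is at-least-once delivery: a file whose delivery succeeded but whose checkpoint write failed would be shipped twice.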

3. Credential Caching with Reload-on-401

Every vendor integration uses the same SecretBackedAuth pattern: cache credentials at cold start, reload on TTL expiry or 401/403. This handles credential rotation without Lambda redeployment.
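A minimal sketch of what such a class could look like. Only the name `SecretBackedAuth` and the 5-minute default TTL come from this series; the constructor arguments, the lazy first load, and `send_with_reload` are assumptions for illustration. `fetch` would wrap a Secrets Manager read in a real deployment.

```python
import time
from typing import Callable, Optional


class SecretBackedAuth:
    """Cache a secret with a TTL and allow a forced reload after 401/403."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # e.g. a wrapper around a Secrets Manager read
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._loaded_at = 0.0

    def get(self, force_reload: bool = False) -> str:
        expired = self._clock() - self._loaded_at >= self._ttl
        if force_reload or self._value is None or expired:
            self._value = self._fetch()
            self._loaded_at = self._clock()
        return self._value


def send_with_reload(auth: SecretBackedAuth, send: Callable[[str], int]) -> int:
    """Call send(credential); on 401/403 reload the secret and retry once."""
    status = send(auth.get())
    if status in (401, 403):
        status = send(auth.get(force_reload=True))
    return status


# Simulate a rotation: the cached key is stale, the reloaded key works.
secrets = iter(["old-key", "new-key"])
auth = SecretBackedAuth(fetch=lambda: next(secrets))
status = send_with_reload(auth, lambda key: 200 if key == "new-key" else 401)
```

Retrying only once after a reload keeps a genuinely revoked credential from turning into a retry loop. A second 401 surfaces as a normal failure.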

4. Reserved Concurrency = 1

The audit poller must not run concurrently (checkpoint race condition). ReservedConcurrentExecutions: 1 is the simplest guard. For higher throughput, move to DynamoDB-based per-object locking.

5. DLQ for Every Async Path

Every template includes a KMS-encrypted DLQ. In 9 integrations, the DLQ caught: vendor outages, credential expiry, malformed files, and Lambda timeouts. Without it, these failures would be silent data loss.

6. Vendor-Specific Batch Limits Matter

The biggest implementation difference across vendors is batch size handling:

| Vendor | Limit | Lambda Behavior |
|---|---|---|
| Honeycomb | 100 events | Split into chunks of 100 |
| Dynatrace / Sumo Logic | 1 MB | Measure payload size, split at boundary |
| Datadog | 5 MB / 1000 items | Dual limit check |
| Elastic | ~10 MB | Rarely hit with audit logs |
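The table's behaviors reduce to one splitter with two limits. The sketch below is an illustration rather than the repository code: it assumes the payload is serialized as a compact JSON array, and it slightly overestimates batch size so that a batch never exceeds the byte limit.

```python
import json
from typing import Any, Iterator


def split_batches(events: list[dict[str, Any]], max_items: int,
                  max_bytes: int) -> Iterator[list[dict[str, Any]]]:
    """Yield batches that respect both an item-count and a byte-size limit."""
    batch: list[dict[str, Any]] = []
    batch_bytes = 2  # the enclosing "[" and "]"
    for event in events:
        encoded = json.dumps(event, separators=(",", ":")).encode("utf-8")
        size = len(encoded) + 1  # +1 for the "," separator
        if size + 2 > max_bytes:
            raise ValueError("single event exceeds the vendor byte limit")
        if batch and (len(batch) >= max_items or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 2
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch


# Count-limited case (Honeycomb-style chunks of 100).
by_count = list(split_batches([{"id": i} for i in range(250)],
                              max_items=100, max_bytes=5_000_000))

# Byte-limited case with an artificially small limit to show the split.
by_size = list(split_batches([{"m": "x" * 10}] * 3, max_items=1000, max_bytes=45))
```

For a vendor with only a byte limit, pass a very large `max_items`. For a count-only limit, pass a very large `max_bytes`. An oversized single event raises instead of being silently dropped, which lets it reach the DLQ path.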

7. OTLP as the Universal Format

If you're unsure which vendor you'll use long-term, start with OTLP. The OTel Collector integration (Part 5) proved that a single Lambda producing OTLP can feed Datadog, Grafana, and Honeycomb simultaneously — with zero code changes when adding or removing backends.
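For reference, an OTLP/HTTP logs request is a JSON document with a fixed envelope (`resourceLogs` → `scopeLogs` → `logRecords`) posted to the receiver's `/v1/logs` path. The sketch below builds that envelope from parsed records. The scope name and the choice to copy every record field into string attributes are assumptions made for illustration.

```python
import json
import time


def to_otlp_logs(records: list[dict], service_name: str) -> dict:
    """Wrap parsed records in the OTLP/HTTP JSON envelope for logs."""
    observed_ns = str(time.time_ns())  # OTLP JSON encodes 64-bit ints as strings
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{
                "scope": {"name": "fsxn-audit-poller"},  # hypothetical scope name
                "logRecords": [
                    {
                        "observedTimeUnixNano": observed_ns,
                        "severityText": "INFO",
                        "body": {"stringValue": json.dumps(r, sort_keys=True)},
                        "attributes": [
                            {"key": k, "value": {"stringValue": str(v)}}
                            for k, v in sorted(r.items())
                        ],
                    }
                    for r in records
                ],
            }],
        }]
    }


payload = to_otlp_logs([{"user": "alice", "operation": "read"}], "fsx-ontap")
body = json.dumps(payload)  # send with Content-Type: application/json
```

A production formatter would also set `timeUnixNano` from the audit event's own timestamp. This sketch records only the observation time because the parsed timestamp format is not specified here.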

Beyond multi-backend delivery, the OTel Collector provides:

  • Enrichment: Resource detection, Kubernetes attributes, custom metadata injection
  • Sampling: Tail-based sampling for high-volume environments
  • Redaction: PII field removal/masking before data leaves your account (see PII Redaction Cookbook)
  • Format conversion: OTLP ↔ vendor-native format translation

Verified version: All OTel Collector configurations in this series were tested with OpenTelemetry Collector Contrib v0.152.0. OTel Collector has frequent releases with potential breaking changes — pin your version in production and test before upgrading.

What We'd Do Differently

Start with OTel Collector for Multi-Vendor Evaluation

If evaluating multiple vendors, deploy the OTel Collector path first. It lets you send the same data to 2-3 vendors simultaneously for comparison, without deploying separate Lambda stacks per vendor.

Define SLOs Before Building

We defined Pipeline SLOs after building all 9 integrations. In hindsight, defining "< 10 min delivery latency" and "< 0.01% data loss" upfront would have guided design decisions earlier (e.g., checkpoint granularity, retry policy).
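Written down as a check, those two targets look roughly like this. The `PipelineWindow` fields are invented for the sketch. How you count written versus delivered events depends on your own metrics.

```python
from dataclasses import dataclass


@dataclass
class PipelineWindow:
    events_written: int         # events present in the source audit files
    events_delivered: int       # events confirmed by the vendor
    max_latency_seconds: float  # worst write-to-delivery delay in the window


def meets_slos(w: PipelineWindow, max_latency_s: float = 600.0,
               max_loss_ratio: float = 0.0001) -> bool:
    """Check the '< 10 min latency' and '< 0.01% data loss' targets."""
    if w.events_written:
        loss = 1 - w.events_delivered / w.events_written
    else:
        loss = 0.0
    return w.max_latency_seconds < max_latency_s and loss < max_loss_ratio


healthy = meets_slos(PipelineWindow(1_000_000, 999_950, 420.0))
lossy = meets_slos(PipelineWindow(1_000_000, 999_800, 420.0))
```

At one million events, the 0.01% target allows fewer than 100 lost events, which is a tight enough budget to influence checkpoint granularity and retry policy.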

Data Classification First

Audit logs contain PII (usernames, file paths). We documented this in the Data Classification Guide after implementation. For regulated environments, classify fields before choosing a vendor — it may eliminate options that don't support your data residency requirements.

Production Readiness Framework

After 9 integrations, we formalized a 4-level production readiness model:

| Level | What | Go/No-Go to Next |
|---|---|---|
| Level 1: Quickstart | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2: Operational PoC | + Dashboard + alerts | SLOs met 7 days, security review done |
| Level 3: Production | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4: Enterprise | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |

Most PoCs should target Level 2. Production deployments need Level 3. Enterprise pipelines with compliance requirements need Level 4.

Recommended transition timeline:

  • Level 1 → Level 2: ~1 week (add dashboards, define SLOs, validate 7-day stability)
  • Level 2 → Level 3: ~2-4 weeks (deploy DynamoDB ledger, implement poison-pill handling, complete security review)
  • Level 3 → Level 4: ~1-2 months (deploy OTel Collector, implement PII redaction, test DR failover, complete compliance evidence pack)

Full criteria: Pipeline SLO Definitions

Vendor Selection Decision Tree

Start
  |
  +-- Need JP data residency?
  |   +-- Yes -> Sumo Logic (JP) or Elastic (self-hosted in Tokyo VPC)
  |   +-- No  |
  |           v
  +-- Need self-hosted (air-gapped)?
  |   +-- Yes -> Elastic or Splunk
  |   +-- No  |
  |           v
  +-- Already have an observability platform?
  |   +-- Yes -> Use that vendor (all 9 are supported)
  |   +-- No  |
  |           v
  +-- Budget constraint (free tier needed)?
  |   +-- Yes -> Sumo Logic (500 MB/day) or Honeycomb (20M events) or New Relic (100 GB)
  |   +-- No  |
  |           v
  +-- Need AI-powered root cause analysis?
  |   +-- Yes -> Dynatrace (Davis AI)
  |   +-- No  |
  |           v
  +-- Need high-cardinality analysis?
  |   +-- Yes -> Honeycomb (BubbleUp)
  |   +-- No  |
  |           v
  +-- Need multi-backend / vendor portability?
  |   +-- Yes -> OTel Collector
  |   +-- No  |
  |           v
  +-- Default -> Datadog (broadest) or Grafana (OTLP-native, open ecosystem)

The FSx for ONTAP S3 AP Constraint That Shaped Everything

The single most impactful technical constraint: FSx for ONTAP S3 Access Points do not support S3 Event Notifications.

This one fact drove:

  • EventBridge Scheduler polling pattern (not event-driven)
  • SSM Parameter Store checkpointing (track what's been processed)
  • Reserved concurrency = 1 (prevent checkpoint races)
  • Safety threshold (stop before Lambda timeout)
  • MAX_KEYS_PER_RUN (bound processing per invocation)

If FSx for ONTAP S3 APs add event notification support in the future, the architecture could simplify significantly. As of May 2026, this feature is not supported, and the polling pattern is battle-tested across 9 vendors.
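The last two items in the list above, the safety threshold and `MAX_KEYS_PER_RUN`, can be sketched as one bounded loop. The values below are illustrative, not the defaults from the templates. `get_remaining_time_in_millis()` is the standard method on the Lambda context object, and `FakeContext` is a stand-in so the example runs locally.

```python
MAX_KEYS_PER_RUN = 3          # illustrative; tune to file size and timeout
SAFETY_THRESHOLD_MS = 30_000  # assumed margin: stop with 30 s left


def process_bounded(keys, context, handle):
    """Process keys until the per-run cap or the time margin is reached."""
    processed = []
    for key in keys[:MAX_KEYS_PER_RUN]:
        if context.get_remaining_time_in_millis() < SAFETY_THRESHOLD_MS:
            break  # leave the rest for the next scheduled poll
        handle(key)
        processed.append(key)
    return processed


class FakeContext:
    """Stand-in for the Lambda context; each check 'burns' 40 s."""

    def __init__(self, remaining_ms):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        current = self.remaining_ms
        self.remaining_ms -= 40_000
        return current


done = process_bounded(["a", "b", "c", "d"], FakeContext(100_000),
                       handle=lambda key: None)
```

Here the loop stops after two keys because the third check sees less than 30 seconds remaining. Unprocessed keys are not lost, since the checkpoint has not moved past them.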

Cost Reality: EC2 vs Serverless

The original motivation: replace the EC2-based Splunk pattern (2x EC2 instances) with serverless.

| Metric | EC2 Pattern | Serverless Pattern |
|---|---|---|
| Monthly AWS cost | ~$66 | ~$5-8 |
| OS patching | Required | None |
| Scaling | Manual | Automatic |
| Vendor support | Splunk only | 9 vendors |
| Deploy time | Hours | 30 minutes |
| Recovery from failure | Manual restart | Automatic (DLQ + retry) |

That is roughly a 90% reduction in AWS cost, with no servers to patch or scale. The serverless pattern wins on every dimension except one: real-time latency (EC2 syslog can be sub-second; our poller runs at 5-minute intervals). For audit logs, 5 minutes is acceptable. For real-time needs, use the FPolicy path (< 30 seconds).
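The headline percentage follows directly from the monthly figures in the table. As a quick check of the arithmetic:

```python
ec2_monthly = 66.0
serverless_low, serverless_high = 5.0, 8.0

# Reduction relative to the EC2 baseline, at both ends of the serverless range.
best_case = (ec2_monthly - serverless_low) / ec2_monthly * 100
worst_case = (ec2_monthly - serverless_high) / ec2_monthly * 100

print(f"{worst_case:.1f}% to {best_case:.1f}% lower")
```

So the reduction lands between about 88% and 92% depending on where in the $5-8 range a deployment falls. These figures cover AWS infrastructure only and exclude vendor platform cost.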

What's Next

This series covered the foundation. The project continues; see the full ROADMAP.


Thank you for following this series. If you've deployed any of these integrations, I'd love to hear about your experience — drop a comment or open a GitHub issue.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
