Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway

TL;DR

We built a direct Lambda-to-Grafana Cloud pipeline that ships FSx for ONTAP audit logs to Loki without an intermediate OTel Collector. Three Lambda functions cover all event sources:

  • FSx for ONTAP audit logs → EventBridge Scheduler (every 5 min) → Lambda (polls & reads via S3 Access Point) → OTLP Gateway → Loki
  • EMS webhooks (ransomware alerts, quota warnings) → API Gateway → Lambda → OTLP Gateway → Loki
  • FPolicy file operations (real-time CIFS/SMB events) → ECS Fargate → SQS → Bridge Lambda → EventBridge → Lambda → OTLP Gateway → Loki

Everything is CloudFormation-templated and deployable with a single script, with no hardcoded values: every setting is a template parameter. This is a Grafana-specific direct integration by design; use the Collector path from Part 5 when you need backend portability.

If you only want to validate the path quickly, jump to First Success Path and deploy the audit poller first.


This is the single-backend counterpart to Part 5: simpler when Grafana Cloud is the chosen destination, less flexible when backend portability, enrichment, redaction, or multi-backend routing is required.

Why Direct Send (Without OTel Collector)?

In Part 5, we showed how the OTel Collector decouples Lambda from backends. That's the right choice when you need multi-backend delivery or vendor migration flexibility.

But if Grafana Cloud is your single observability platform and your goal is a simple serverless path, direct OTLP can be a good starting point. For production pipelines that need richer buffering, metadata enrichment, redaction, or routing, Grafana recommends an Alloy / Collector-based architecture.

| Approach | Components | Latency | Cost |
|---|---|---|---|
| OTel Collector | Lambda → Collector (ECS/EC2) → Grafana | +50-100ms | Collector compute |
| Direct send | Lambda → Grafana OTLP Gateway | Minimal | Lambda only |

The direct path is simpler, cheaper, and has fewer failure points. You can always graduate to the Collector path later (Part 5 shows how). Direct send is a good fit when operational simplicity is more important than in-pipeline enrichment, redaction, buffering, and multi-backend routing. If those requirements become mandatory, move the same OTLP payload model behind Alloy or the OpenTelemetry Collector.

Direct send reduces moving parts, but it also removes the Collector / Alloy queueing layer. For production, decide whether Lambda retry and DLQ are sufficient, or whether you need SQS buffering, DLQ replay, or the Collector / Alloy path for stronger delivery guarantees during endpoint outages or throttling.

Delivery guarantee decision (see full pattern guide):

  • Quickstart (this template): Scheduler retry + Scheduler DLQ + Lambda reserved concurrency + checkpoint retry
  • Medium volume: add Lambda failure destination and operational replay procedures
  • Higher reliability: insert SQS before shipping, or place Alloy / OTel Collector behind Lambda for batching, retry with persistent queue, transform, redaction, and multi-backend routing
  • Multi-backend or redaction/routing: use Part 5 Collector path

Architecture

┌─────────────────────────────────────────────────────┐
│ Event Sources                                        │
├─────────────────────────────────────────────────────┤
│                                                      │
│  EventBridge Scheduler                               │
│  rate(5 minutes) ──→ Lambda                          │
│                       │ lists new files via           │
│                       │ S3 Access Point              │
│                       │ (checkpoint in SSM)          │
│                       ▼                              │
│                OTLP Gateway                          │
│                (Grafana Cloud)                        │
│                       │                              │
│  EMS Webhook          │                              │
│  ──→ API GW ──→ Lambda ────────────┤                │
│     (ems_handler)                   │                │
│                                     ▼                │
│  FPolicy                           Loki             │
│  ──→ ECS Fargate ──→ SQS          (Explore,        │
│  ──→ Bridge Lambda                  Dashboard)      │
│  ──→ EventBridge                                    │
│  ──→ Lambda (fpolicy_handler) ─────────────────────┤
└─────────────────────────────────────────────────────┘

The audit log path uses a polling pattern: EventBridge Scheduler invokes Lambda every 5 minutes. Lambda lists new objects via the S3 Access Point, reads and processes them, then updates an SSM Parameter Store checkpoint to track progress. This avoids reliance on S3 Event Notifications, which are not supported by FSx for ONTAP S3 Access Points.

The same S3 Access Point boundary can be reused for other automation patterns (AI/ML, analytics, compliance archival) because the audit files remain on FSx for ONTAP while Lambda reads them through standard S3 object APIs — no data copy or NFS/SMB mount required.

This pattern does not replace ONTAP audit, EMS, or FPolicy configuration; it provides an AWS-native delivery and visualization layer for those ONTAP-native signals.

For business-critical workloads such as SAP, databases, VDI, or enterprise file services, treat this pipeline as an observability and evidence layer. It complements, but does not replace, workload-specific HA, backup, restore, and DR designs.

Use cases this unlocks:

  • Investigate file access activity for FSx for ONTAP-hosted enterprise file shares
  • Monitor available ONTAP EMS alerts, such as ransomware-related events, quota warnings, and storage/system events
  • Correlate audit logs, EMS, and FPolicy file operations in a single Grafana dashboard
  • Provide a lightweight observability path for SAP, database, VDI, and file service workloads using FSx for ONTAP
  • Start with direct OTLP delivery and graduate to Alloy / Collector when governance or multi-backend routing is required

The FPolicy path has two Lambda roles: a bridge Lambda that converts ECS/FPolicy server SQS output into EventBridge events, and fpolicy_handler.py, which ships those normalized EventBridge events to Grafana Cloud.

Key Discovery: OTLP Gateway, Not Loki Push API

During E2E verification, the Loki Push API returned HTTP 530 in my trial account. The OTLP Gateway worked reliably in this project and is the recommended Grafana Cloud OTLP ingestion path.

For logs, Grafana Cloud routes OTLP log data to Loki, where it becomes queryable with LogQL.

Our Lambda auto-detects the endpoint mode from the URL:

def _is_otlp_endpoint(endpoint: str) -> bool:
    """Detect Grafana OTLP Gateway or generic OTLP/HTTP logs endpoint."""
    endpoint = endpoint.rstrip("/")
    return (
        "otlp-gateway" in endpoint
        or endpoint.endswith("/otlp")
        or endpoint.endswith("/otlp/v1/logs")
        or endpoint.endswith("/v1/logs")
    )

USE_OTLP = _is_otlp_endpoint(LOKI_ENDPOINT)

When using the OTLP Gateway, configure LOKI_ENDPOINT as the base OTLP endpoint ending in /otlp. The Lambda appends /v1/logs when sending logs:

# Configure as base endpoint (Lambda appends /v1/logs)
LOKI_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
# Lambda POSTs to: https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp/v1/logs

The handler also accepts the full path (/otlp/v1/logs) without double-appending.
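The append-once behavior described above can be sketched as follows. This is a minimal illustration, not the repository's code; `build_logs_url` is a hypothetical helper name.

```python
def build_logs_url(endpoint: str) -> str:
    """Return the OTLP/HTTP logs URL for a base (/otlp) or full (/otlp/v1/logs) endpoint."""
    endpoint = endpoint.rstrip("/")
    if endpoint.endswith("/v1/logs"):
        return endpoint  # Full path already configured; do not append twice
    return f"{endpoint}/v1/logs"


base = "https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"
print(build_logs_url(base))
print(build_logs_url(base + "/v1/logs/"))
```

Both calls resolve to the same `/otlp/v1/logs` URL, so either form of LOKI_ENDPOINT is safe to configure.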

| Endpoint | URL Pattern | Status |
|---|---|---|
| OTLP Gateway (preferred) | https://otlp-gateway-prod-<region>.grafana.net/otlp | ✅ Recommended by Grafana Cloud docs; verified in this project |
| Loki Push API (fallback) | https://logs-prod-<region>.grafana.net/loki/api/v1/push | ⚠️ May behave differently by account state; returned 530 in my trial validation |
| Self-hosted Loki OTLP | https://<loki-host>/otlp | Requires Loki OTLP ingestion support and structured metadata configuration; Loki 3.0+ enables structured metadata by default |

Authentication: Basic Auth with base64 Encoding

Grafana Cloud uses Basic Auth for both endpoints. The critical detail: the value is base64(instanceId:apiToken), not plain text concatenation.

from base64 import b64encode

instance_id = "123456"  # From Grafana Cloud console
api_token = "glc_..."   # logs:write scope

credentials = f"{instance_id}:{api_token}"
auth_header = f"Basic {b64encode(credentials.encode()).decode()}"

Credentials are stored in AWS Secrets Manager as JSON:

{"instance_id": "<id>", "api_key": "<token>"}

The Lambda reads this at cold start and caches the auth header for subsequent invocations. For production, use the shared auth_cache.py module which provides TTL-based caching with automatic reload-on-401/403, so credential rotation does not require waiting for a new Lambda execution environment.
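To make the TTL-plus-reload idea concrete, here is a minimal sketch of such a cache. It is not the repository's auth_cache.py: the class name `AuthHeaderCache`, the injected `load_secret` callable (standing in for a Secrets Manager read), and the 300-second default TTL are all assumptions for illustration.

```python
import json
import time
from base64 import b64encode
from typing import Callable, Optional


class AuthHeaderCache:
    """Cache a Basic Auth header with a TTL and an explicit invalidate hook."""

    def __init__(self, load_secret: Callable[[], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load_secret = load_secret  # Returns the secret JSON string
        self._ttl = ttl_seconds
        self._clock = clock
        self._header: Optional[str] = None
        self._loaded_at = 0.0

    def get(self) -> str:
        """Return the cached header, reloading the secret when missing or expired."""
        expired = self._clock() - self._loaded_at >= self._ttl
        if self._header is None or expired:
            secret = json.loads(self._load_secret())
            raw = f"{secret['instance_id']}:{secret['api_key']}"
            self._header = f"Basic {b64encode(raw.encode()).decode()}"
            self._loaded_at = self._clock()
        return self._header

    def invalidate(self) -> None:
        """Call after a 401/403 so the next get() re-reads the secret."""
        self._header = None


# Placeholder loader; a real one would read the secret from Secrets Manager.
cache = AuthHeaderCache(lambda: '{"instance_id": "123456", "api_key": "glc_example"}')
header = cache.get()
```

On a 401 or 403 from Grafana, the caller would invoke `cache.invalidate()` and retry once with a fresh `cache.get()`, which picks up a rotated token without waiting for a new execution environment.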

Internally, normalized records are now converted directly to OTLP as the primary path. Loki Push formatting is kept only as a fallback mode. This aligns with Part 5's "OTLP as producer contract" principle. For the full OTLP resource/log-record/body mapping and fsxn.* attribute naming policy, see the Grafana Operations Guide.
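As a rough picture of what "converted directly to OTLP" means, the sketch below builds an OTLP/HTTP JSON logs payload from normalized records. The record shape (`body`, optional `severity` and `time_unix_nano` keys), the function name `to_otlp_logs`, and the scope name `fsxn-log-shipper` are assumptions for illustration; the project's actual mapping is documented in its operations guide.

```python
import json
import time
from typing import Any, Dict, Iterable, Optional


def _attr(key: str, value: str) -> Dict[str, Any]:
    """Build one OTLP key/value attribute with a string value."""
    return {"key": key, "value": {"stringValue": value}}


def to_otlp_logs(records: Iterable[Dict[str, Any]], service_name: str,
                 now_ns: Optional[int] = None) -> Dict[str, Any]:
    """Wrap normalized records in the resourceLogs/scopeLogs/logRecords envelope."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    log_records = [
        {
            # OTLP JSON encodes 64-bit integers as strings
            "timeUnixNano": str(rec.get("time_unix_nano", now_ns)),
            "severityText": str(rec.get("severity", "INFO")),
            # Keep the full record as a JSON string so `| json` works at query time
            "body": {"stringValue": json.dumps(rec["body"], sort_keys=True)},
        }
        for rec in records
    ]
    return {
        "resourceLogs": [{
            "resource": {"attributes": [_attr("service.name", service_name)]},
            "scopeLogs": [{
                "scope": {"name": "fsxn-log-shipper"},  # Hypothetical scope name
                "logRecords": log_records,
            }],
        }]
    }


payload = to_otlp_logs(
    [{"body": {"Operation": "create", "UserName": "alice"}}], "fsxn-audit"
)
```

The `service.name` resource attribute is what surfaces as the `service_name` label used in the queries throughout this post.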

The Three Lambda Handlers

1. FSx Audit Log Handler via S3 Access Point (handler.py)

Polls for new FSx ONTAP audit log files via S3 Access Point, parses JSON/EVTX, and ships to Grafana Cloud. Uses SSM Parameter Store to checkpoint progress between invocations.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager

    if event.get("source") == "scheduler":
        # Polling mode: list new files, process, update checkpoint
        prefix = event.get("prefix", "")  # Supplied in the Scheduler input payload
        last_key = get_checkpoint()  # SSM Parameter Store
        new_keys = list_new_keys(S3_ACCESS_POINT_ARN, prefix, last_key,
                                 limit=MAX_KEYS_PER_RUN)

        for key in new_keys:
            if context.get_remaining_time_in_millis() < SAFETY_THRESHOLD_MS:
                break  # Stop early, resume on next scheduled run
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=key)
            logs = parse_logs(raw["Body"].read(), key)
            ship_to_grafana(logs, key, auth_header)  # Raises on failure
            set_checkpoint(key)  # Only after confirmed delivery
    else:
        # Manual test mode using an S3-event-shaped payload
        for record in extract_s3_records(event):
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=record["key"])
            logs = parse_logs(raw["Body"].read(), record["key"])
            ship_to_grafana(logs, record["key"], auth_header)

Query in Grafana Explore:

{service_name="fsxn-audit"} | json | Operation="create"

2. EMS Webhook Handler (ems_handler.py)

Receives ONTAP EMS events via API Gateway, parses with the shared EMS parser layer, and forwards to Grafana.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager
    body = event.get("body") or ""  # API Gateway may pass a null body
    normalized = parse_ems_event(body)  # Shared Lambda Layer

    if USE_OTLP:
        payload = format_for_otlp(normalized)
    else:
        payload = format_for_loki(normalized)

    ship_to_grafana(payload, auth_header)

Labels: {service_name="fsxn-ems", source="ontap", severity="alert"}

Security note: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Use API Gateway authorization controls such as an API key, IAM authorization, Lambda authorizer, resource policy, WAF, or source IP restrictions based on your network design. The quickstart template uses AuthorizationType: NONE for simplicity — add appropriate controls before production use. See the webhook security guide for a full comparison of auth modes and a recommended shared-secret Lambda authorizer pattern.

3. FPolicy Handler (fpolicy_handler.py)

Subscribes to EventBridge events from the FPolicy ECS Fargate server and forwards file operation events.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager
    detail = event.get("detail", {})  # EventBridge event payload

    if USE_OTLP:
        payload = format_for_otlp(detail)
    else:
        payload = format_for_loki(detail)

    ship_to_grafana(payload, auth_header)

Labels: {service_name="fsxn-fpolicy", source="ontap", operation="create"}

CloudFormation: Three Templates, Zero Hardcoded Values

Each template is fully parameterized:

| Template | Purpose | Key Parameters |
|---|---|---|
| template.yaml | FSx audit log poller Lambda | S3AccessPointArn, GrafanaCredentialsSecretArn, LokiEndpoint, ScheduleExpression |
| template-ems.yaml | EMS webhook Lambda | GrafanaCredentialsSecretArn, LokiEndpoint, EmsParserLayerArn |
| template-fpolicy.yaml | FPolicy EventBridge Lambda | GrafanaCredentialsSecretArn, LokiEndpoint, EventBusName |

The LokiEndpoint parameter accepts both OTLP Gateway and Loki Push API URLs — the Lambda auto-detects the mode. The quickstart template also sets Lambda reserved concurrency to 1 and provisions a Scheduler DLQ with retry policy to avoid overlapping poller runs and preserve failed scheduled invocations. Processing bounds (MAX_KEYS_PER_RUN, SAFETY_THRESHOLD_MS) are configured via Lambda environment variables.

Trigger Model: EventBridge Scheduler Polling

FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge ObjectCreated events. Instead, this integration uses an EventBridge Scheduler polling pattern:

  1. EventBridge Scheduler invokes the Lambda every 5 minutes (configurable via ScheduleExpression parameter)
  2. Lambda lists new files via ListObjectsV2 on the S3 Access Point, using StartAfter to skip already-processed keys
  3. Lambda reads and processes each new file, shipping logs to Grafana Cloud
  4. Checkpoint (SSM Parameter Store) tracks the last successfully processed S3 key — on the next invocation, only newer files are processed

This pattern is simple, cost-effective, and works with AWS S3 API-compatible read paths such as FSx for ONTAP S3 Access Points. The trade-off is polling latency (up to 5 minutes by default) vs. the near-real-time delivery of event-driven triggers.
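The listing step can be sketched as below. This is an illustrative version, not the repository's `list_new_keys`: it takes the S3 client as an explicit argument (an assumption made here so the function can be exercised with a stub) and relies only on the standard `list_objects_v2` parameters `Bucket`, `Prefix`, `StartAfter`, `MaxKeys`, and `ContinuationToken`.

```python
from typing import Any, Dict, List, Optional


def list_new_keys(s3_client: Any, access_point_arn: str, prefix: str,
                  last_key: Optional[str], limit: int = 100) -> List[str]:
    """List up to `limit` object keys that sort after the checkpointed key."""
    keys: List[str] = []
    kwargs: Dict[str, Any] = {"Bucket": access_point_arn, "Prefix": prefix}
    if last_key:
        kwargs["StartAfter"] = last_key  # Skip everything already processed
    while len(keys) < limit:
        kwargs["MaxKeys"] = min(1000, limit - len(keys))
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        # Later pages are addressed by the continuation token only
        kwargs.pop("StartAfter", None)
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return keys[:limit]
```

With boto3, `s3_client` would be `boto3.client("s3")` and `access_point_arn` is passed where a bucket name normally goes. Because ListObjectsV2 returns keys in lexical order, the last key of a fully processed batch is a valid next checkpoint.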

CloudTrail alternative: CloudTrail data events do work with FSx ONTAP S3 Access Points (confirmed by NetApp Workload Factory's Journal table feature). However, they add delivery latency (5–15 minutes end to end in my validation) and cost $0.10 per 100,000 events, which makes polling the better default for this use case. See the CloudTrail trigger alternative for a full analysis and CloudFormation example.

# CloudFormation: EventBridge Scheduler with retry and DLQ
AuditLogSchedule:
  Type: AWS::Scheduler::Schedule
  Properties:
    ScheduleExpression: !Ref ScheduleExpression  # default: rate(5 minutes)
    FlexibleTimeWindow:
      Mode: 'OFF'
    Target:
      Arn: !GetAtt LogShipperFunction.Arn
      RoleArn: !GetAtt SchedulerRole.Arn
      Input: !Sub '{"source": "scheduler", "s3_access_point_arn": "${S3AccessPointArn}", "prefix": "${S3KeyPrefix}"}'
      RetryPolicy:
        MaximumRetryAttempts: 2
        MaximumEventAgeInSeconds: 3600
      DeadLetterConfig:
        Arn: !GetAtt SchedulerDLQ.Arn

The handler also accepts S3 event format for manual testing via aws lambda invoke, so you can still test individual files without waiting for the scheduler.
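For reference, an S3-event-shaped test payload and a possible way to unpack it might look like this. The body of `extract_s3_records` is a guess at what such a helper could do, not the repository's implementation; the `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the standard S3 event notification structure, in which object keys are URL-encoded.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def extract_s3_records(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Pull bucket name and decoded object key out of an S3-event-shaped payload."""
    records = []
    for rec in event.get("Records", []):
        s3 = rec.get("s3", {})
        key = s3.get("object", {}).get("key")
        if key:
            records.append({
                "bucket": s3.get("bucket", {}).get("name", ""),
                "key": unquote_plus(key),  # S3 events URL-encode keys ("+" for spaces)
            })
    return records


# Hypothetical manual-test payload for a single audit file
test_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "test"},
            "object": {"key": "audit/svm-prod-01/audit+log.json"},
        }
    }]
}
print(extract_s3_records(test_event))
```

Saving `test_event` as JSON and passing it to `aws lambda invoke` exercises the non-scheduler branch of the handler for one file.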

Checkpoint Semantics

The quickstart uses a simple high-watermark checkpoint: the last successfully processed object key is stored in SSM Parameter Store, and the next run lists keys after that value.

This works when audit log object keys are monotonically increasing and immutable. For production, validate your audit log naming and rotation behavior. If files can arrive late, be overwritten, or appear out of lexical order, use a stronger checkpoint model such as:

  • Keeping a short lookback window
  • Deduplicating by object key + ETag or LastModified
  • Storing per-object processing state in DynamoDB
  • Updating the checkpoint only after confirmed Grafana delivery

The quickstart already applies the last of these: the checkpoint advances only after Grafana returns a successful response for that object. If delivery still fails after retries, the Lambda raises an error and the next scheduled run retries from the last checkpoint.

Failure-path tests verify this behavior: if OTLP delivery returns failure after retries, the Lambda raises and the checkpoint does not advance past the failed object.
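The "raise after retries" contract can be sketched as a small wrapper. This is an illustrative pattern rather than the project's code: `ship_with_retry`, `DeliveryError`, the retryable status set, and the backoff values are all assumptions, and `send` stands in for whatever function performs the HTTP POST and returns the status code.

```python
import time
from typing import Callable

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # Assumed set of transient failures


class DeliveryError(RuntimeError):
    """Raised when delivery fails permanently or retries are exhausted."""


def ship_with_retry(send: Callable[[], int], attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns 2xx; raise so the checkpoint is not advanced."""
    for attempt in range(1, attempts + 1):
        status = send()
        if 200 <= status < 300:
            return status
        if status not in RETRYABLE_STATUS or attempt == attempts:
            raise DeliveryError(
                f"delivery failed with HTTP {status} after {attempt} attempt(s)"
            )
        sleep(base_delay * 2 ** (attempt - 1))  # Exponential backoff
    raise DeliveryError("no delivery attempts were made")
```

Because the wrapper raises instead of returning a failure code, the caller cannot accidentally checkpoint an undelivered object.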

Files that parse successfully but contain no shippable records are treated as successfully processed and checkpointed; only delivery failures or parse errors prevent checkpoint advancement.

For production, add a poison-pill policy for files that repeatedly fail parsing or delivery; otherwise one bad file can block later audit logs when using a high-watermark checkpoint. See the Grafana operations guide for poison-pill handling, pipeline health alarms, and custom metrics.

Use SSM Parameter Store for the quickstart high-watermark checkpoint. Move to DynamoDB when you need per-object state, deduplication, replay tracking, or concurrent workers.

Delivery semantics: This pipeline provides at-least-once delivery, not exactly-once. If a Lambda invocation succeeds in sending logs to Grafana but fails before updating the checkpoint (e.g., timeout or transient SSM error), the next run will re-process and re-send those objects. For most observability use cases, duplicate log entries are acceptable. If deduplication is required, implement it explicitly using object key + ETag, event ID, or payload hash in DynamoDB. Do not rely on backend-side deduplication as the primary correctness mechanism.
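One way to make the object key + ETag idea explicit is sketched below. The names `dedup_id` and `SeenStore` are hypothetical, and the in-memory set is only a stand-in for a DynamoDB table; in DynamoDB the same check would typically be a conditional put that succeeds only when the ID does not yet exist.

```python
import hashlib
from typing import Set


def dedup_id(key: str, etag: str) -> str:
    """Derive a stable ID from the object key and its ETag."""
    clean_etag = etag.strip('"')  # S3 returns ETags wrapped in double quotes
    return hashlib.sha256(f"{key}\n{clean_etag}".encode()).hexdigest()


class SeenStore:
    """In-memory stand-in for a table keyed by dedup ID."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def mark_if_new(self, item_id: str) -> bool:
        """Return True the first time an ID is seen, False on repeats."""
        if item_id in self._seen:
            return False
        self._seen.add(item_id)
        return True


store = SeenStore()
first = store.mark_if_new(dedup_id("audit/svm-prod-01/a.json", '"abc123"'))
repeat = store.mark_if_new(dedup_id("audit/svm-prod-01/a.json", '"abc123"'))
print(first, repeat)
```

Including the ETag means an overwritten file with the same key is treated as new content, while a plain re-delivery of the same object is skipped.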

Avoid Overlapping Poller Runs

Because the audit-log poller is schedule-driven, overlapping Lambda executions can race on the same key range and checkpoint. The quickstart template sets ReservedConcurrentExecutions: 1 to prevent this.

For higher-volume production pipelines, use a distributed lock (e.g., DynamoDB conditional write) and per-object processing state instead of relying on single-concurrency.

The quickstart also configures EventBridge Scheduler with a retry policy (2 retries, 1-hour event age) and a dedicated DLQ. If a scheduled invocation is throttled or fails, the event is preserved in the Scheduler DLQ for visibility and replay.

The quickstart uses 2 retries and 1-hour maximum event age to surface persistent failures quickly while avoiding unbounded retry storms. Increase these values only if your Grafana endpoint outage tolerance and duplicate-handling strategy are defined.

Processing Bounds

The poller bounds work per invocation to avoid timeout-related checkpoint corruption:

  • Max keys per run (MAX_KEYS_PER_RUN, default: 100): caps the number of files processed in a single invocation
  • Safety threshold (SAFETY_THRESHOLD_MS, default: 30000): stops processing when remaining Lambda time falls below 30 seconds

| Variable | Default | Purpose |
|---|---|---|
| MAX_KEYS_PER_RUN | 100 | Maximum audit log files processed per invocation |
| SAFETY_THRESHOLD_MS | 30000 | Stop processing before Lambda timeout |

Tune these values after observing Lambda duration, checkpoint age, Scheduler DLQ depth, FSx S3 Access Point read throughput, and Grafana send latency.

Because the checkpoint advances after each successfully delivered object, the next scheduled run resumes safely from where the previous run stopped.
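The interaction between the two bounds and the per-object checkpoint can be sketched in isolation. This is a simplified model, not the handler itself: `process_keys` and its injected callables (`process_and_ship`, `set_checkpoint`, `remaining_ms`) are hypothetical names that stand in for the S3 read and Grafana send, the SSM write, and `context.get_remaining_time_in_millis`.

```python
from typing import Callable, Iterable, Optional


def process_keys(keys: Iterable[str],
                 process_and_ship: Callable[[str], None],
                 set_checkpoint: Callable[[str], None],
                 remaining_ms: Callable[[], int],
                 safety_threshold_ms: int = 30000,
                 max_keys: int = 100) -> Optional[str]:
    """Process keys in order; return the last key that was checkpointed."""
    last_done = None
    for key in list(keys)[:max_keys]:
        if remaining_ms() < safety_threshold_ms:
            break  # Leave the rest for the next scheduled run
        process_and_ship(key)  # Raises on parse or delivery failure
        set_checkpoint(key)    # Reached only after confirmed delivery
        last_done = key
    return last_done


checkpoints = []
last = process_keys(["a.json", "b.json", "c.json"], lambda key: None,
                    checkpoints.append, lambda: 60000, max_keys=2)
print(last, checkpoints)
```

If `process_and_ship` raises on the second key, only the first key is checkpointed and the exception propagates, so the next run resumes at the failed object.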

S3 API Compatibility Boundary

FSx for ONTAP S3 Access Points provide S3 object API access (GetObject, ListObjectsV2, etc.) to file data that remains on the FSx for ONTAP file system. They should not be assumed to have the same bucket-level features or eventing behavior as standard S3 buckets. In this integration, the important difference is eventing: the audit log path uses Scheduler polling instead of S3 Event Notifications.

Minimum Read-Path Permissions

For the audit-log Lambda, verify:

  • s3:ListBucket on the S3 Access Point ARN
  • s3:GetObject on the S3 Access Point object ARN ({arn}/object/*)
  • S3 Access Point policy allows the Lambda execution role
  • The file-system user associated with the access point has read permission on the audit log path
  • If the access point is VPC-restricted, the Lambda network path can reach the S3 endpoint

IAM resource ARN examples:

# List access (s3:ListBucket)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>

# Object read (s3:GetObject)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>/object/*
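As a quick way to see how the two resource ARNs relate, the following sketch derives both IAM statements from a single access point ARN. The helper name `read_path_statements` and the example ARN are placeholders; only the two actions and resource patterns come from the list above.

```python
import json
from typing import Any, Dict, List


def read_path_statements(access_point_arn: str) -> List[Dict[str, Any]]:
    """Build the list and read statements for one S3 Access Point."""
    return [
        # Listing is granted on the access point itself
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": access_point_arn},
        # Object reads are granted on the /object/* path under it
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": f"{access_point_arn}/object/*"},
    ]


# Placeholder account ID and access point name
example_arn = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/fsxn-audit-ap"
policy = {"Version": "2012-10-17", "Statement": read_path_statements(example_arn)}
print(json.dumps(policy, indent=2))
```

A common mistake is granting s3:GetObject on the access point ARN without the `/object/*` suffix, which allows listing but fails on reads.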

First Success Path

If this is your first deployment, start small:

# Deploy only the audit log poller
export MAX_KEYS_PER_RUN=1
export SAFETY_THRESHOLD_MS=30000
bash integrations/grafana/scripts/deploy.sh --audit-only

Then validate:

  1. Confirm {service_name="fsxn-audit"} in Grafana Explore
  2. Check the Scheduler DLQ is empty
  3. Verify the SSM checkpoint advanced
  4. Create the dashboard
  5. Add EMS and FPolicy only after the audit path works (deploy.sh --all)

deploy.sh passes MAX_KEYS_PER_RUN and SAFETY_THRESHOLD_MS as Lambda environment variables. If unset, the template defaults (100 / 30000) are used.

The first validation should prove three things:

  • One audit file is visible in Grafana ({service_name="fsxn-audit"})
  • The SSM checkpoint advanced to the processed key
  • The Scheduler DLQ remains empty

One-Command Deploy and Cleanup

# Deploy all 3 stacks + update Lambda code (default is --all)
export GRAFANA_SECRET_ARN="arn:aws:secretsmanager:ap-northeast-1:<account>:secret:grafana/fsxn-loki-credentials-XXXXXX"
export S3_ACCESS_POINT_ARN="arn:aws:s3:ap-northeast-1:<account>:accesspoint/fsxn-audit-ap"
export LOKI_ENDPOINT="https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"

bash integrations/grafana/scripts/deploy.sh --all

The cleanup script removes CloudFormation stacks and optionally deletes synthetic test objects. It does not delete production FSx audit files through the FSx-attached S3 Access Point — those remain on the FSx file system. Pass --s3-bucket and --s3-prefix only if you uploaded test data to a regular S3 bucket during validation.

# Tear down everything (dependency-safe order)
bash integrations/grafana/scripts/cleanup.sh --all \
  --s3-bucket your-bucket --s3-prefix audit/svm-prod-01/

The cleanup script deletes stacks in dependency-safe order (API Gateway before Lambda) and handles DELETE_FAILED states gracefully.

LogQL Query Examples

High-cardinality fields such as UserName and ObjectName remain in the log body and are extracted at query time with | json; they are intentionally not promoted to Loki labels to avoid index bloat and cost.

Once logs arrive, Grafana Explore becomes your investigation tool:

# All audit logs
{service_name="fsxn-audit"}

# Filter by operation
{service_name="fsxn-audit"} | json | Operation="delete"

# Failed access attempts (security investigation)
{service_name="fsxn-audit"} | json | Result="Failure"

# EMS ransomware alerts
{service_name="fsxn-ems"} | json | event_name="arw.volume.state"

# FPolicy file operations
{service_name="fsxn-fpolicy"} | json | operation="create"

# Human-readable format
{service_name="fsxn-audit"} | json | line_format "{{.UserName}} {{.Operation}} {{.ObjectName}}"

# Log volume over time (for dashboards)
count_over_time({service_name="fsxn-audit"}[5m])

Dashboard: 4 Panels for Storage Observability

The repository includes a dashboard creation script, scripts/create-dashboard.sh, that provisions a Grafana dashboard via API. The four panel queries below are exactly what the script generates, verified against this project's OTLP-ingested log shape:

  1. Log Volume (Time series): count_over_time({service_name="fsxn-audit"}[5m])
  2. Operations Breakdown (Pie chart): sum by (Operation) (count_over_time({service_name="fsxn-audit"} | json [1h]))
  3. User Activity Top 10 (Bar gauge): topk(10, sum by (UserName) (count_over_time({service_name="fsxn-audit"} | json [1h])))
  4. Failed Events (Time series): count_over_time({service_name="fsxn-audit"} | json | Result="Failure" [5m])

Alerting: Ransomware Detection and Security Monitoring

Beyond dashboards, the integration includes three Grafana alerting rules provisioned via scripts/create-alerts.sh. The table below shows the alert conditions; the provisioning script wraps them into Grafana alert expressions using count/reduce/threshold steps.

| Alert | Detection Query (alert condition) | Severity |
|---|---|---|
| Ransomware Detection (ARP) | `count_over_time({service_name="fsxn-ems"} \| json \| ...` | |
| Quota Soft Limit Exceeded | `count_over_time({service_name="fsxn-ems"} \| json \| ...` | |
| Failed Access Spike | `count_over_time({service_name="fsxn-audit"} \| json \| ...` | |

The rules use Grafana's unified alerting format and are deployed to a "FSxN Alerts" folder. Configure contact points (Slack, PagerDuty, email) and notification policies in the Grafana UI to route alerts by severity or team label. The rule definitions are available as alerting/rules.yaml; see the alerting README for provisioning details, no-data behavior, contact point caveats, and threshold tuning guidance.

API compatibility: This script uses Grafana's Alerting Provisioning HTTP API (/api/v1/provisioning). Grafana 13+ introduces newer /apis routes while legacy /api routes remain available; check your Grafana Cloud version if provisioning fails. Provisioning alert rules does not automatically configure notification delivery — create or map contact points and notification policies before relying on these alerts for production response.

The sample rules treat "No data" as OK, because absence of matching ransomware, quota, or failed-access events is expected in normal operation. Query execution errors are routed as Error state for operator attention. These thresholds are starter defaults — tune them per SVM, workload, and normal user behavior before enabling production paging.

For production, monitor the pipeline itself: Scheduler DLQ depth, Lambda errors/throttles/duration, checkpoint age, and Grafana send failures.

Scheduler DLQ Replay

The Scheduler DLQ message is primarily an operational signal and replay payload. Because the poller uses a checkpoint, the next scheduled run may already retry the failed key range automatically.

When a scheduled invocation fails and lands in the Scheduler DLQ:

  1. Inspect the DLQ message (contains the scheduler input payload)
  2. Check the current checkpoint in SSM Parameter Store
  3. Check whether a later scheduled run has already advanced the checkpoint and delivered the missed objects
  4. If the checkpoint has advanced and Grafana shows the data, the failure was auto-recovered — delete the DLQ message
  5. If the checkpoint has NOT advanced, the next scheduled run will retry automatically from the last checkpoint
  6. For manual replay (if auto-retry is insufficient): invoke the Lambda directly with the scheduler payload, then delete the DLQ message

Before manually replaying a DLQ message, compare the DLQ payload with the current SSM checkpoint and Grafana ingestion state to avoid duplicate delivery.

For production, set a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 for the Scheduler DLQ.

Lessons Learned

| # | Lesson | Impact |
|---|---|---|
| 1 | Grafana Cloud OTLP endpoint is the recommended ingestion path; in my trial validation, OTLP Gateway succeeded while Loki Push API returned 530 | Use OTLP Gateway as default |
| 2 | Basic Auth = base64(instanceId:apiToken), not plain text | Auth failures if wrong encoding |
| 3 | Loki / Grafana Cloud can reject old timestamps depending on tenant limits; in my validation, logs older than 7 days were rejected | Use current timestamps in test data |
| 4 | Grafana HTTP API needs a Grafana Service Account token, not the Grafana Cloud ingestion token used for OTLP writes | Dashboard creation fails with wrong token |
| 5 | OTLP-ingested logs use service_name label, not job | Different query syntax than Loki Push API |
| 6 | CloudFormation stack deletion order matters (API GW before Lambda) | DELETE_FAILED if wrong order |

Verified Query Matrix

In this Grafana Cloud environment, service.name was exposed as the service_name index label via Loki's default OTLP attribute-to-label mapping. This mapping is configurable per tenant, so validate labels in your own environment if queries return unexpected results.

All queries tested with OTLP-ingested fields in this project's Grafana Cloud instance:

| Query | Expected | Verified |
|---|---|---|
| `{service_name="fsxn-audit"}` | Audit logs visible | |
| `{service_name="fsxn-audit"} \| json \| Operation="delete"` | | |
| `{service_name="fsxn-audit"} \| json \| Result="Failure"` | | |
| `{service_name="fsxn-ems"} \| json \| event_name="arw.volume.state"` | | |
| `{service_name="fsxn-fpolicy"} \| json \| operation="create"` | | |
| `count_over_time({service_name="fsxn-audit"}[5m])` | Time series data | |

Production and PoC Resources

For deeper validation and production planning:

What's Next

  • Part 7: Splunk HEC — serverless log delivery with built-in Firehose support
  • Elastic integration: Bulk API with date-based indices
  • Cost model refinement: validate the Cost Model with measured volume tiers from real-world FSx for ONTAP workloads

Series Navigation


Questions about the Grafana Cloud integration or OTLP Gateway? Drop a comment below.

Previous: Part 5 — Escape Vendor Lock-in with OTel Collector

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
