TL;DR
We built a direct Lambda-to-Grafana Cloud pipeline that ships FSx for ONTAP audit logs to Loki without an intermediate OTel Collector. Three Lambda functions cover all event sources:
- FSx for ONTAP audit logs → EventBridge Scheduler (every 5 min) → Lambda (polls & reads via S3 Access Point) → OTLP Gateway → Loki
- EMS webhooks (ransomware alerts, quota warnings) → API Gateway → Lambda → OTLP Gateway → Loki
- FPolicy file operations (real-time CIFS/SMB events) → ECS Fargate → SQS → Bridge Lambda → EventBridge → Lambda → OTLP Gateway → Loki
Everything is CloudFormation-templated, fully parameterized with no hardcoded values, and deployable with a single script. This is a Grafana-specific direct integration by design; use the Collector path from Part 5 when you need backend portability.
If you only want to validate the path quickly, jump to First Success Path and deploy the audit poller first.
This is the single-backend counterpart to Part 5: simpler when Grafana Cloud is the chosen destination, less flexible when backend portability, enrichment, redaction, or multi-backend routing is required.
Why Direct Send (Without OTel Collector)?
In Part 5, we showed how the OTel Collector decouples Lambda from backends. That's the right choice when you need multi-backend delivery or vendor migration flexibility.
But if Grafana Cloud is your single observability platform and your goal is a simple serverless path, direct OTLP can be a good starting point. For production pipelines that need richer buffering, metadata enrichment, redaction, or routing, Grafana recommends an Alloy / Collector-based architecture.
| Approach | Components | Latency | Cost |
|---|---|---|---|
| OTel Collector | Lambda → Collector (ECS/EC2) → Grafana | +50-100ms | Collector compute |
| Direct send | Lambda → Grafana OTLP Gateway | Minimal | Lambda only |
The direct path is simpler, cheaper, and has fewer failure points. You can always graduate to the Collector path later (Part 5 shows how). Direct send is a good fit when operational simplicity is more important than in-pipeline enrichment, redaction, buffering, and multi-backend routing. If those requirements become mandatory, move the same OTLP payload model behind Alloy or the OpenTelemetry Collector.
Direct send reduces moving parts, but it also removes the Collector / Alloy queueing layer. For production, decide whether Lambda retry and DLQ are sufficient, or whether you need SQS buffering, DLQ replay, or the Collector / Alloy path for stronger delivery guarantees during endpoint outages or throttling.
Delivery guarantee decision (see full pattern guide):
- Quickstart (this template): Scheduler retry + Scheduler DLQ + Lambda reserved concurrency + checkpoint retry
- Medium volume: add Lambda failure destination and operational replay procedures
- Higher reliability: insert SQS before shipping, or place Alloy / OTel Collector behind Lambda for batching, retry with persistent queue, transform, redaction, and multi-backend routing
- Multi-backend or redaction/routing: use Part 5 Collector path
Architecture
┌──────────────────────────────────────────────────────────┐
│                      Event Sources                       │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  EventBridge Scheduler                                   │
│    rate(5 minutes) ──→ Lambda ───────────────┐           │
│                        lists new files via   │           │
│                        S3 Access Point       │           │
│                        (checkpoint in SSM)   │           │
│                                              │           │
│  EMS Webhook                                 │           │
│    ──→ API GW ──→ Lambda (ems_handler) ──────┤           │
│                                              │           │
│  FPolicy                                     │           │
│    ──→ ECS Fargate ──→ SQS                   │           │
│    ──→ Bridge Lambda                         │           │
│    ──→ EventBridge                           │           │
│    ──→ Lambda (fpolicy_handler) ─────────────┤           │
│                                              ▼           │
│                                        OTLP Gateway      │
│                                       (Grafana Cloud)    │
│                                              │           │
│                                              ▼           │
│                                            Loki          │
│                                    (Explore, Dashboard)  │
└──────────────────────────────────────────────────────────┘
The audit log path uses a polling pattern: EventBridge Scheduler invokes Lambda every 5 minutes. Lambda lists new objects via the S3 Access Point, reads and processes them, then updates an SSM Parameter Store checkpoint to track progress. This avoids reliance on S3 Event Notifications, which are not supported by FSx for ONTAP S3 Access Points.
The same S3 Access Point boundary can be reused for other automation patterns (AI/ML, analytics, compliance archival) because the audit files remain on FSx for ONTAP while Lambda reads them through standard S3 object APIs — no data copy or NFS/SMB mount required.
This pattern does not replace ONTAP audit, EMS, or FPolicy configuration; it provides an AWS-native delivery and visualization layer for those ONTAP-native signals.
For business-critical workloads such as SAP, databases, VDI, or enterprise file services, treat this pipeline as an observability and evidence layer. It complements, but does not replace, workload-specific HA, backup, restore, and DR designs.
Use cases this unlocks:
- Investigate file access activity for FSx for ONTAP-hosted enterprise file shares
- Monitor available ONTAP EMS alerts, such as ransomware-related events, quota warnings, and storage/system events
- Correlate audit logs, EMS, and FPolicy file operations in a single Grafana dashboard
- Provide a lightweight observability path for SAP, database, VDI, and file service workloads using FSx for ONTAP
- Start with direct OTLP delivery and graduate to Alloy / Collector when governance or multi-backend routing is required
The FPolicy path has two Lambda roles: a bridge Lambda that converts ECS/FPolicy server SQS output into EventBridge events, and fpolicy_handler.py, which ships those normalized EventBridge events to Grafana Cloud.
Key Discovery: OTLP Gateway, Not Loki Push API
During E2E verification, the Loki Push API returned HTTP 530 in my trial account. The OTLP Gateway worked reliably in this project and is the recommended Grafana Cloud OTLP ingestion path.
For logs, Grafana Cloud routes OTLP log data to Loki, where it becomes queryable with LogQL.
Our Lambda auto-detects the endpoint mode from the URL:
def _is_otlp_endpoint(endpoint: str) -> bool:
    """Detect Grafana OTLP Gateway or generic OTLP/HTTP logs endpoint."""
    endpoint = endpoint.rstrip("/")
    return (
        "otlp-gateway" in endpoint
        or endpoint.endswith("/otlp")
        or endpoint.endswith("/otlp/v1/logs")
        or endpoint.endswith("/v1/logs")
    )

USE_OTLP = _is_otlp_endpoint(LOKI_ENDPOINT)
When using the OTLP Gateway, configure LOKI_ENDPOINT as the base OTLP endpoint ending in /otlp. The Lambda appends /v1/logs when sending logs:
# Configure as base endpoint (Lambda appends /v1/logs)
LOKI_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
# Lambda POSTs to: https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp/v1/logs
The handler also accepts the full path (/otlp/v1/logs) without double-appending.
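The append-without-doubling behaviour described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: `resolve_logs_url` is a hypothetical helper name, and the only rule it assumes is "append `/v1/logs` unless the endpoint already ends with it".

```python
def resolve_logs_url(endpoint: str) -> str:
    """Return the OTLP/HTTP logs URL for a configured endpoint.

    Hypothetical helper: appends /v1/logs to a base endpoint and leaves
    an endpoint that already ends in /v1/logs untouched.
    """
    base = endpoint.rstrip("/")
    if base.endswith("/v1/logs"):
        return base  # Full path already configured; do not append twice
    return f"{base}/v1/logs"


base = "https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"
print(resolve_logs_url(base))
print(resolve_logs_url(base + "/v1/logs/"))
```

Both calls print the same URL ending in `/otlp/v1/logs`, so either form of `LOKI_ENDPOINT` resolves to one POST target.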
| Endpoint | URL Pattern | Status |
|---|---|---|
| OTLP Gateway (preferred) | https://otlp-gateway-prod-<region>.grafana.net/otlp | ✅ Recommended by Grafana Cloud docs; verified in this project |
| Loki Push API (fallback) | https://logs-prod-<region>.grafana.net/loki/api/v1/push | ⚠️ May behave differently by account state; returned 530 in my trial validation |
| Self-hosted Loki OTLP | https://<loki-host>/otlp | Requires Loki OTLP ingestion support and structured metadata configuration; Loki 3.0+ enables structured metadata by default |
Authentication: Basic Auth with base64 Encoding
Grafana Cloud uses Basic Auth for both endpoints. The critical detail: the value is base64(instanceId:apiToken), not plain text concatenation.
from base64 import b64encode
instance_id = "123456" # From Grafana Cloud console
api_token = "glc_..." # logs:write scope
credentials = f"{instance_id}:{api_token}"
auth_header = f"Basic {b64encode(credentials.encode()).decode()}"
Credentials are stored in AWS Secrets Manager as JSON:
{"instance_id": "<id>", "api_key": "<token>"}
The Lambda reads this at cold start and caches the auth header for subsequent invocations. For production, use the shared auth_cache.py module, which provides TTL-based caching with automatic reload on 401/403, so credential rotation does not require waiting for a new Lambda execution environment.
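To make the TTL-plus-reload idea concrete, here is a minimal sketch of how such a cache could behave. It is not the contents of auth_cache.py: the class name `AuthHeaderCache`, the function `send_with_auth_retry`, and the injected `loader` and `send` callables are all assumptions made for illustration. In a real handler, `loader` would read the secret from Secrets Manager and build the Basic Auth header.

```python
import time
from typing import Callable, Optional


class AuthHeaderCache:
    """Cache an auth header for ttl_seconds and allow a forced reload."""

    def __init__(self, loader: Callable[[], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # e.g. reads Secrets Manager (assumed)
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._loaded_at = 0.0

    def get(self, force: bool = False) -> str:
        expired = self._clock() - self._loaded_at >= self._ttl
        if force or self._value is None or expired:
            self._value = self._loader()
            self._loaded_at = self._clock()
        return self._value


def send_with_auth_retry(cache: AuthHeaderCache,
                         send: Callable[[str], int]) -> int:
    """Call send(auth_header); on 401/403 reload credentials and retry once."""
    status = send(cache.get())
    if status in (401, 403):
        status = send(cache.get(force=True))
    return status
```

Retrying only once after a forced reload keeps a genuinely revoked token from looping; the second 401/403 is returned to the caller, which can then raise.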
Internally, normalized records are now converted directly to OTLP as the primary path. Loki Push formatting is kept only as a fallback mode. This aligns with Part 5's "OTLP as producer contract" principle. For the full OTLP resource/log-record/body mapping and fsxn.* attribute naming policy, see the Grafana Operations Guide.
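For readers who have not seen an OTLP/HTTP JSON logs payload, the sketch below shows the general shape: a resource carrying `service.name`, and one log record per normalized event with the event serialized into the body so that `| json` can extract fields at query time. The function name `build_otlp_logs` is hypothetical, the repository's own `format_for_otlp` and its fsxn.* attributes are not reproduced here, and stamping records with the current time is an assumption made to keep the example short.

```python
import json
import time
from typing import Any, Dict, List, Optional


def _attr(key: str, value: str) -> Dict[str, Any]:
    """Build one OTLP key/value attribute with a string value."""
    return {"key": key, "value": {"stringValue": value}}


def build_otlp_logs(records: List[Dict[str, Any]], service_name: str,
                    now_ns: Optional[int] = None) -> Dict[str, Any]:
    """Wrap normalized records in an OTLP/HTTP JSON logs envelope."""
    # Assumption: every record gets the send time; a real handler would
    # more likely map each event's own timestamp.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    log_records = [
        {
            "timeUnixNano": ts,
            # High-cardinality fields stay in the body as JSON
            "body": {"stringValue": json.dumps(rec, sort_keys=True)},
        }
        for rec in records
    ]
    return {
        "resourceLogs": [{
            "resource": {"attributes": [_attr("service.name", service_name)]},
            "scopeLogs": [{"logRecords": log_records}],
        }]
    }


payload = build_otlp_logs(
    [{"Operation": "create", "UserName": "alice"}], "fsxn-audit")
print(json.dumps(payload, indent=2))
```

The resulting dictionary is what would be JSON-encoded and POSTed to the `/v1/logs` path with `Content-Type: application/json`.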
The Three Lambda Handlers
1. FSx Audit Log Handler via S3 Access Point (handler.py)
Polls for new FSx ONTAP audit log files via S3 Access Point, parses JSON/EVTX, and ships to Grafana Cloud. Uses SSM Parameter Store to checkpoint progress between invocations.
def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager
    if event.get("source") == "scheduler":
        # Polling mode: list new files, process, update checkpoint
        prefix = event.get("prefix", "")  # Supplied in the Scheduler input
        last_key = get_checkpoint()  # SSM Parameter Store
        new_keys = list_new_keys(S3_ACCESS_POINT_ARN, prefix, last_key,
                                 limit=MAX_KEYS_PER_RUN)
        for key in new_keys:
            if context.get_remaining_time_in_millis() < SAFETY_THRESHOLD_MS:
                break  # Stop early, resume on next scheduled run
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=key)
            logs = parse_logs(raw["Body"].read(), key)
            ship_to_grafana(logs, key, auth_header)  # Raises on failure
            set_checkpoint(key)  # Only after confirmed delivery
    else:
        # Manual test mode using an S3-event-shaped payload
        for record in extract_s3_records(event):
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=record["key"])
            logs = parse_logs(raw["Body"].read(), record["key"])
            ship_to_grafana(logs, record["key"], auth_header)
Query in Grafana Explore:
{service_name="fsxn-audit"} | json | Operation="create"
2. EMS Webhook Handler (ems_handler.py)
Receives ONTAP EMS events via API Gateway, parses with the shared EMS parser layer, and forwards to Grafana.
def lambda_handler(event, context):
    auth_header = get_auth_header()  # Assumed helper, as in handler.py
    body = event.get("body", "")
    normalized = parse_ems_event(body)  # Shared Lambda Layer
    if USE_OTLP:
        payload = format_for_otlp(normalized)
    else:
        payload = format_for_loki(normalized)
    ship_to_grafana(payload, auth_header)
Labels: {service_name="fsxn-ems", source="ontap", severity="alert"}
Security note: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Use API Gateway authorization controls such as an API key, IAM authorization, Lambda authorizer, resource policy, WAF, or source IP restrictions based on your network design. The quickstart template uses AuthorizationType: NONE for simplicity; add appropriate controls before production use. See the webhook security guide for a full comparison of auth modes and a recommended shared-secret Lambda authorizer pattern.
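As a rough idea of what a shared-secret Lambda authorizer can look like, the sketch below implements an API Gateway REST API TOKEN authorizer that compares the presented token with a secret in constant time. It is not the pattern from the webhook security guide: the function name `authorizer_handler`, the `WEBHOOK_SHARED_SECRET` environment variable, and the `ontap-ems` principal are assumptions, and a production version would more likely load the secret from Secrets Manager.

```python
import hmac
import os
from typing import Any, Dict


def authorizer_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """TOKEN authorizer: allow the call only when the shared secret matches."""
    expected = os.environ.get("WEBHOOK_SHARED_SECRET", "")  # hypothetical name
    presented = event.get("authorizationToken", "")
    # compare_digest avoids leaking the match position through timing;
    # an unset secret always denies.
    allowed = bool(expected) and hmac.compare_digest(
        presented.encode(), expected.encode())
    return {
        "principalId": "ontap-ems",  # hypothetical principal label
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow" if allowed else "Deny",
                "Resource": event["methodArn"],
            }],
        },
    }
```

Returning an explicit Deny policy, rather than raising, keeps rejected calls visible in authorizer metrics while still blocking the request before it reaches ems_handler.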
3. FPolicy Handler (fpolicy_handler.py)
Subscribes to EventBridge events from the FPolicy ECS Fargate server and forwards file operation events.
def lambda_handler(event, context):
    auth_header = get_auth_header()  # Assumed helper, as in handler.py
    detail = event.get("detail")  # EventBridge event
    if USE_OTLP:
        payload = format_for_otlp(detail)
    else:
        payload = format_for_loki(detail)
    ship_to_grafana(payload, auth_header)
Labels: {service_name="fsxn-fpolicy", source="ontap", operation="create"}
CloudFormation: Three Templates, Zero Hardcoded Values
Each template is fully parameterized:
| Template | Purpose | Key Parameters |
|---|---|---|
| template.yaml | FSx audit log poller Lambda | S3AccessPointArn, GrafanaCredentialsSecretArn, LokiEndpoint, ScheduleExpression |
| template-ems.yaml | EMS webhook Lambda | GrafanaCredentialsSecretArn, LokiEndpoint, EmsParserLayerArn |
| template-fpolicy.yaml | FPolicy EventBridge Lambda | GrafanaCredentialsSecretArn, LokiEndpoint, EventBusName |
The LokiEndpoint parameter accepts both OTLP Gateway and Loki Push API URLs — the Lambda auto-detects the mode. The quickstart template also sets Lambda reserved concurrency to 1 and provisions a Scheduler DLQ with retry policy to avoid overlapping poller runs and preserve failed scheduled invocations. Processing bounds (MAX_KEYS_PER_RUN, SAFETY_THRESHOLD_MS) are configured via Lambda environment variables.
Trigger Model: EventBridge Scheduler Polling
FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge ObjectCreated events. Instead, this integration uses an EventBridge Scheduler polling pattern:
- EventBridge Scheduler invokes the Lambda every 5 minutes (configurable via the ScheduleExpression parameter)
- Lambda lists new files via ListObjectsV2 on the S3 Access Point, using StartAfter to skip already-processed keys
- Lambda reads and processes each new file, shipping logs to Grafana Cloud
- Checkpoint (SSM Parameter Store) tracks the last successfully processed S3 key; on the next invocation, only newer files are processed
This pattern is simple, cost-effective, and works with AWS S3 API-compatible read paths such as FSx for ONTAP S3 Access Points. The trade-off is polling latency (up to 5 minutes by default) vs. the near-real-time delivery of event-driven triggers.
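The listing step above can be sketched as follows. This is an illustrative version rather than the repository's `list_new_keys`: it takes the S3 client as an explicit argument (the handler shown earlier uses a module-level client), and it assumes the access point returns keys in lexical order, as standard S3 does for `ListObjectsV2`.

```python
from typing import Any, Dict, List, Optional


def list_new_keys(s3_client: Any, access_point_arn: str, prefix: str,
                  last_key: Optional[str], limit: int = 100) -> List[str]:
    """Return up to `limit` object keys that sort after `last_key`."""
    keys: List[str] = []
    kwargs: Dict[str, Any] = {"Bucket": access_point_arn, "Prefix": prefix}
    if last_key:
        kwargs["StartAfter"] = last_key  # Skip already-processed keys
    while len(keys) < limit:
        # ListObjectsV2 returns at most 1000 keys per page
        kwargs["MaxKeys"] = min(1000, limit - len(keys))
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return keys[:limit]
```

With boto3 this would be called as `list_new_keys(boto3.client("s3"), S3_ACCESS_POINT_ARN, prefix, get_checkpoint(), MAX_KEYS_PER_RUN)`; passing the client in also makes the function easy to exercise with a stub in unit tests.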
CloudTrail alternative: CloudTrail data events do work with FSx ONTAP S3 Access Points (confirmed by NetApp Workload Factory's Journal table feature). However, they add delivery latency and cost $0.10 per 100K events (in my validation, the CloudTrail-based path had 5–15 minutes of end-to-end delay), which makes the polling pattern the better default for this use case. See the CloudTrail trigger alternative for a full analysis and CloudFormation example.
# CloudFormation: EventBridge Scheduler with retry and DLQ
AuditLogSchedule:
  Type: AWS::Scheduler::Schedule
  Properties:
    ScheduleExpression: !Ref ScheduleExpression  # default: rate(5 minutes)
    FlexibleTimeWindow:
      Mode: 'OFF'
    Target:
      Arn: !GetAtt LogShipperFunction.Arn
      RoleArn: !GetAtt SchedulerRole.Arn
      Input: !Sub '{"source": "scheduler", "s3_access_point_arn": "${S3AccessPointArn}", "prefix": "${S3KeyPrefix}"}'
      RetryPolicy:
        MaximumRetryAttempts: 2
        MaximumEventAgeInSeconds: 3600
      DeadLetterConfig:
        Arn: !GetAtt SchedulerDLQ.Arn
The handler also accepts S3 event format for manual testing via aws lambda invoke, so you can still test individual files without waiting for the scheduler.
Checkpoint Semantics
The quickstart uses a simple high-watermark checkpoint: the last successfully processed object key is stored in SSM Parameter Store, and the next run lists keys after that value.
This works when audit log object keys are monotonically increasing and immutable. For production, validate your audit log naming and rotation behavior. If files can arrive late, be overwritten, or appear out of lexical order, use a stronger checkpoint model such as:
- Keeping a short lookback window
- Deduplicating by object key + ETag or LastModified
- Storing per-object processing state in DynamoDB
- Updating the checkpoint only after confirmed Grafana delivery
The checkpoint is advanced only after Grafana returns a successful response for that object. If delivery fails after retries, the Lambda raises an error and the next scheduled run will retry from the last checkpoint.
Failure-path tests verify this behavior: if OTLP delivery returns failure after retries, the Lambda raises and the checkpoint does not advance past the failed object.
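The failure-path behaviour can be illustrated with a stripped-down version of the processing loop. This is a sketch, not the repository's test suite: `process_keys`, `flaky_ship`, and the object keys are hypothetical, and the checkpoint is a plain list instead of SSM Parameter Store.

```python
from typing import Callable, Iterable, List, Optional


def process_keys(keys: Iterable[str],
                 ship: Callable[[str], None],
                 set_checkpoint: Callable[[str], None]) -> Optional[str]:
    """Ship each key in order; checkpoint only after ship() returns."""
    last_done = None
    for key in keys:
        ship(key)            # Raises on failure, so the next line is skipped
        set_checkpoint(key)  # Reached only after confirmed delivery
        last_done = key
    return last_done


checkpoints: List[str] = []


def flaky_ship(key: str) -> None:
    """Stand-in for ship_to_grafana that fails on the second object."""
    if key == "audit/002.json":
        raise RuntimeError("delivery failed after retries")


try:
    process_keys(["audit/001.json", "audit/002.json", "audit/003.json"],
                 flaky_ship, checkpoints.append)
except RuntimeError:
    pass  # In Lambda this propagates and marks the invocation as failed

print(checkpoints)  # ['audit/001.json']
```

The checkpoint stops at the last delivered object, so the next run lists keys after `audit/001.json` and retries `audit/002.json` first.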
Files that parse successfully but contain no shippable records are treated as successfully processed and checkpointed; only delivery failures or parse errors prevent checkpoint advancement.
For production, add a poison-pill policy for files that repeatedly fail parsing or delivery; otherwise one bad file can block later audit logs when using a high-watermark checkpoint. See the Grafana operations guide for poison-pill handling, pipeline health alarms, and custom metrics.
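One possible shape for a poison-pill policy is a per-key failure counter with a quarantine threshold, sketched below. Everything here is an assumption for illustration: the class name, the default of three attempts, and the in-memory dictionary. In a Lambda deployment the counts would have to live in a durable store such as DynamoDB, because in-memory state is not guaranteed to survive between invocations.

```python
from typing import Dict


class PoisonPillTracker:
    """Count failures per object key and flag keys that keep failing."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._failures: Dict[str, int] = {}  # Stand-in for a durable store

    def record_failure(self, key: str) -> bool:
        """Record one failure; return True once the key should be quarantined."""
        self._failures[key] = self._failures.get(key, 0) + 1
        return self._failures[key] >= self.max_attempts


tracker = PoisonPillTracker(max_attempts=3)
for attempt in range(3):
    quarantine = tracker.record_failure("audit/bad-file.evtx")
print(quarantine)
```

Once a key is flagged, the poller can record it for manual review and advance the checkpoint past it, so later audit files are no longer blocked.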
Use SSM Parameter Store for the quickstart high-watermark checkpoint. Move to DynamoDB when you need per-object state, deduplication, replay tracking, or concurrent workers.
Delivery semantics: This pipeline provides at-least-once delivery, not exactly-once. If a Lambda invocation succeeds in sending logs to Grafana but fails before updating the checkpoint (e.g., timeout or transient SSM error), the next run will re-process and re-send those objects. For most observability use cases, duplicate log entries are acceptable. If deduplication is required, implement it explicitly using object key + ETag, event ID, or payload hash in DynamoDB. Do not rely on backend-side deduplication as the primary correctness mechanism.
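If explicit deduplication is needed, one simple approach is to derive a stable identifier from object key plus ETag and record it before shipping. The sketch below uses an in-memory set as a stand-in for a DynamoDB table; `dedup_id` and `SeenStore` are hypothetical names, and in DynamoDB the `mark_if_new` step would typically be a conditional put that fails when the item already exists.

```python
import hashlib
from typing import Set


def dedup_id(key: str, etag: str) -> str:
    """Stable identifier for one version of one object."""
    return hashlib.sha256(f"{key}\n{etag}".encode()).hexdigest()


class SeenStore:
    """In-memory stand-in for a DynamoDB table keyed by dedup_id."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def mark_if_new(self, item_id: str) -> bool:
        """Return True the first time an id is seen, False afterwards."""
        if item_id in self._seen:
            return False
        self._seen.add(item_id)
        return True


store = SeenStore()
first = store.mark_if_new(dedup_id("audit/001.json", "etag-1"))
again = store.mark_if_new(dedup_id("audit/001.json", "etag-1"))
print(first, again)  # True False
```

Including the ETag means an overwritten file with new content gets a new identifier and is shipped again, while an unchanged file re-listed after a checkpoint failure is skipped.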
Avoid Overlapping Poller Runs
Because the audit-log poller is schedule-driven, overlapping Lambda executions can race on the same key range and checkpoint. The quickstart template sets ReservedConcurrentExecutions: 1 to prevent this.
For higher-volume production pipelines, use a distributed lock (e.g., DynamoDB conditional write) and per-object processing state instead of relying on single-concurrency.
The quickstart also configures EventBridge Scheduler with a retry policy (2 retries, 1-hour event age) and a dedicated DLQ. If a scheduled invocation is throttled or fails, the event is preserved in the Scheduler DLQ for visibility and replay.
The quickstart uses 2 retries and 1-hour maximum event age to surface persistent failures quickly while avoiding unbounded retry storms. Increase these values only if your Grafana endpoint outage tolerance and duplicate-handling strategy are defined.
Processing Bounds
The poller bounds work per invocation to avoid timeout-related checkpoint corruption:
- Max keys per run (MAX_KEYS_PER_RUN, default: 100): caps the number of files processed in a single invocation
- Safety threshold (SAFETY_THRESHOLD_MS, default: 30000): stops processing when remaining Lambda time falls below 30 seconds
| Variable | Default | Purpose |
|---|---|---|
| MAX_KEYS_PER_RUN | 100 | Maximum audit log files processed per invocation |
| SAFETY_THRESHOLD_MS | 30000 | Stop processing before Lambda timeout |
Tune these values after observing Lambda duration, checkpoint age, Scheduler DLQ depth, FSx S3 Access Point read throughput, and Grafana send latency.
Because the checkpoint advances after each successfully delivered object, the next scheduled run resumes safely from where the previous run stopped.
S3 API Compatibility Boundary
FSx for ONTAP S3 Access Points provide S3 object API access (GetObject, ListObjectsV2, etc.) to file data that remains on the FSx for ONTAP file system. They should not be assumed to have the same bucket-level features or eventing behavior as standard S3 buckets. In this integration, the important difference is eventing: the audit log path uses Scheduler polling instead of S3 Event Notifications.
Minimum Read-Path Permissions
For the audit-log Lambda, verify:
- s3:ListBucket on the S3 Access Point ARN
- s3:GetObject on the S3 Access Point object ARN ({arn}/object/*)
- S3 Access Point policy allows the Lambda execution role
- The file-system user associated with the access point has read permission on the audit log path
- If the access point is VPC-restricted, the Lambda network path can reach the S3 endpoint
IAM resource ARN examples:
# List access (s3:ListBucket)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>
# Object read (s3:GetObject)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>/object/*
First Success Path
If this is your first deployment, start small:
# Deploy only the audit log poller
export MAX_KEYS_PER_RUN=1
export SAFETY_THRESHOLD_MS=30000
bash integrations/grafana/scripts/deploy.sh --audit-only
Then validate:
- Confirm {service_name="fsxn-audit"} in Grafana Explore
- Check the Scheduler DLQ is empty
- Verify the SSM checkpoint advanced
- Create the dashboard
- Add EMS and FPolicy only after the audit path works (deploy.sh --all)
deploy.sh passes MAX_KEYS_PER_RUN and SAFETY_THRESHOLD_MS as Lambda environment variables. If unset, the template defaults (100 / 30000) are used.
The first validation should prove three things:
- One audit file is visible in Grafana ({service_name="fsxn-audit"})
- The SSM checkpoint advanced to the processed key
- The Scheduler DLQ remains empty
One-Command Deploy and Cleanup
# Deploy all 3 stacks + update Lambda code (default is --all)
export GRAFANA_SECRET_ARN="arn:aws:secretsmanager:ap-northeast-1:<account>:secret:grafana/fsxn-loki-credentials-XXXXXX"
export S3_ACCESS_POINT_ARN="arn:aws:s3:ap-northeast-1:<account>:accesspoint/fsxn-audit-ap"
export LOKI_ENDPOINT="https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"
bash integrations/grafana/scripts/deploy.sh --all
The cleanup script removes CloudFormation stacks and optionally deletes synthetic test objects. It does not delete production FSx audit files through the FSx-attached S3 Access Point — those remain on the FSx file system. Pass --s3-bucket and --s3-prefix only if you uploaded test data to a regular S3 bucket during validation.
# Tear down everything (dependency-safe order)
bash integrations/grafana/scripts/cleanup.sh --all \
--s3-bucket your-bucket --s3-prefix audit/svm-prod-01/
The cleanup script deletes stacks in dependency-safe order (API Gateway before Lambda) and handles DELETE_FAILED states gracefully.
LogQL Query Examples
High-cardinality fields such as UserName and ObjectName remain in the log body and are extracted at query time with | json; they are intentionally not promoted to Loki labels to avoid index bloat and cost.
Once logs arrive, Grafana Explore becomes your investigation tool:
# All audit logs
{service_name="fsxn-audit"}
# Filter by operation
{service_name="fsxn-audit"} | json | Operation="delete"
# Failed access attempts (security investigation)
{service_name="fsxn-audit"} | json | Result="Failure"
# EMS ransomware alerts
{service_name="fsxn-ems"} | json | event_name="arw.volume.state"
# FPolicy file operations
{service_name="fsxn-fpolicy"} | json | operation="create"
# Human-readable format
{service_name="fsxn-audit"} | json | line_format "{{.UserName}} {{.Operation}} {{.ObjectName}}"
# Log volume over time (for dashboards)
count_over_time({service_name="fsxn-audit"}[5m])
Dashboard: 4 Panels for Storage Observability
The repository includes a dashboard creation script that provisions a Grafana dashboard via API with four panels. The panel queries below are the exact queries generated by scripts/create-dashboard.sh and verified against this project's OTLP-ingested log shape:
- Log Volume (Time series): count_over_time({service_name="fsxn-audit"}[5m])
- Operations Breakdown (Pie chart): sum by (Operation) (count_over_time({service_name="fsxn-audit"} | json [1h]))
- User Activity Top 10 (Bar gauge): topk(10, sum by (UserName) (count_over_time({service_name="fsxn-audit"} | json [1h])))
- Failed Events (Time series): count_over_time({service_name="fsxn-audit"} | json | Result="Failure" [5m])
Alerting: Ransomware Detection and Security Monitoring
Beyond dashboards, the integration includes three Grafana alerting rules provisioned via scripts/create-alerts.sh.
The table below shows the alert conditions. The provisioning script wraps these into Grafana alert expressions using count/reduce/threshold steps.
| Alert | Detection Query (alert condition) | Severity |
|---|---|---|
| Ransomware Detection (ARP) | `count_over_time({service_name="fsxn-ems"} \ | json \ |
| Quota Soft Limit Exceeded | `count_over_time({service_name="fsxn-ems"} \ | json \ |
| Failed Access Spike | `count_over_time({service_name="fsxn-audit"} \ | json \ |
The rules use Grafana's unified alerting format and are deployed to a "FSxN Alerts" folder. Configure contact points (Slack, PagerDuty, email) and notification policies in the Grafana UI to route alerts by severity or team label. The rule definitions are available as alerting/rules.yaml; see the alerting README for provisioning details, no-data behavior, contact point caveats, and threshold tuning guidance.
API compatibility: This script uses Grafana's Alerting Provisioning HTTP API (/api/v1/provisioning). Grafana 13+ introduces newer /apis routes while legacy /api routes remain available; check your Grafana Cloud version if provisioning fails. Provisioning alert rules does not automatically configure notification delivery; create or map contact points and notification policies before relying on these alerts for production response.
The sample rules treat "No data" as OK, because absence of matching ransomware, quota, or failed-access events is expected in normal operation. Query execution errors are routed as Error state for operator attention. These thresholds are starter defaults — tune them per SVM, workload, and normal user behavior before enabling production paging.
For production, monitor the pipeline itself: Scheduler DLQ depth, Lambda errors/throttles/duration, checkpoint age, and Grafana send failures.
Scheduler DLQ Replay
The Scheduler DLQ message is primarily an operational signal and replay payload. Because the poller uses a checkpoint, the next scheduled run may already retry the failed key range automatically.
When a scheduled invocation fails and lands in the Scheduler DLQ:
- Inspect the DLQ message (contains the scheduler input payload)
- Check the current checkpoint in SSM Parameter Store
- Check whether a later scheduled run has already advanced the checkpoint and delivered the missed objects
- If the checkpoint has advanced and Grafana shows the data, the failure was auto-recovered — delete the DLQ message
- If the checkpoint has NOT advanced, the next scheduled run will retry automatically from the last checkpoint
- For manual replay (if auto-retry is insufficient): invoke the Lambda directly with the scheduler payload, then delete the DLQ message
Before manually replaying a DLQ message, compare the DLQ payload with the current SSM checkpoint and Grafana ingestion state to avoid duplicate delivery.
For production, set a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 for the Scheduler DLQ.
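A minimal sketch of that alarm, expressed as the keyword arguments for CloudWatch's `put_metric_alarm` call, is shown below. The helper name `dlq_alarm_kwargs`, the queue and alarm names, the 5-minute period, and the choice to treat missing data as not breaching are assumptions; the quickstart template may define the alarm differently or not at all.

```python
from typing import Any, Dict


def dlq_alarm_kwargs(queue_name: str, alarm_name: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments that fire when the DLQ is non-empty."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,                 # Assumed: matches the 5-minute schedule
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Hypothetical names; substitute the queue created by your stack.
kwargs = dlq_alarm_kwargs("fsxn-scheduler-dlq", "fsxn-scheduler-dlq-not-empty")

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Add `AlarmActions` with an SNS topic ARN if the alarm should notify someone rather than only change state.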
Lessons Learned
| # | Lesson | Impact |
|---|---|---|
| 1 | Grafana Cloud OTLP endpoint is the recommended ingestion path; in my trial validation, OTLP Gateway succeeded while Loki Push API returned 530 | Use OTLP Gateway as default |
| 2 | Basic Auth = base64(instanceId:apiToken), not plain text | Auth failures if wrong encoding |
| 3 | Loki / Grafana Cloud can reject old timestamps depending on tenant limits; in my validation, logs older than 7 days were rejected | Use current timestamps in test data |
| 4 | Grafana HTTP API needs a Grafana Service Account token, not the Grafana Cloud ingestion token used for OTLP writes | Dashboard creation fails with wrong token |
| 5 | OTLP-ingested logs use service_name label, not job | Different query syntax than Loki Push API |
| 6 | CloudFormation stack deletion order matters (API GW before Lambda) | DELETE_FAILED if wrong order |
Verified Query Matrix
In this Grafana Cloud environment, service.name was exposed as the service_name index label via Loki's default OTLP attribute-to-label mapping. This mapping is configurable per tenant, so validate labels in your own environment if queries return unexpected results.
All queries tested with OTLP-ingested fields in this project's Grafana Cloud instance:
| Query | Expected | Verified |
|---|---|---|
| {service_name="fsxn-audit"} | Audit logs visible | ✅ |
| `{service_name="fsxn-audit"} \| json \| Operation="delete"` |
| `{service_name="fsxn-audit"} \| json \| Result="Failure"` |
| `{service_name="fsxn-ems"} \| json \| event_name="arw.volume.state"` |
| `{service_name="fsxn-fpolicy"} \| json \| operation="create"` |
| count_over_time({service_name="fsxn-audit"}[5m]) | Time series data | ✅ |
Production and PoC Resources
For deeper validation and production planning:
- Delivery Guarantee Patterns — Quickstart → Medium → Higher reliability → Multi-backend
- Webhook Security Guide — Auth modes, Lambda authorizer, production baseline
- Grafana Operations Guide — Alarms, tuning, poison-pill, ownership, compliance
- CloudTrail Trigger Alternative — Event-driven alternative analysis
- PoC Checklist — Go/No-Go criteria for stakeholder sign-off
- Cost Model — Direct send vs Collector vs Firehose cost comparison
- Alerting README — Provisioning details, thresholds, contact point caveats
- Graduating to Alloy — Move from direct Lambda OTLP send to an Alloy-backed telemetry pipeline
- Partner Solution Brief — Target customers, PoC scope, deliverables, and responsibility boundaries
What's Next
- Part 7: Splunk HEC — serverless log delivery with built-in Firehose support
- Elastic integration: Bulk API with date-based indices
- Cost model refinement: validate the Cost Model with measured volume tiers from real-world FSx for ONTAP workloads
Series Navigation
- Part 1: Why Your FSx for ONTAP Logs Deserve Better
- Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
- Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
- Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
- Part 5: Escape Vendor Lock-in with OTel Collector
- Part 6: Direct-to-Grafana: Shipping Logs via OTLP Gateway (this post)
Questions about the Grafana Cloud integration or OTLP Gateway? Drop a comment below.
Previous: Part 5 — Escape Vendor Lock-in with OTel Collector
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations