DEV Community: CubeAPM

The Complete Guide to AWS Service Monitoring with OpenTelemetry (2026)

CubeAPM — Wed, 20 May 2026 06:30:48 +0000

_Originally published on cubeapm.com_
Nearly three-quarters of organizations are either already using OpenTelemetry or actively planning to adopt it, according to research by Enterprise Management Associates. On AWS, that shift is not theoretical anymore. It changes how engineering teams instrument services, route telemetry, and pick backends.

This AWS service monitoring guide is for DevOps engineers, SREs, and platform teams running workloads on AWS. It covers how OpenTelemetry fits into AWS environments, which instrumentation patterns hold up in production, how to route telemetry to different backends, and where teams get tripped up at scale. The goal is a complete, honest picture, not a vendor pitch.

What Is OpenTelemetry and Why Does It Matter for AWS

OpenTelemetry (OTel) is a CNCF-graduated open-source project. It provides a unified standard for collecting metrics, logs, and traces from distributed systems. At its core, OTel consists of three things:

Language-specific SDKs for auto- and manual instrumentation

A vendor-neutral wire protocol called OTLP (OpenTelemetry Protocol)

The OpenTelemetry Collector, a standalone pipeline component that receives, processes, and exports telemetry

The most important thing to understand is what OpenTelemetry is not. It is not a monitoring backend. It does not store data, draw dashboards, or fire alerts. It is the collection and transport layer. Think of it as the plumbing between your services and whatever system your team uses to analyze the data.

On AWS, the traditional approach uses CloudWatch as both collector and backend, with X-Ray handling distributed tracing. That model works well for purely AWS-native stacks. Teams run into friction when:

Services span multiple AWS accounts, regions, or clouds

Engineers need to correlate traces and metrics without jumping between X-Ray and CloudWatch Metrics separately

The AWS-native cost model with per-metric, per-API call, and per-log GB charges becomes expensive at scale

Teams want to switch backends without re-instrumenting every service

OpenTelemetry addresses these friction points by separating instrumentation from the destination. Teams instrument code once with OTel SDKs, route data through the Collector, and send it to any OTLP-compatible backend, such as CubeAPM, or to multiple destinations simultaneously.

The OTel Data Model: Metrics, Traces, and Logs

Before setting up pipelines, it helps to understand how OpenTelemetry models telemetry data and how each signal maps to AWS monitoring use cases.

Metrics

Metrics are numeric measurements over time, such as CPU utilization, request count, error rate, and queue depth. In OpenTelemetry, metrics carry rich semantic labels called attributes and support multiple aggregation types: counters, gauges, and histograms.

Compared to standard CloudWatch metrics, OTel metrics support higher attribute cardinality and finer aggregation control. That matters when filtering simultaneously by service, region, and deployment version.
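Of the aggregation types mentioned above, the histogram is the least intuitive. The following is a minimal Python sketch of explicit-bucket counting, not the OTel SDK itself. The latency values and bucket boundaries are hypothetical, and the sketch assumes each bucket includes its upper bound.

```python
from bisect import bisect_left


def bucket_counts(values: list[float], bounds: list[float]) -> list[int]:
    """Count values into explicit buckets; each bucket includes its upper bound."""
    counts = [0] * (len(bounds) + 1)  # final slot is the overflow bucket
    for value in values:
        # bisect_left returns the first boundary that is >= value.
        counts[bisect_left(bounds, value)] += 1
    return counts


# Hypothetical request latencies in milliseconds and bucket boundaries.
latencies_ms = [80, 100, 120, 400, 2800]
boundaries_ms = [100, 250, 500, 1000]
histogram = bucket_counts(latencies_ms, boundaries_ms)
print(histogram)
```

A backend can estimate percentiles from these counts without storing every individual measurement, which is why histograms suit latency metrics.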

Traces

A trace represents the path one request takes through a distributed system. It consists of spans. Each span captures a unit of work with a start time, duration, parent-child relationship, and attributes.

Traces are what let teams answer "why was this specific request slow?" rather than "why is p99 latency elevated?" That distinction matters when you are investigating an incident at 2 a.m.
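As a rough illustration of how span structure answers that question, here is a small Python sketch that computes the self time of each span, meaning its duration minus the time spent in its direct children. The span names and durations are hypothetical, and the sketch assumes child spans run sequentially without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: list[Span]) -> dict[str, float]:
    """Duration of each span minus the time spent in its direct children."""
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0.0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace: one root span with two child spans.
trace = [
    Span("a", None, "POST /checkout", 2100.0),
    Span("b", "a", "SELECT orders", 1900.0),
    Span("c", "a", "sqs.publish", 40.0),
]
times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

The root span looks slow in aggregate, but the self-time view points at the database call as the place where the request actually spent its time.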

Logs

Logs are structured or unstructured text records emitted by services. OTel's approach differs from CloudWatch Logs in one important way. OTel attaches trace context, specifically trace ID and span ID, to log records. That makes it possible to jump from a log line directly to the trace that produced it, and back again, without any manual correlation.
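To make that correlation concrete, here is a minimal Python sketch using only the standard library's logging module. It is not the OTel SDK. The logger name and the trace and span IDs are hard-coded, hypothetical values standing in for what an SDK would read from the active span.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace and span IDs to every record passing through the logger."""

    def __init__(self, trace_id: str, span_id: str) -> None:
        super().__init__()
        self.trace_id = trace_id
        self.span_id = span_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop the record, only enrich it


def build_logger(stream: io.StringIO, trace_id: str, span_id: str) -> logging.Logger:
    """Return a logger whose output lines carry the given trace context."""
    logger = logging.getLogger("checkout-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    logger.filters.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
    )
    logger.addHandler(handler)
    logger.addFilter(TraceContextFilter(trace_id, span_id))
    return logger


# Hypothetical IDs; a real SDK would take them from the current span context.
buffer = io.StringIO()
log = build_logger(buffer, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
log.info("payment authorized")
log_line = buffer.getvalue().strip()
print(log_line)
```

Because every line carries the same trace ID as the spans of that request, a backend can join the two signals with a simple equality lookup.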

The AWS Distro for OpenTelemetry (ADOT)

AWS maintains its own supported distribution of the OTel Collector called the AWS Distro for OpenTelemetry, or ADOT. It is an AWS-tested build of the upstream Collector that includes AWS-specific components pre-packaged and validated.

*ADOT includes:*

AWS-specific receivers and exporters for X-Ray, CloudWatch Metrics, and EMF
Support for AWS authentication via IAM roles and instance profiles
Pre-built ECS task definitions and EKS add-on configurations
Integration with CloudWatch Container Insights

ADOT is a sensible starting point for teams beginning with OTel on AWS. The trade-off is that it runs behind the upstream Collector in terms of feature availability. If you need components not bundled in ADOT, you either wait for them to be included or run the upstream contrib Collector instead.

Use ADOT when the stack is AWS-native, CloudWatch is the primary backend, and your team wants AWS-supported tooling. Consider the upstream OTel Collector Contrib when you need flexibility, faster access to new components, or multi-destination routing to non-AWS backends.

How to Monitor AWS Services with OpenTelemetry

Not all AWS services expose telemetry in the same way. The right instrumentation approach depends on whether the service runs on open-source technology or is AWS-proprietary.

Open-Source-Based AWS Services: Direct OTel Collection

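Several AWS managed services run on well-known open-source technology that OpenTelemetry already has receivers for. For these, teams can collect metrics directly using OTel receivers, bypassing CloudWatch and its associated streaming costs. Examples covered later in this guide include RDS for MySQL or PostgreSQL (mysqlreceiver, postgresqlreceiver) and MSK (kafkametricsreceiver).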

Direct collection avoids CloudWatch streaming costs and gives teams higher-cardinality metrics with more control over collection interval and attribute enrichment.

AWS-Proprietary Services: CloudWatch Metric Streams

For services without an open-source equivalent, such as EC2 at the host level, Lambda, API Gateway, S3, SQS, and DynamoDB, CloudWatch is the primary telemetry source. The OTel-compatible path is CloudWatch Metric Streams paired with an Amazon Data Firehose delivery stream.

*The data flows like this:*

*AWS Services > CloudWatch > Metric Stream > Amazon Data Firehose > OTLP Endpoint*

CloudWatch Metric Streams support OpenTelemetry 1.0 as an output format. Metrics arrive at the backend in standard OTLP format without requiring transformation. This eliminates a whole class of mapping problems that plagued earlier CloudWatch integrations.

Here is the high-level setup:

Create a Firehose delivery stream pointed at your OTLP-compatible backend HTTP endpoint.
In CloudWatch, create a Metric Stream using "Custom setup with Firehose".
Select "OpenTelemetry 1.0" as the output format.
Choose which AWS namespaces to stream, for example, AWS/EC2, AWS/Lambda, AWS/SQS.

Expect roughly 5 to 10 minutes for streaming to stabilize and data to appear in the backend.

Application-Level Instrumentation with OTel SDKs

For monitoring application code running on AWS rather than the infrastructure itself, teams instrument using OTel language SDKs. Lambda, ECS tasks, EKS pods, and EC2 workloads all support this.

Most major languages have stable SDKs with auto-instrumentation:

Java: opentelemetry-java-agent for zero-code auto-instrumentation
Python: opentelemetry-instrument command-line auto-instrumentation
Node.js: @opentelemetry/auto-instrumentations-node
Go: Manual instrumentation only (no official auto-instrumentation agent as of May 2026)
.NET: OpenTelemetry.AutoInstrumentation

Auto-instrumentation captures traces from common frameworks, including HTTP clients, database drivers, and message queues, with minimal code changes. Manual instrumentation is required for custom business logic spans, business-specific attributes, or proprietary internal systems.

Set OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES as environment variables in ECS task definitions or EKS pod specs. This ensures every trace and metric carries a consistent service identity without hardcoding it inside application code.
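For readers unfamiliar with the format of these variables, the sketch below parses them in plain Python. It is an illustration rather than the SDK's own parser. The attribute values are hypothetical, and the sketch assumes the documented behavior that OTEL_SERVICE_NAME takes precedence over a service.name entry in OTEL_RESOURCE_ATTRIBUTES.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse the comma-separated key=value format of OTEL_RESOURCE_ATTRIBUTES."""
    attributes: dict[str, str] = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attributes[key.strip()] = unquote(value.strip())
    return attributes


def effective_service_name(env: dict[str, str]) -> str:
    """Resolve the service name, preferring OTEL_SERVICE_NAME when it is set."""
    attrs = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    return env.get("OTEL_SERVICE_NAME") or attrs.get("service.name", "unknown_service")


# Hypothetical values, as they might appear in an ECS task definition or pod spec.
example_env = {
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=prod,service.version=1.4.2",
}
print(effective_service_name(example_env))
print(parse_resource_attributes(example_env["OTEL_RESOURCE_ATTRIBUTES"]))
```

Keeping these values in the deployment manifest means the same container image can report a different environment or version without a rebuild.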

The OpenTelemetry Collector: Configuration Patterns for AWS

The OTel Collector is the central component in most AWS monitoring pipelines. It runs as a sidecar, DaemonSet, or standalone service. It receives telemetry from SDKs, processes it, and forwards it to one or more backends.

Every Collector configuration has three sections. Receivers define how data comes in. Processors define what happens to it. Exporters define where it goes.

Deployment Patterns on AWS

Sidecar per Pod (EKS)

The Collector runs as a sidecar container in each pod. This gives per-pod resource isolation and fine-grained configuration control. The downside is higher resource overhead since every pod runs its own Collector instance.

DaemonSet (EKS or EC2)

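One Collector runs per node. This is the most common production pattern. It works well for infrastructure metrics and log collection. Applications send telemetry to the node-level Collector over the node's local network: via localhost for processes running directly on an EC2 host, or via the node IP for pods on EKS, since localhost inside a pod refers to the pod itself.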

Standalone Collector Cluster

A dedicated deployment of horizontally scaled Collector instances receives telemetry from all sources. Teams use this pattern in high-volume environments where sampling, filtering, and transformation need separation from collection.

According to the OpenTelemetry Collector Follow-up Survey published in January 2026, 65% of organizations now run more than 10 Collector instances in production. Kubernetes remains the dominant deployment environment at 81%, while VM usage has grown from 33% to 51% year over year.

Key Processors Every AWS Team Should Know

Batch Processor

The batch processor combines spans and metric data points before exporting. It reduces API calls and improves throughput. Use it in almost every pipeline.

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

Memory Limiter Processor

This processor prevents the Collector from consuming unbounded memory under high load. It is essential in production.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

Resource Detection Processor

This processor automatically enriches telemetry with AWS resource attributes by querying the EC2 metadata service and EKS API. It adds EC2 instance type, region, availability zone, EKS cluster name, and more without manual configuration.

processors:
  resourcedetection:
    detectors: [env, ec2, eks, ecs]
    timeout: 5s

Filter Processor

The filter processor drops metrics or spans matching defined criteria. Teams use it to remove low-value telemetry, such as health check traces, before data reaches a paid backend.

Tail-Based Sampling Processor

This processor makes sampling decisions after a full trace is collected. It lets teams retain 100% of error traces and slow traces regardless of total volume. It requires a stateful Collector deployment because all spans from a trace must arrive at the same Collector instance.

Routing Telemetry: Backend Options for AWS OTel Data

Once the OTel Collector is collecting and processing telemetry, it needs somewhere to send it. Backend selection is one of the more consequential decisions in an OTel deployment, and it is worth thinking through before the pipeline goes to production.

Amazon CloudWatch

CloudWatch is the natural first stop for AWS-native teams. It accepts OTel metrics via Metric Streams and added direct OTLP ingestion for metrics in April 2026, including anomaly detection on OTel metrics without requiring static thresholds. For traces, AWS X-Ray remains the native backend, and ADOT includes X-Ray exporters.

*CloudWatch works well for:*

Teams operating entirely within AWS with no multi-cloud requirements
Organizations already invested in CloudWatch dashboards and alarm workflows
Teams that want a fully managed backend with no additional infrastructure to run

The friction shows up at scale. Per-metric charges, per-API call fees, and per-GB log ingestion add up faster than most teams expect. Correlating traces in X-Ray with metrics in CloudWatch also means jumping between two separate consoles during an incident, which slows the investigation down. Custom metrics have limited cardinality support compared to OTel-native backends.

Open-Source Backends

Teams that want full control over their observability stack often build around the Prometheus and Grafana ecosystem. Prometheus handles metrics, Loki handles logs, and Tempo handles traces. The OTel Collector exports to each component via standard protocols. Grafana Cloud offers a managed hosted version if running all three components is too much to operate.

This approach gives maximum flexibility and avoids vendor lock-in at both the instrumentation and storage layers. The trade-off is operational complexity. Running and maintaining multiple components, wiring up trace-to-log correlation, and managing retention across separate systems takes real engineering effort. It is a good fit for teams with strong platform engineering capacity who want to own the full stack.

Commercial SaaS APM Platforms

Most commercial APM platforms now accept OTLP ingestion natively. Teams run the OTel Collector to collect and preprocess telemetry, then forward it to the SaaS platform via its OTLP endpoint. This preserves vendor-neutral instrumentation while keeping the managed analytics experience.

This model suits teams that want a fully managed backend with rich dashboards, anomaly detection, and broad integration support, without operating their own infrastructure. It is common in enterprises with existing software contracts already in place.

The cost trade-off is significant at scale. Most SaaS platforms bill per host, per metric, or per data volume across multiple dimensions. In dynamic AWS environments where ECS tasks scale up and down and Lambda concurrency spikes, those bills become hard to predict. Lock-in at the instrumentation layer is reduced since teams use OTel SDKs, but lock-in at the storage and query layer remains.

Self-Hosted OpenTelemetry-Native Backends

A growing category of backends ingests OTLP natively and stores metrics, traces, logs, and events in a single unified data model. These run inside the customer's own cloud account or VPC. Telemetry never leaves the customer's environment.

CubeAPM falls into this category. It is built natively on OpenTelemetry and designed specifically for teams that want full-stack observability without sending data to an external SaaS platform. For AWS environments, it supports both CloudWatch Metric Streams for proprietary services and direct OTel collection for open-source-based services like RDS, ElastiCache, and MSK. It accepts telemetry from OTel Collectors, Prometheus exporters, and existing vendor agents, so teams already running proprietary agents can migrate without re-instrumenting application code.

Pricing is per GB of ingestion at a flat rate, which keeps costs predictable as EKS clusters scale, Lambda concurrency grows, and service count increases. Smart sampling retains traces tied to errors and performance anomalies while reducing storage overhead on routine traffic.

*Self-hosted OTel-native backends are a good fit for:*

Teams with data residency or compliance requirements that prevent sending telemetry to external systems

Organizations where SaaS observability costs have become a material budget concern

Environments that span AWS and non-AWS infrastructure and need consistent telemetry semantics across both

The Multi-Backend Pattern

The OTel Collector supports exporting to multiple destinations simultaneously. Many production teams use this to send telemetry to a full-fidelity backend for investigation while keeping CloudWatch fed for AWS-native alarms and operational dashboards.

# Assumes the otlp receiver and both processors are defined elsewhere in the config.
exporters:
  # Primary OTLP backend over gRPC; replace with your own endpoint.
  otlp/primary:
    endpoint: "your-backend:4317"
  # CloudWatch metrics are written by the contrib awsemf exporter (Embedded Metric Format).
  awsemf:
    namespace: "YourService"
    region: "us-east-1"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      # memory_limiter goes first so it can refuse data before batching.
      processors: [memory_limiter, batch]
      exporters: [otlp/primary, awsemf]

This avoids having to choose between AWS-native alarming and deeper observability tooling. Both can coexist in the same pipeline.

AWS Service-Specific Monitoring Guidance

Amazon EKS

EKS monitoring has three distinct layers, and each needs separate handling.

Control Plane

Enable EKS Control Plane Logging in the AWS console. Use the k8sclusterreceiver to collect Kubernetes object metrics, including pod counts, deployment status, and node conditions.

Nodes

Deploy the OTel Collector as a DaemonSet. Use the hostmetricsreceiver for CPU, memory, disk, and network metrics. Add the kubeletstatsreceiver for container-level metrics.

Applications

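Inject OTel SDK instrumentation into application containers. Configure pods to send traces to the node-level Collector. Inside a pod, localhost refers to the pod itself rather than the node, so unless the pod uses host networking, point the endpoint at the node's IP. One common approach, assumed here, is to expose the node IP to the container as an environment variable (NODE_IP is a hypothetical name) populated from the Kubernetes Downward API field status.hostIP, with the Collector DaemonSet listening on a host port:

OTEL_EXPORTER_OTLP_ENDPOINT=http://$(NODE_IP):4317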

*Key metrics to track:*

Node CPU and memory pressure
Pod restart counts
Container CPU throttling rate
API server request latency

AWS Lambda

Lambda monitoring with OTel uses either the OTel Lambda layer, which wraps the Lambda runtime, or manual SDK initialization in function code.

The OTel Lambda layer is available as an AWS-managed Lambda Layer ARN. It supports auto-instrumentation for Python, Java, and Node.js. Add the layer and set the environment variables below. The wrapper script path can differ by runtime, so confirm the AWS_LAMBDA_EXEC_WRAPPER value against the documentation for the layer you use:

AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-handler
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector-endpoint
OTEL_SERVICE_NAME=my-function

OTel instrumentation adds latency to cold starts. For latency-sensitive functions, use the OTel SDK without auto-instrumentation. Manual initialization is lighter and gives more control over what gets traced.

*Key metrics to track:*

Duration at average and p99
Error rate
Throttle count
Concurrent executions
Cold start frequency

Amazon RDS

For RDS instances running MySQL or PostgreSQL, the OTel Collector connects directly to the database endpoint using the mysqlreceiver or postgresqlreceiver. These receivers pull performance schema metrics: query latency, connection counts, buffer pool hit rates, and replication lag.

This requires network connectivity between the Collector and the RDS endpoint, which typically means both are in the same VPC. Store the read-only monitoring user credential in AWS Secrets Manager rather than in the Collector config file.

*Key metrics to track:*

Connection count relative to the configured maximum
Read and write latency
Slow query count
Replication lag on read replicas
Buffer pool hit ratio

Amazon SQS and Amazon MSK (Kafka)

For SQS, use CloudWatch Metric Streams to capture queue depth, messages sent and received, and approximate age of the oldest message. There is no native OTel receiver for SQS as of May 2026.

For MSK (Kafka), the kafkametricsreceiver connects to the cluster and collects broker, topic, and consumer group metrics directly, bypassing CloudWatch costs.

The single most important metric to watch for both services is consumer lag. A growing queue depth on SQS or a growing offset lag on a Kafka consumer group is almost always the first signal that a downstream consumer is falling behind, well before the problem becomes visible in error rates or user-facing metrics.
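For Kafka, consumer lag is simple arithmetic over offsets. The sketch below shows one way to compute it and to flag a consumer that keeps falling behind. The partition numbers, offsets, and lag samples are hypothetical, and the "rising on every sample" rule is an illustrative heuristic rather than a recommended alert definition.

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: newest offset in the log minus the group's committed offset."""
    return {
        partition: max(end_offsets[partition] - committed.get(partition, 0), 0)
        for partition in end_offsets
    }


def is_falling_behind(lag_samples: list[int]) -> bool:
    """True when total lag increased on every consecutive sample."""
    return len(lag_samples) >= 2 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )


# Hypothetical offsets for a two-partition topic.
lag = consumer_lag({0: 1500, 1: 900}, {0: 1200, 1: 900})
total_lag = sum(lag.values())
print(lag, total_lag)

# Hypothetical total-lag readings taken one minute apart.
print(is_falling_behind([300, 450, 700]))
```

The same trend check applies to SQS if queue depth readings are substituted for total lag.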

A Real-World Scenario: Tracing a Latency Problem Across AWS Services

A mid-sized e-commerce company runs its checkout flow on AWS: an EKS-based checkout service, PostgreSQL on RDS, an SQS queue feeding order processing, and a Lambda function handling post-purchase notifications. For the first year, they monitor everything through CloudWatch.

The Problem

On a Saturday afternoon, checkout slows down. P99 latency climbs from 400ms to 2.8 seconds. The team opens CloudWatch and sees EKS CPU is fine, RDS CPU is at 78%, SQS queue depth is normal, and Lambda error rate is zero.

RDS CPU is elevated, but CloudWatch cannot tell them which query is slow, which service called it, or whether the issue is query performance or connection pool exhaustion. The team spends 90 minutes pulling slow query logs manually and cross-referencing application logs. They find a likely culprit but cannot be certain. The incident resolves as traffic drops. Root cause stays uncertain.

What Changes with OpenTelemetry

Two weeks later, the team instruments the stack:

Java auto-instrumentation on the checkout service, capturing HTTP and JDBC spans

A DaemonSet Collector on each EKS node with the postgresqlreceiver for RDS metrics

CloudWatch Metric Streams via Firehose for SQS and Lambda metrics

The OTel Lambda layer on the notification function

The resource detection processor automatically attaches service.name, deployment.environment, cloud.region, and EKS cluster name to every span and metric.

Next Saturday: Latency spikes again. This time, the team finds the slow trace within 30 seconds. It shows the full request path through the API handler, the database call, and the SQS publish. The slow span is a PostgreSQL query on the orders table taking 1.9 seconds, with the full query text and execution time in the span attributes.

Cross-referencing the postgresqlreceiver metrics confirms a high sequential scan count. A query filtering by customer_id and status together is running a full table scan because the composite index covering both columns does not exist.

From symptom to root cause: 8 minutes. They add the composite index, latency drops immediately, and they set an alert on sequential scan rate so the pattern triggers a notification before it becomes user-facing next time.

What Made the Difference

CloudWatch showed that the RDS CPU was high. The trace showed which query was slow, in which service, triggered by which user action. One tells you something is wrong. The other tells you exactly what to fix.

*A few specifics worth noting from this setup:*

The postgresqlreceiver connected directly to the RDS endpoint from inside the VPC, bypassing CloudWatch streaming costs entirely
SQS metrics came via CloudWatch Metric Streams since there is no native OTel receiver for SQS
Lambda used manual SDK initialization rather than full auto-instrumentation to keep cold start overhead under 50ms
Tail-based sampling retained 100% of slow checkout traces during the spike, while sampling routine successful requests at 15%

The team did not redesign anything. They added instrumentation to what already existed and gained the ability to see inside requests rather than watching resource metrics from the outside.

Sampling Strategies

Sampling is one of the most consequential decisions in any OTel deployment, and it is where teams most often make expensive mistakes.

At scale, sending 100% of traces to any backend is not sustainable. The challenge is keeping the traces that actually matter (errors, outliers, and slow requests) while discarding routine traffic.

Head-Based Sampling

Head-based sampling makes the keep-or-drop decision at the very start of a request, before any processing happens. It is simple and stateless. The problem is that it sometimes discards traces that turn out to contain errors discovered later in the request lifecycle. Teams find this out after an incident when they go looking for a trace and find nothing.
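The following Python sketch illustrates why head-based sampling is stateless. The decision depends only on the trace ID and a ratio, so every service reaches the same answer without coordination. It is similar in spirit to ratio-based samplers in OTel SDKs but is not their implementation, and the trace ID shown is a hypothetical example.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < int(ratio * (1 << 64))


# Hypothetical 128-bit trace ID in hexadecimal form.
example_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
kept_at_10_percent = head_sample(example_trace_id, 0.10)
kept_at_100_percent = head_sample(example_trace_id, 1.0)
print(kept_at_10_percent, kept_at_100_percent)
```

Nothing in the function knows whether the request will later fail, which is exactly the limitation described above.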

Tail-Based Sampling

Tail-based sampling makes the decision after a complete trace is assembled. This means teams can always keep error traces and high-latency traces. It requires a stateful Collector deployment. All spans from a given trace must arrive at the same Collector instance for the decision to be accurate. That typically means a standalone Collector cluster rather than a DaemonSet.

Tail-based sampling is the correct default for production systems where missing an error trace is unacceptable.

A practical starting configuration:

Keep 100% of traces containing errors
Keep 100% of traces exceeding the p99 latency threshold
Sample healthy traces at 10 to 20% initially
Adjust rates per service based on actual traffic volume observed over the first 30 days
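The starting policy above can be expressed as a small decision function. This Python sketch shows the logic only. It is not the Collector's tail-sampling processor, and the threshold and rate values are hypothetical placeholders to tune per service.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: TraceSummary,
    latency_threshold_ms: float,
    healthy_rate: float,
    rng: random.Random,
) -> bool:
    """Decide on a completed trace: always keep errors and slow traces."""
    if trace.has_error:
        return True
    if trace.duration_ms > latency_threshold_ms:
        return True
    # Healthy traffic is kept with probability healthy_rate.
    return rng.random() < healthy_rate


rng = random.Random(42)  # seeded only to make the demo repeatable
error_kept = keep_trace(TraceSummary(True, 120.0), 1000.0, 0.15, rng)
slow_kept = keep_trace(TraceSummary(False, 2800.0), 1000.0, 0.15, rng)
print(error_kept, slow_kept)
```

Because the function needs the whole trace summary, every span of a trace has to reach the same decision point, which is the reason tail-based sampling needs a stateful deployment.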

Alerting and SLOs with OpenTelemetry Data

OpenTelemetry does not define alerting. That is the backend's responsibility. But OTel data enables more precise alerting than traditional metric-only approaches.

Metric-Based Alerts

Set a threshold on a metric like error rate, queue depth, or p99 latency. The advantage of OTel metrics is that alerts can filter by service, deployment version, or availability zone using the attribute labels OTel automatically attaches. Achieving the same with most CloudWatch custom metrics requires significant additional configuration.

Trace-Based Alerting

Some backends support alerting when a specific span pattern appears, for example when a database query consistently exceeds 500ms. This catches subtle service degradation before it impacts aggregate metrics, which makes it more useful than metric-based alerting for catching problems early in their lifecycle.

Service Level Objectives

Define SLOs as ratios of good requests to total requests. With OTel traces, "good" can be defined precisely, for example as HTTP 2xx responses with latency under 300ms, measured at the span level. This is more accurate than SLOs built on synthetic checks or CloudWatch-level aggregations that mask individual request behavior.
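As a worked example of that ratio, the sketch below classifies request spans as good or bad and computes attainment. The span data is hypothetical, and the 300ms limit and 2xx rule simply mirror the example definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSpan:
    status_code: int
    duration_ms: float


def is_good(span: RequestSpan, latency_limit_ms: float = 300.0) -> bool:
    """A request is good when it returned 2xx and finished under the latency limit."""
    return 200 <= span.status_code < 300 and span.duration_ms < latency_limit_ms


def slo_attainment(spans: list[RequestSpan], latency_limit_ms: float = 300.0) -> float:
    """Ratio of good requests to total requests; an empty window counts as fully met."""
    if not spans:
        return 1.0
    good = sum(1 for span in spans if is_good(span, latency_limit_ms))
    return good / len(spans)


# Hypothetical spans from one evaluation window.
window = [
    RequestSpan(200, 120.0),  # good
    RequestSpan(200, 450.0),  # too slow
    RequestSpan(500, 80.0),   # server error
    RequestSpan(204, 299.0),  # good
]
attainment = slo_attainment(window)
print(attainment)
```

Comparing the result against a target such as 0.99 gives the remaining error budget for the window.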

Cost Management for AWS OTel Pipelines

Telemetry costs on AWS come from two places: AWS-side costs, including CloudWatch streaming and Firehose, and backend-side costs, including storage, query execution, and data retention.

Reducing AWS-Side Costs

Stream only the CloudWatch namespaces your team actively queries

For open-source-based services like RDS, ElastiCache, and MSK, use direct OTel collection rather than CloudWatch streaming

Configure Firehose buffering appropriately to reduce per-request API call volume

Reducing Backend-Side Costs

Apply filter processors in the Collector to drop low-value metrics, for example, per-minute health check granularity when 5-minute aggregates meet the team's alerting needs.

Use tail-based sampling rather than storing every trace

Set explicit retention policies. Raw trace data does not need 13 months of retention for most teams. Keeping traces for 30 to 90 days, with longer retention for metrics, is a common and practical policy.

The OTel Collector is the right place to make cost decisions, not the backend. Filter, sample, and transform data before it reaches storage. Changes made upstream are always cheaper than changes made after the data is already indexed and stored.

Migration Path: Moving from CloudWatch to OTel

Moving to an OTel-based architecture does not require a big-bang cutover. Most successful migrations follow a phased approach.

Phase 1: Add OTel Alongside CloudWatch

Deploy the OTel Collector in parallel with the existing CloudWatch setup. Instrument one or two services. Route OTel data to a backend. Compare what appears there versus CloudWatch. This builds confidence in the new pipeline without operational risk.

Phase 2: Expand Instrumentation

Instrument remaining services. Add CloudWatch Metric Streams for proprietary AWS services. Set up sampling policies.

Phase 3: Consolidate Backends

Once OTel-sourced data becomes the primary source of truth for alerts and dashboards, reduce CloudWatch dependency to AWS-operational signals only: billing alerts and service health events.

Phase 4: Evaluate Long-Term Architecture

At this point, teams have enough production data to decide whether a single backend meets their needs or whether a CloudWatch plus OTel-backend hybrid is the right long-term model.

The 2025 OTel Collector survey identified the most common production challenges: unreliable health checks, backward compatibility issues during Collector upgrades, and missing receivers for specific platforms. Plan for these in any migration. Stage Collector upgrades and maintain a tested rollback procedure. Those two practices prevent most of the production incidents teams encounter during this transition.

Common Pitfalls and How to Avoid Them

Missing Resource Attributes

Traces and metrics from different services cannot be correlated if they do not share a consistent service.name or deployment.environment. Standardize these attributes before onboarding the first service. Enforce them via the OTel Collector's resource processor if needed. Retrofitting attribute consistency across a large, running deployment is painful and disruptive.
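One lightweight way to enforce such a standard is a check that runs in CI or at service startup. The Python sketch below is a hypothetical helper, not part of any OTel library, and the required attribute list is an assumption based on the attributes named above.

```python
REQUIRED_ATTRIBUTES = ("service.name", "deployment.environment")


def missing_attributes(
    resource: dict[str, str],
    required: tuple[str, ...] = REQUIRED_ATTRIBUTES,
) -> list[str]:
    """Return the required attribute keys that are absent or empty."""
    return [key for key in required if not resource.get(key)]


# Hypothetical resource attributes reported by two services.
checkout_resource = {"service.name": "checkout", "deployment.environment": "prod"}
legacy_resource = {"service.name": "billing", "deployment.environment": ""}

print(missing_attributes(checkout_resource))
print(missing_attributes(legacy_resource))
```

Failing a deployment when the returned list is non-empty catches inconsistencies before they reach the backend.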

Collector as a Single Point of Failure

A single Collector instance handling all telemetry is a single point of failure. In production, run at least two Collector instances behind a load balancer for any high-volume pipeline. This is not optional for anything customer-facing.

Over-Instrumentation

Adding spans to every function call produces enormous trace volumes with low diagnostic value. Instrument service boundaries, external calls, database queries, and message queue operations. Internal helper functions do not need spans unless they are suspected problem areas.

Cold Start Overhead in Lambda

OTel SDK initialization adds to the Lambda cold start time. Profile this in staging before deploying to production. It is usually acceptable, but for latency-critical functions, it occasionally is not, and discovering this during a production incident is entirely avoidable.

Static Collector Configuration

Telemetry volume grows as services and traffic scale. Build Collector configuration changes into infrastructure-as-code from day one. A config.yaml committed to a repository and deployed through a pipeline is significantly easier to manage than hand-edited configurations scattered across Collector instances.

How CubeAPM Fits into AWS OpenTelemetry Monitoring

CubeAPM is an observability platform built natively on OpenTelemetry. It runs inside the customer's own cloud environment, meaning telemetry data stays within the customer's AWS account or VPC and is never sent to an external SaaS backend. The platform covers the full MELT stack: metrics, events, logs, and traces, all stored under a single unified data model.

In the context of this guide, CubeAPM acts as the backend receiving telemetry from the OTel Collector pipelines described in earlier sections. It supports both ingestion paths: direct OTel collection for open-source-based AWS services like RDS, ElastiCache, and MSK, and CloudWatch Metric Streams via Firehose for AWS-proprietary services like Lambda, SQS, and API Gateway.

No Re-Instrumentation Required

Teams already running Datadog, Elastic, or New Relic agents do not need to re-instrument their applications to evaluate CubeAPM. CubeAPM accepts telemetry directly from all three agent formats, alongside native OTel SDKs. Switching means changing the agent's destination endpoint, not touching application code.

This is relevant in AWS environments where services have run with proprietary agents for years. Teams can run CubeAPM in parallel, validate coverage, and migrate incrementally.

Pricing in AWS Context

CubeAPM uses a flat per-GB ingestion model at $0.15 per GB, with no per-host, per-container, or per-metric fees. For AWS environments where ECS tasks scale up and down, Lambda concurrency spikes, and EKS nodes autoscale, per-host billing produces unpredictable costs. A per-GB model scales with actual data volume rather than instance count.

At that rate, a team ingesting 10 TB (10,000 GB) of telemetry per month pays approximately $1,500. CubeAPM puts this at almost 50% less than the cost of large observability vendors.
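The arithmetic behind the monthly figure is easy to verify. This short Python sketch uses the $0.15 per GB rate quoted in this section and assumes decimal units (1 TB = 1,000 GB). The function name is illustrative only.

```python
from decimal import Decimal

PRICE_PER_GB = Decimal("0.15")  # flat rate quoted in this section


def monthly_cost(ingested_tb: Decimal, gb_per_tb: int = 1000) -> Decimal:
    """Monthly ingestion cost in USD for a given volume in terabytes."""
    return ingested_tb * gb_per_tb * PRICE_PER_GB


cost_decimal_units = monthly_cost(Decimal("10"))
cost_binary_units = monthly_cost(Decimal("10"), gb_per_tb=1024)
print(cost_decimal_units, cost_binary_units)
```

With binary units (1 TB = 1,024 GB) the same volume comes to $1,536, so the unit convention is worth confirming when estimating a budget.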

When CubeAPM Fits

CubeAPM suits AWS teams where:

Data residency or compliance requirements prevent sending telemetry to an external SaaS backend
Cost predictability is a priority as EKS clusters, Lambda invocations, and service count grow
Migration from Datadog or New Relic needs to happen without re-instrumenting services
Environments span AWS and non-AWS infrastructure and need consistent OTel semantics across both

It is a less natural fit for teams that want a fully managed SaaS experience with no responsibility for operating the observability backend, or teams running purely within the AWS-native toolchain, where CloudWatch and X-Ray already meet their needs.

Key Decisions for AWS OTel Monitoring: A Summary

For teams setting up or revisiting AWS monitoring with OpenTelemetry in 2026, five decisions matter most.

Direct collection versus CloudWatch Metric Streams: Use direct OTel receivers for services built on open-source technology. Use Metric Streams for proprietary AWS services where CloudWatch is the only available telemetry path.

ADOT versus the upstream OTel Collector Contrib: Start with ADOT for AWS-native stacks that want AWS-supported tooling. Move to upstream contrib when you need components or flexibility that ADOT does not include.

Backend selection: CloudWatch fits AWS-only stacks. The Grafana stack or commercial APM platforms fit teams that want managed analytics. Self-hosted OTel-native backends like CubeAPM or SigNoz fit teams with data control requirements or high-volume cost concerns.

Sampling strategy: Default to tail-based sampling in production. Keep 100% of error and slow traces. Sample healthy traffic at a lower rate.

Attribute standardization: Define and enforce a resource attribute schema covering service.name, deployment.environment, cloud.region, and cloud.account.id before scaling. This is the kind of decision that feels optional at the start and mandatory six months later.
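
One lightweight way to enforce such a schema is a check that runs in CI or at service startup. The sketch below is an illustration, not an OpenTelemetry API: it treats the resource as a plain dictionary, and the function name is hypothetical. The four attribute keys are the ones listed above.

```python
# Required resource attributes, taken from the list above.
REQUIRED_ATTRIBUTES = (
    "service.name",
    "deployment.environment",
    "cloud.region",
    "cloud.account.id",
)


def missing_attributes(resource: dict) -> list:
    """Return the required keys that are absent or blank in `resource`."""
    missing = []
    for key in REQUIRED_ATTRIBUTES:
        value = resource.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Hypothetical resource for a service that forgot the account ID
resource = {
    "service.name": "checkout",
    "deployment.environment": "prod",
    "cloud.region": "us-east-1",
}
print(missing_attributes(resource))
```

Failing a deployment when the returned list is non-empty is one way to keep the schema from drifting as more teams onboard.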

OpenTelemetry's position in 2026 is settled. According to the Elastic Landscape of Observability 2026 report, 89% of production OTel users consider full specification compliance at least very important when evaluating observability platforms. The question for AWS teams is no longer whether to adopt it. It is how to do so in a way that matches the architecture, the team's operational capacity, and the cost constraints in place.

_Disclaimer: This guide reflects the state of OpenTelemetry and AWS monitoring tooling as of May 2026. OpenTelemetry specifications and AWS service integrations evolve frequently. Check the official documentation for the most current details.
_
FAQs

What is the difference between AWS Distro for OpenTelemetry (ADOT) and the upstream OpenTelemetry Collector?

ADOT is an AWS-tested build of the upstream Collector with AWS-specific components pre-packaged: X-Ray exporters, CloudWatch EMF support, and IAM authentication. The upstream Collector Contrib is community-maintained and gets new components faster. Use ADOT for AWS-native stacks. Use upstream contrib when you need components ADOT has not bundled yet or when routing to non-AWS backends.

Can OpenTelemetry replace AWS CloudWatch entirely?

Not entirely. OTel replaces CloudWatch at the instrumentation and collection layer, but not for AWS-native operational signals like billing alerts, service health events, and EC2 status checks. Most production teams run a hybrid: OTel pipelines feeding a primary observability backend for investigation, with CloudWatch retained for AWS-native alarms.

Does OpenTelemetry work with AWS Lambda, and what is the cold start impact?

Yes. The OTel Lambda layer supports auto-instrumentation for Python, Java, and Node.js. Cold start overhead is typically 50 to 200ms, depending on the runtime and libraries loaded. For latency-sensitive functions, use manual SDK initialization instead. Always profile cold start impact in staging before deploying to production.

How do you monitor AWS RDS with OpenTelemetry without using CloudWatch?

Use the mysqlreceiver or postgresqlreceiver in the OTel Collector to connect directly to the RDS endpoint and pull performance schema metrics, including query latency, connection counts, and replication lag. Both the Collector and RDS instance need to be in the same VPC. Store monitoring credentials in AWS Secrets Manager. This bypasses CloudWatch streaming costs and gives higher-cardinality metrics with a shorter collection interval.

What is the best sampling strategy for OpenTelemetry on AWS at high traffic volume?

Tail-based sampling. It makes the keep-or-drop decision after a full trace is assembled, so error traces and slow traces are always retained regardless of volume. Head-based sampling decides before the request is processed and will sometimes drop traces that contain errors. Start by keeping 100% of error traces and traces exceeding your p99 threshold, then sample healthy traffic at 10 to 20%. Tail-based sampling requires all spans from a trace to arrive at the same Collector instance, so plan for a stateful Collector deployment.
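
The keep-or-drop rule described above can be sketched in a few lines. This is an illustration of the decision logic only, not the Collector's tail sampling processor: the `TraceSummary` type and `keep_trace` function are hypothetical, and the latency threshold is whatever value your team uses for p99.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Minimal view of a fully assembled trace (hypothetical type)."""
    has_error: bool
    duration_ms: float


def keep_trace(trace, slow_threshold_ms, healthy_rate=0.10, rng=random.random):
    """Decide whether to keep a trace once all of its spans have arrived."""
    if trace.has_error:
        return True  # keep 100% of error traces
    if trace.duration_ms > slow_threshold_ms:
        return True  # keep 100% of traces slower than the threshold
    # Healthy traffic: keep roughly `healthy_rate` of traces
    return rng() < healthy_rate


print(keep_trace(TraceSummary(has_error=True, duration_ms=12.0), slow_threshold_ms=800))
print(keep_trace(TraceSummary(has_error=False, duration_ms=950.0), slow_threshold_ms=800))
```

The function needs the whole trace to answer, which is why every span of a trace has to reach the same Collector instance before the decision is made.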

CubeAPM: Evaluating a New Relic Alternative for Cost, Control, and Scale

CubeAPM — Wed, 22 Apr 2026 04:37:33 +0000

_Originally published on cubeapm.com_

As organizations adopt cloud-native architectures, Kubernetes, and microservices, systems have become more distributed and generate significantly higher volumes of telemetry than traditional monitoring models were built for.

This article is written for teams running production workloads at scale, especially Kubernetes-based SaaS platforms, who are reassessing whether their current observability setup still aligns with cost predictability, governance requirements, and long-term architectural needs.

New Relic is one of the earliest full-stack observability platforms and continues to offer strong capabilities. However, as systems scale, factors such as cost behavior, data control, and flexibility begin to influence platform decisions. This article explains when and why that re-evaluation happens, and how platforms like CubeAPM fit into the next phase.

How New Relic Shaped Modern Application Observability

New Relic is an observability platform that teams use to monitor their applications and systems. You can use it to track issues such as slow APIs, errors, and poor user experience. It offers application performance monitoring (APM), log monitoring, infrastructure monitoring, digital experience monitoring, and more. It is also one of the earliest observability tools.

When cloud adoption picked up and teams started moving to microservices, many existing monitoring tools fell short. They were host-focused, fragmented, and hard to connect back to real application behavior. New Relic introduced a compelling model:

A single SaaS platform
Unified visibility across metrics, logs, traces, and application performance
Minimal operational overhead
Fast onboarding for engineering teams

This approach fundamentally changed expectations. Engineers began to expect service maps, end-to-end tracing, and correlated signals instead of stitching together multiple tools. Many observability platforms today still follow this model because it solved real problems for growing teams.

For small teams or early-stage products prioritizing speed and simplicity, this SaaS-first approach can still be a strong fit.

Understanding New Relic’s Observability Model as Systems Grow

New Relic still works well for teams that want a SaaS-first experience and fast onboarding. But as telemetry volume, service count, and team size increase, issues such as high cost become harder to ignore. These issues are not unique to New Relic.

Predicting Cost Becomes Difficult as Telemetry Scales

New Relic’s pricing is usage-based across data ingestion and user access. It may feel simple at first, but multiple cost dimensions can compound quickly as your systems grow.

Data Ingestion Cost

As systems scale:

Telemetry volume increases with traffic and service count
Spikes during incidents or deployments become more common
Monthly usage becomes harder to forecast

Every New Relic account gets 100 GB of data ingest free per month. After that, additional telemetry data is billed at $0.40 per GB under the standard data pricing tier.

There is also an optional “Data Plus” pricing tier with higher per-GB rates ($0.60/GB).

Since logs, metrics, traces, and events all count toward this usage pool, spikes in telemetry volume during incidents or deployments can push the bill up in ways that are hard to predict.
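
A rough sketch of how the ingest charge moves with volume, using the rates quoted above. The function name and the monthly volumes are hypothetical, and applying the 100 GB free allowance to the Data Plus rate is an assumption made for illustration.

```python
FREE_GB = 100          # free ingest per month, per the article
STANDARD_RATE = 0.40   # USD per GB, standard tier
DATA_PLUS_RATE = 0.60  # USD per GB, Data Plus tier


def ingest_cost(gb: float, rate: float = STANDARD_RATE, free_gb: float = FREE_GB) -> float:
    """Return the ingest charge for one month after the free allowance."""
    billable_gb = max(gb - free_gb, 0)
    return round(billable_gb * rate, 2)


# Hypothetical volumes: a normal month versus a month with a long incident
print(ingest_cost(2_000))                   # normal month, standard tier
print(ingest_cost(3_000))                   # incident month, standard tier
print(ingest_cost(3_000, DATA_PLUS_RATE))   # incident month, Data Plus
```

The rate itself is fixed; what makes the bill hard to forecast is that the volume fed into it changes with traffic, incidents, and log verbosity.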

User Access Costs

New Relic also charges for user roles:

Basic users (free)
Paid roles such as Core and Full Platform users

Core users are priced at $49 per user per month, and Full Platform users are commonly priced at $99 per user per month in standard pricing and $349 per user per month in the Pro plan.

As teams grow and more engineers need observability access, user-based pricing adds a second axis of spend that compounds alongside data ingestion.

Compute Capacity Units (CCUs) Costs

New Relic has introduced a Compute-based pricing model built around Compute Capacity Units (CCUs). Instead of charging per user, costs are tied to the compute consumed by customer-initiated queries and analysis actions, aiming to align spend with actual platform usage.

Key characteristics:

All users get full platform access with no per-user fees
Costs scale with query and analysis activity, not just data ingest
Failed requests and select internal queries are excluded from billing
Compute budgets and alerts help teams monitor CCU usage

Multiple dimensions compounding

New Relic’s billing is not limited to just telemetry volume or hosts. You pay for:

How much data you ingest above the free tier, and
How many engineers need access to the platform

A team that suddenly enables more detailed tracing, increases log verbosity, or invites more users can see costs rise from several angles at once.

Also, small changes in telemetry volume could mean larger-than-expected bills because ingest and user costs both contribute to the total spend.

Here’s what medium-sized businesses may end up paying*:

Data ingestion ($0.40 x 45,000 GB) = $18,000
Full Users (20% of all engineers) = $349 x 10 = $3,490
Observability Cost = $21,490
Observability data out cost (charged by cloud provider) (45k x $0.10/GB) = $4,500

Total Observability Cost (mid-size team) = $25,990

_*All pricing comparisons are calculated using standardized Small/Medium/Large team profiles defined in our internal benchmarking sheet, based on fixed log, metrics, trace, and retention assumptions. Actual pricing may vary by usage, region, and plan structure. Please confirm current pricing with each vendor.
_
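
The breakdown above can be reproduced directly. The sketch below uses only the figures from the article and, like the article, does not subtract the 100 GB free allowance; the function and key names are hypothetical.

```python
def mid_size_breakdown() -> dict:
    """Recompute the mid-size example using the article's own inputs."""
    ingest_gb = 45_000       # GB ingested per month
    ingest_rate = 0.40       # USD per GB (free 100 GB not subtracted, as in the article)
    full_users = 10          # 20% of engineers in the article's profile
    full_user_price = 349    # USD per Full Platform user per month (Pro plan)
    egress_rate = 0.10       # USD per GB charged by the cloud provider

    ingest = round(ingest_gb * ingest_rate, 2)
    users = float(full_users * full_user_price)
    egress = round(ingest_gb * egress_rate, 2)
    return {
        "ingest": ingest,
        "users": users,
        "observability": round(ingest + users, 2),
        "egress": egress,
        "total": round(ingest + users + egress, 2),
    }


for item, cost in mid_size_breakdown().items():
    print(f"{item}: ${cost:,.2f}")
```

Changing any one input, such as the user count or the monthly volume, shows how each pricing lever moves the total independently.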

Why forecasting cost becomes harder:

Early on, free data credits and a small number of paid users make monthly costs feel predictable.

But once telemetry volume grows beyond the free 100 GB and user counts increase, finance and engineering teams often have to track multiple pricing levers instead of a single predictable cost line.

This makes it harder to align observability value with budget commitments when usage fluctuates monthly.

Why Cost Behavior Matters More Than Price

Early on, observability costs feel predictable. Free credits and limited usage make monthly spend easy to reason about.

As telemetry scales, however, costs begin to fluctuate based on:

Traffic patterns
Incident frequency
Instrumentation depth
Team size

At this stage, teams care less about headline pricing and more about how costs behave over time: whether they grow linearly and predictably, or spike unexpectedly.

This distinction is central to why teams begin evaluating a New Relic alternative.

OpenTelemetry Support vs Vendor-Specific Workflows

An OpenTelemetry-compatible platform means it can ingest OTel data.
An OpenTelemetry-native platform means the entire observability pipeline, from data processing to querying and controls, is built around OpenTelemetry concepts.

New Relic fits the first category. That works fine for some teams, but it can create issues for teams planning for long-term portability and vendor-neutral workflows.

New Relic supports OpenTelemetry (OTel). You can send metrics, logs, and traces using OTel SDKs and the Collector. This means you are not forced to use a proprietary agent just to get data in. Teams that have standardized on OpenTelemetry can adopt New Relic easily, without agent lock-in.

Where things start to feel different is after the data lands. In daily work, most teams still rely heavily on New Relic’s own workflows. Dashboards are built using NRQL. Alerts are defined around New Relic entities. Troubleshooting usually happens inside New Relic’s UI using its views and concepts. Over time, teams become heavily reliant on them.

This creates an important distinction.

Although OpenTelemetry is there to keep you flexible at the collection layer, many operational workflows are still tied to the platform itself.
Moving raw telemetry somewhere else is possible. However, recreating years of dashboards, alerts, and runbooks is much harder.
That makes switching tools difficult.

SaaS-Only Platform

New Relic is a pure SaaS-based platform. Customers don’t have to run the infrastructure themselves. It’s easy to set up, so they can start monitoring systems right away. For many teams, this simplicity is a major advantage early on.

But when systems and requirements grow, trade-offs start becoming more visible:

Because the platform is SaaS-only, all telemetry data is stored and processed in New Relic’s managed infrastructure.
Data residency, access control, and retention are tied to platform defaults and plan tiers.
Meeting internal security, compliance, or regional data requirements can require extra coordination.
Organizations with strict security and compliance standards want more data control. So, they may start exploring options such as BYOC or self-hosted observability to regain control over data.

Complex Alerting

New Relic’s alerting is manageable when there are few services, thresholds, and alerts. But alerts are defined per metric or service, so the number of alerts grows as services are added, and so does the maintenance burden.

Threshold-based alerts struggle: New Relic supports static thresholds and baseline alerts. In auto-scaling or high-traffic environments, alerts can be noisy or missed.

Root cause is not always clear during incidents: When multiple services trigger alerts at once, engineers must manually correlate signals to find the root cause. Downstream alerts often appear alongside the real issue, slowing investigation and increasing MTTR.

When dozens or hundreds of services emit alerts at the same time, it is difficult to tell what actually caused the problem versus what is just a symptom. This is often when teams start rethinking alerting strategy. Their interest shifts toward context-driven alerting that considers errors, latency, and service relationships together, with lower noise, without hiding incidents.

Retention Policies

Retention starts to matter once your team needs to look back in time. Day-to-day alerts are useful, but audits, year-over-year analysis, and deep investigations require older data. Retention policies really show their impact here.

With New Relic, retention varies by data type and plan level, and defaults are relatively short unless you extend them:

Most core data like APM, errors, and infrastructure signals are retained for about 8 days by default under standard settings.

Browser session and many telemetry types may also follow similar default retention unless changed in the Data retention UI.

Logs and other event data typically default to 30 days unless you configure extended retention.

With the Data Plus option, teams can extend retention for many data types up to around 90 days or more. Also, you can unlock higher compliance capabilities like longer query periods and export tools.

New Relic also offers live archives for logs. It can store logs for up to 7 years but at additional cost and with separate billing for archive storage and queries.

Because retention periods are tied to pricing tiers and add-ons, teams often face choices like:

Paying more to keep data longer
Keeping only a short history accessible
Or, exporting data elsewhere for compliance

Because of this, many teams may feel retention decisions are driven mostly by cost mechanics rather than engineering or business needs. So, they may begin looking for New Relic alternatives when long-term observability becomes a priority. They want predictable access to historical data without rising vendor costs.

The Inflection Points That Trigger a New Relic Re-Evaluation

Teams rarely replace observability platforms overnight. Re-evaluation usually happens when several signals appear together:

Observability spend starts requiring finance approval: What was once an engineering expense becomes a line item finance wants to forecast. Usage-based ingest, users, and feature tiers start moving together, and monthly costs become harder to explain or predict.
Kubernetes and service growth: Telemetry volume rises as Kubernetes clusters and services multiply, and dashboards and alerts become slower.
Alert noise: More signals don’t always mean more clarity. Teams notice they are spending more time triaging notifications than fixing the underlying issue, which directly affects MTTR.
Retention and audits: You may need data from months back for reviewing incidents, maintaining compliance, and analysing trends. When retention is based on the plan or available as an add-on, it can add to the cost.
Coupling: Over time, you may feel your dashboards, alerts, and queries have become highly reliant on the platform. When observability tooling starts influencing how services are designed or instrumented, teams pause and reassess long-term flexibility.

These inflection points appear gradually. When several show up together, teams start asking whether the platform still fits where their system is headed.

Why Teams Consider CubeAPM as a New Relic Alternative

CubeAPM is built on the belief that observability should be owned and controlled like your infrastructure, not metered like a SaaS bill. Here are reasons why many teams consider CubeAPM as a New Relic alternative.

Key differences include:

OTel-native architecture: Teams want to instrument once and keep their options open. With tools that support OpenTelemetry, telemetry is portable across tools. This helps reduce vendor lock-ins.
Predictable pricing: Modern teams care less about the lowest starting price and more about how costs behave over time. Pricing models that scale linearly with usage and avoid compounding dimensions are easier to plan and defend.
Unified MELT: Metrics, events, logs, and traces need to work together. Teams want to quickly get to the root cause without referring to multiple tools or correlating data manually to understand the context.
Self-hosted/BYOC deployment: Runs in customer-controlled infrastructure, offering control over data location, access, and retention.
Unlimited retention: Retains logs, metrics, traces, and events without tier-based limits.
Context-aware, Smart Sampling: CubeAPM offers context-based Smart Sampling that preserves important signals, such as errors and suspicious requests, and drops routine data to reduce noise and storage overhead.
Access to human support: Direct access to engineers via Slack or WhatsApp during critical situations.
Flexible setups: Teams want multiple setup options. Some teams may prefer SaaS for convenience. Others may need BYOC or self-hosted tools for compliance.
Data control: Observability data can be sensitive. Teams want complete clarity over data storage location, who can access the data, and the retention period.
Developer-friendly: Engineers want fast and reliable data for investigating incidents. They value tools that offer deep context with less noise.
800+ integrations: CubeAPM integrates with many services, languages, frameworks, and data sources. It can easily fit into your current tech stacks.
Zero egress costs: Since data stays within the customer’s cloud or on-prem infrastructure, there are no additional egress charges for moving data out.

If you want to explore feature-wise differences, check out our CubeAPM vs New Relic page.

Migration Strategies for Teams Evolving Beyond Traditional SaaS APM

Teams rarely migrate in one big step. Most successful transitions focus on reducing risk and evolving architecture gradually.

Incremental OTel adoption: Teams usually start by instrumenting new services with OpenTelemetry while leaving existing services unchanged. Teams can standardize instrumentation without full rewrites.
Parallel observability: Teams run multiple observability pipelines side by side. This makes it possible to compare data, validate coverage, and build confidence before committing fully.
Agent reuse: Many teams reuse their current agents or collectors when migrating to another tool. This reduces operational overhead and doesn’t disrupt production systems.
Moving alerts and workflows: Teams can move their alerts, dashboards, and runbooks gradually, instead of all at once. This way, you can keep your vital alerts intact and also keep testing and refining new workflows.

Case Study: Why a SaaS Team Began Evaluating a New Relic Alternative at Scale

Problem

A mid-scale B2B SaaS company running a cloud-native platform adopted New Relic early because it was quick to set up and gave unified visibility across the stack. That worked well in the beginning. Over time, the platform grew to hundreds of services running on Kubernetes and handling high-throughput production workloads. Telemetry volume rose steadily, costs became harder to forecast month to month, and alert noise increased as more services and dependencies were added.

Although New Relic supported OpenTelemetry ingestion, most day-to-day work still depended on New Relic-specific agents, dashboards, and alert definitions. This made it harder for the team to reason about long-term flexibility. At the same time, new enterprise customer requirements introduced stricter expectations around data retention, auditability, and data residency, areas where a SaaS-only deployment offered limited control.

Solution

Instead of replacing New Relic overnight, the team paused and treated observability as an architectural problem. OpenTelemetry became the default for all new services so instrumentation stayed vendor-neutral. They ran parallel observability pipelines in production, using New Relic alongside OpenTelemetry-native platforms. The evaluation focused on practical concerns: how predictable costs were at scale, how retention could be controlled, how flexible deployment options were, and whether alerting and correlation actually helped engineers reach root cause faster.

Results

Over time, more core services moved toward an infrastructure-aligned observability model with less dependence on platform-specific workflows. Observability costs became easier to reason about, alert fatigue dropped, and governance requirements were simpler to meet. The outcome was not about abandoning New Relic, but about aligning observability with long-term scale, control, and sustainability as the system continued to grow.

Conclusion

Observability is entering a more mature phase, where the focus is shifting from basic visibility to long-term sustainability. As systems scale, cost predictability, portability, and control over telemetry data are becoming first-class requirements rather than secondary concerns.

New Relic helped define modern, SaaS-driven observability and set expectations for what full-stack visibility should look like. Platforms like CubeAPM reflect how the next phase is being shaped, with greater emphasis on OpenTelemetry, infrastructure ownership, and observability that scales predictably with the business.

Book a demo to explore CubeAPM.

_Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve.
_

_Editorial Note: This analysis is based on CubeAPM’s experience working with SaaS and enterprise teams running large-scale, Kubernetes-based production systems and evaluating observability architectures over time.
_

FAQs

1. Is CubeAPM a New Relic alternative?

Yes. CubeAPM is commonly evaluated as an alternative when teams want more control over deployment, data retention, and pricing behavior. While both platforms provide full-stack observability, CubeAPM takes a self-hosted, OpenTelemetry-native approach, whereas New Relic operates as a SaaS platform.

2. Does CubeAPM support OpenTelemetry?

Yes. CubeAPM is OpenTelemetry-native and supports ingesting traces, metrics, and logs directly from OpenTelemetry SDKs and collectors. This makes it easier for teams standardizing on OpenTelemetry to avoid vendor-specific instrumentation.

3. Is CubeAPM self-hosted or SaaS?

CubeAPM is self-hosted or deployed in a customer-controlled VPC. The platform runs inside your own cloud environment, giving teams control over data location, access, and retention, while still being vendor-managed from an operational perspective.

4. How do teams migrate from New Relic?

Most teams migrate incrementally. They keep existing instrumentation running while introducing OpenTelemetry collectors or compatible agents that send data to CubeAPM in parallel. This allows validation of dashboards, traces, and alerts before fully switching traffic.

5. How does observability pricing differ?

Pricing models differ in how costs scale. SaaS platforms typically combine usage-based ingest with user or feature-based charges. CubeAPM uses predictable, ingestion-based pricing with unlimited retention, which simplifies forecasting as telemetry volume grows.

6. Does CubeAPM replace New Relic, or can they run together?

They can run together during evaluation or migration. Some teams use both temporarily to compare data quality and cost behavior. Over time, teams usually consolidate once they are confident CubeAPM meets their observability and operational requirements.

CubeAPM: A Modern Datadog Alternative for Cost, Control, and Scalable Observability

CubeAPM — Wed, 11 Mar 2026 18:51:44 +0000

_Disclaimer: This post was originally published on cubeapm.com
_
Organizations globally are quickly adopting cloud, microservices, and distributed systems. All of this generates huge data volumes, and monitoring it can be difficult for teams using legacy monitoring tools.

Some common issues are related to high cost, data residency, management complexity, and vendor lock-in. For example, Datadog is a leader in observability, but it might not suit all teams because its pricing is high and it is offered almost entirely as SaaS, with very limited self-hosting.

To make observability affordable and flexible, teams may start looking for Datadog alternatives, such as CubeAPM. Let’s understand this in detail.

How Datadog Became the Industry Default for Observability

Datadog is an observability platform that teams can use to monitor their applications’ health and performance. It brings together metrics, logs, traces, application performance monitoring (APM), real user monitoring (RUM), and synthetics in a single, SaaS-based product. For many teams, Datadog became the first tool they turned to when production systems started to grow beyond what traditional monitoring could handle.

When teams moved to the cloud and started using microservices, their systems became more complex and visibility was reduced. Teams needed a way to understand what was happening across services without stitching together multiple tools or managing their own monitoring infrastructure. Datadog offered a platform that made it easier for companies to visualize, monitor, and track those systems.

Over time, Datadog helped define what people now think of as full-stack observability. Teams want to be able to correlate their logs, metrics, and traces in one place. Engineers learned to move from a dashboard to the exact problem without switching tools. That set the standard for observability platforms. Datadog is still widely used because it solves real problems. Its 900+ integrations reduce friction during adoption, and many teams value its fast time-to-value.

Limitations of Datadog’s Current Observability Model

Datadog is still a capable and heavily used observability platform. However, it has certain limitations that you may start noticing once your application architecture's scale and observability requirements grow. That said, these challenges are not unique to Datadog. They reflect broader patterns seen across first-generation, SaaS-based observability platforms.

Datadog’s Cost Predictability Breaks Down at Scale

Datadog’s usage-based pricing can feel manageable at smaller scales. Teams pay per host, per gigabyte, or per event, and monthly costs appear predictable. The challenge emerges as systems grow and observability data starts scaling across multiple pricing dimensions. Using the cost model from the pricing sheet, the issue becomes clear.

APM: Datadog charges per host per month for APM. For a midsize business, Datadog’s APM Pro is billed at $35 per host, along with additional charges for profiled hosts ($40 per host) and profiled container hosts ($2 per host). Also, Datadog charges for indexed spans after the first included million. Indexing 500 million spans per month at $1.70 per million spans becomes a significant part of the bill. As traffic grows, span volume increases faster and cost estimation becomes difficult.
Infrastructure monitoring: Datadog charges $15 per host per month for infra monitoring, and container usage is $0.002 per-container-hour (after 5 containers/host). There’s a separate charge for custom metrics ($0.01 per metric after 100/host).
Log management: Logs are billed twice: once for ingestion and again for indexing. In the midsize scenario, ingesting logs per month at $0.10 per GB is only part of the cost. Indexing 3.5 billion log events at $1.70 per million events adds a separate, independent charge.

When you combine these components, the total observability cost grows across hosts, containers, spans, metrics, logs, and events, each with its own pricing unit. Here’s what the cost looks like for medium-sized businesses:

*All pricing comparisons are calculated using standardized Small/Medium/Large team profiles defined in our internal benchmarking sheet, based on fixed log, metrics, trace, and retention assumptions. Actual pricing may vary by usage, region, and plan structure. Please confirm current pricing with each vendor.

The key issue is that cost is distributed across many independent dimensions. A new service, additional trace sampling, higher log indexing, or increased metric cardinality can each increase the bill in ways that are difficult to anticipate upfront.
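
To make the independent dimensions concrete, the sketch below prices a few of them separately. The per-unit rates and the span and log-event volumes are the ones quoted above. The host count of 50 is a hypothetical value chosen for illustration, the included span allowance is ignored, and the result is not the benchmark total from the pricing sheet.

```python
def datadog_line_items(hosts: int, indexed_spans_millions: float,
                       indexed_log_events_millions: float) -> dict:
    """Price four billing dimensions separately, using the rates quoted above."""
    return {
        "apm_hosts": round(hosts * 35.0, 2),                             # $35 per APM host
        "infra_hosts": round(hosts * 15.0, 2),                           # $15 per infra host
        "indexed_spans": round(indexed_spans_millions * 1.70, 2),        # $1.70 per million spans
        "indexed_log_events": round(indexed_log_events_millions * 1.70, 2),  # $1.70 per million events
    }


# 500 million spans and 3.5 billion log events come from the article;
# 50 hosts is a hypothetical figure.
items = datadog_line_items(hosts=50, indexed_spans_millions=500,
                           indexed_log_events_millions=3_500)
for name, cost in items.items():
    print(f"{name}: ${cost:,.2f}")
```

Each line item has its own driver: host count, trace volume, or log volume. A change in any one of them moves the bill without touching the others, which is what makes the combined total hard to anticipate.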

Platform Coupling and Long-Term Lock-In

Datadog supports modern instrumentation standards like OpenTelemetry. Teams can send OpenTelemetry metrics, traces, and logs using OTLP, either through the Datadog Agent or via the OpenTelemetry Collector with a Datadog exporter.

In day-to-day use, most teams still depend on the Datadog Agent and Datadog-specific workflows. The Agent enriches telemetry with host metadata, applies tagging conventions, and connects directly to dashboards, monitors, and alerts.

Teams may also develop Datadog-specific habits and knowledge. Moving away may require rebuilding dashboards, retraining the team, and rewriting alert logic.
Datadog’s OpenTelemetry support is real and widely used, but it functions as a supported ingestion option rather than the platform’s native control plane.
Many advanced features work best when data flows through Datadog’s own collection and enrichment paths.

So, teams may start using a hybrid setup: OpenTelemetry instrumentation with workflows specific to Datadog.

Mostly a SaaS Platform

Datadog is mostly available as a managed SaaS platform, which makes it easier to adopt. It stores and processes telemetry data in Datadog-managed systems. For many teams, this works well at first. But when priorities change, or when observability data directly influences reliability, security, and cost, teams want to know where that data lives and who controls it. Once observability data leaves an organization’s infrastructure, meeting data residency and access-control requirements becomes harder.

At present, Datadog’s on-premises initiatives, such as CloudPrem, are limited in scope. They currently focus on specific use cases, such as self-hosted log management, rather than a fully self-hosted deployment of the entire platform. Core backend services, including metrics, traces, analytics, and the unified UI, are still SaaS-based and hosted on Datadog’s cloud infrastructure.

Limited Retention

Datadog applies different retention periods depending on the type of telemetry data you collect. Longer retention requires extra cost or configuration. Based on Datadog’s official documentation, the retention periods are:

Metrics: Stores metrics for up to 15 months.
Traces (APM): By default, it stores indexed spans for 15 days. Unindexed traces are available only in short-lived live views before they expire.
Real User Monitoring (RUM): Retains user sessions, actions, and errors for 30 days.

Default retention periods are useful for recent troubleshooting and visibility, but they can feel limiting when you need to look further back. Once the retention period expires, you can no longer see patterns in performance regressions, reliability, or usage.

As systems mature, historical observability data becomes more valuable for capacity planning, long-term reliability analysis, and post-incident reviews. When retention is limited, teams must either export and archive data elsewhere or accept gaps in historical visibility.

Noisy Alerts at Scale

Datadog offers a flexible alerting system via monitors. It can trigger alerts based on metrics, logs, traces, and composite conditions. The number of monitors grows as your infrastructure grows, and even a small change in traffic patterns or thresholds can trigger a flood of alerts. Many of these require no action, and the sheer volume leads to alert fatigue.

When engineers receive too many low-value alerts, they might miss important signals or lose trust in the alerting system. Datadog provides tools for grouping, muting, and managing alerts, but you still need to tune and control alert noise continuously.

Support Response Times Vary

Based on Datadog’s official support documentation, there are multiple support tiers and response times:

Standard Support: The support is via email and chat. For critical issues, response time is within 2 hours (24x7). For lower priority cases, it’s 12-48 hours (during business hours).
Premier Support: It’s a paid add-on and costs 8% of your monthly Datadog spend ($2,000 minimum). It includes 24x7 support via email, phone, and chat. It offers faster response targets: 30 minutes for critical and 4-12 hrs for lower priority issues.
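
The Premier Support fee quoted above works out to a percentage of spend with a floor. A minimal sketch of that arithmetic, using only the 8% rate and $2,000 minimum stated here (the function name and the sample spend values are illustrative assumptions):

```python
def premier_support_cost(monthly_spend, rate=0.08, minimum=2_000.0):
    """Premier Support fee: a percentage of monthly spend, with a floor."""
    return max(monthly_spend * rate, minimum)


# Hypothetical monthly spend levels, for illustration only.
for spend in (10_000, 25_000, 100_000):
    print(f"${spend:,} spend -> ${premier_support_cost(spend):,.2f} support fee")
```

With these numbers, the $2,000 minimum applies until monthly spend reaches $25,000; above that, the fee scales with the bill.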

So, if you’re on the Standard tier, you may face slower responses for non-critical issues outside business hours. Because incidents can occur at any time, most teams want faster, more responsive support than that.

What Do Modern Observability Teams Require Today?

Observability influences cost, reliability, security, and even how teams work. So, modern teams have become more conscious about what they want from an observability platform.

OpenTelemetry support: Teams want vendor-neutral instrumentation. They want to be able to easily adjust pipelines, backends, or architectures without rewriting code. OTel-first tools also make it easier to control sampling, filtering, and routing data before it reaches the backend of the observability tool.

Predictable pricing: Teams want pricing models that they can control. Cost should not spike unexpectedly because of hidden costs, add-ons, or aggressive log indexing and trace volume changes.
MELT correlation: Metrics, events, logs, and traces should work together effortlessly. Engineers want to quickly move from a symptom to a root cause, without jumping between tools or losing context.
Self-hosting: Organizations want control over data egress, retention, and residency. Options such as self-hosting or BYOC deployments help here. This matters for teams of any size that need to meet security or compliance requirements.
Lower alert noise: Low-noise alerting, clear signals, and fast root-cause analysis are important. So, the goal is fewer alerts that actually mean something, and workflows that help engineers fix issues quickly instead of adding operational friction.

In our experience working with teams running high-throughput, distributed systems, observability failures rarely happen because a tool is missing features. They happen because cost controls, sampling decisions, or data movement constraints were not considered early. These issues typically surface only after scale, during incidents, or when finance starts questioning observability spend.

How CubeAPM Fits as a Modern Datadog Alternative

CubeAPM addresses the difficulties modern teams face with data control, flexibility, and cost predictability as their requirements grow. Here is what makes CubeAPM an excellent Datadog alternative:

Native OpenTelemetry: CubeAPM supports OpenTelemetry natively, not just as a supporting feature. It’s the primary way to collect and process telemetry. Teams can standardize on one instrumentation method across services while retaining the freedom to add data pipelines or backends over time.
Predictable pricing: CubeAPM uses predictable, transparent pricing of $0.15/GB. There is no per-host, per-user, or per-feature pricing. CubeAPM also doesn’t charge separately for data retention, support, or add-ons. This way, you can save up to 60% in observability costs.

Smart, context-based sampling: CubeAPM uses smart sampling that preserves important context while reducing unnecessary data volume. It also compresses data by 95%, so teams can save big on data storage costs. Still, they don’t lose visibility into important data useful for debugging and root-cause analysis (RCA).
Unlimited data retention: CubeAPM supports unlimited retention, so teams can keep historical observability data for as long as they need. This is useful for analyzing long-term trends, incidents, and capacity planning.
Zero cloud egress costs: Because CubeAPM can run inside customer-managed environments, teams avoid unexpected cloud egress fees when moving observability data out of their infrastructure. Many enterprises pay cloud providers around $0.10/GB, or 20-30% of the total observability bill, in data transfer out charges.
Faster turnaround times: CubeAPM offers support via Slack or WhatsApp with faster turnaround times (in minutes) involving core developers. For teams operating critical systems, timely support is a must, particularly during incidents.
800+ integrations: CubeAPM integrates with many cloud services, frameworks, and tools. This is helpful for teams and also saves them time on custom integrations or pipelines.
Self-hosted (BYOC/on-prem): CubeAPM supports deployment models where teams can run the platform in their own cloud (BYOC) or on-premises infrastructure. This way, organizations can control data residency for security and compliance purposes.
Vendor-managed: CubeAPM offers self-hosting with vendor-managed services. This eases operations for engineering teams. They can control data without running or maintaining the infrastructure.
Quick deployment: CubeAPM is easy to deploy for all teams, and its support team is available 24x7 to help you. Teams can quickly get started with OpenTelemetry-native pipelines and simpler cost controls. This means you don't have to spend weeks tuning agents or setting ingestion rules.

Taken together, CubeAPM fits teams that want solid observability with better visibility, cost, and control. It helps build a foundation that remains usable as systems and organizations mature.

Migration Strategies for Modern Teams

Moving observability from one platform to another may seem difficult. But CubeAPM is built on modern standards like OpenTelemetry, which makes migrating from Datadog to CubeAPM easier. It is usually a gradual, controlled process where both systems run alongside each other until you validate everything and are ready to switch. Here is a practical way to approach it:

Reuse existing instrumentation where possible: CubeAPM can receive data directly from applications already instrumented with Datadog agents or other common agents. This means you often don’t have to switch instrumentation immediately.
Run in parallel with your current setup: During migration, you send telemetry to both Datadog and CubeAPM at the same time. This lets you compare views and alerts side by side without breaking anything in production.
Onboard services gradually: Try not to switch all applications at once. Migrate one service at a time, starting with a low-risk service. Next, verify dashboards and alerts in CubeAPM. Then onboard more services, which keeps the risk of each step low.
Rebuild important dashboards: Not all dashboards or alerts are important. Rebuild only the important views to have cleaner, more focused dashboards and alerting logic.
Keep learning: Let telemetry flow to both systems. You can explore CubeAPM at your own pace. Over time, usage shifts naturally as you become more comfortable with the new tool.
Future visibility: Many teams assume migration is about exporting historical data, but it is really about forward visibility. Trying to move years of old data usually adds cost and complexity without much benefit. So, keep older data where it is and let CubeAPM start collecting fresh telemetry.
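
The "run in parallel" step can be sketched as an OpenTelemetry Collector configuration that fans each signal out to two exporters. This is only one possible setup, assuming services already send OTLP to a Collector from the contrib distribution (which ships the `datadog` exporter). The helper name, the `otlphttp/second` label, and the endpoint URL are hypothetical placeholders, not CubeAPM's documented values.

```python
import json


def dual_export_config(second_backend_endpoint):
    """Build a Collector config that fans each signal out to two backends."""
    pipeline = {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["datadog", "otlphttp/second"],
    }
    return {
        "receivers": {
            "otlp": {
                "protocols": {
                    "grpc": {"endpoint": "0.0.0.0:4317"},
                    "http": {"endpoint": "0.0.0.0:4318"},
                }
            }
        },
        "processors": {"batch": {}},
        "exporters": {
            # Datadog exporter; the API key is read from the environment.
            "datadog": {"api": {"key": "${env:DD_API_KEY}"}},
            # Generic OTLP/HTTP exporter; the endpoint is a placeholder.
            "otlphttp/second": {"endpoint": second_backend_endpoint},
        },
        "service": {
            "pipelines": {
                signal: dict(pipeline)
                for signal in ("traces", "metrics", "logs")
            }
        },
    }


# Hypothetical endpoint, for illustration only.
config = dual_export_config("http://cubeapm.example.internal:4318")
print(json.dumps(config, indent=2))
```

The output is JSON, which YAML parsers generally accept, but verify it against your Collector version before relying on it. Removing `datadog` from the exporter list later is then a configuration change rather than a code change.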

This is how you can make your migration smooth rather than a risky overhaul.

Case Study: A $65M Bill and What It Tells Us About Datadog’s Observability Costs at Scale

In 2023, Datadog’s earnings call included a remark that grabbed attention in the tech community. The company reported a large upfront bill, which analysts estimated to be around US$65 million for a single customer in one year. Many analysts connected the case to Coinbase, a cryptocurrency platform.

What Were the Reasons Behind the Bill?

High growth: In 2021, Coinbase’s usage grew sharply around its IPO. It focused more on reliability and visibility than on optimizing cost.
High data usage and ingestion: Many observability platforms, including Datadog, charge based on hosts, custom metrics, indexed logs, and trace volume. When usage increases, the bill can grow faster.
Costly at scale: Public conversations and threads noted that bills of tens of thousands of dollars per month can feel manageable until they become hundreds of thousands or millions. This affects budgeting and forecasting for engineering and finance teams.

What Did Teams Learn from It?

Cost predictability matters: Teams want to predict observability cost accurately rather than be surprised by enormous bills at year-end. In environments where telemetry grows fast, knowing how cost scales can influence architectural decisions.
Instrumentation influences cost: Pushing very high volumes of logs, traces, or high-cardinality metrics without sampling, filtering, or data governance can accelerate cost growth.
Teams consider alternatives when the spend crosses limits: Once costs exceed certain levels (often discussed by engineers when they hit $2–5 million per year), teams start looking for better alternatives.
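
To make the sampling point concrete, here is a minimal sketch of a tail-style sampling policy: always keep traces with errors or high latency, and keep only a fixed fraction of the rest. It is a generic illustration, not the algorithm used by Datadog or CubeAPM, and the function name, thresholds, and synthetic traces are assumptions.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms,
               baseline_ratio=0.10, slow_threshold_ms=1_000):
    """Decide whether to keep a finished trace.

    Errors and slow traces are always kept; the rest are kept at
    baseline_ratio, decided deterministically from the trace ID.
    """
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < baseline_ratio


# Synthetic traffic: every 50th trace has an error, all are fast.
traces = [(f"trace-{i}", i % 50 == 0, 120) for i in range(10_000)]
kept = sum(keep_trace(tid, err, ms) for tid, err, ms in traces)
print(f"kept {kept} of {len(traces)} traces")
```

Hashing the trace ID makes the decision repeatable, so every component that sees the same trace reaches the same verdict. The baseline ratio becomes the main lever on stored volume, while error and slow traces remain available for debugging.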

Conclusion

As your systems and telemetry grow, the observability tool must offer clear insight into what’s happening, why it’s happening, and what it costs, without surprises.

Modern teams need an observability platform with predictable cost, control over data and deployment, and no vendor lock-in. CubeAPM supports deep visibility with native OTel support, affordable cost, smart sampling, self-hosting, unlimited data retention, and fast, responsive support.

Book a demo with CubeAPM to experience this in real-time.

_Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve._

FAQs

What are the best Datadog alternatives for modern observability?

The best Datadog alternatives depend on what you are optimizing for. Teams often look for platforms that offer OpenTelemetry-first instrumentation, predictable pricing, and flexible deployment options such as self-hosted or BYOC. Some alternatives focus on enterprise automation, others on open standards and cost control. The right choice depends on scale, compliance needs, and how much control teams want over telemetry data.

Why do teams look for alternatives to Datadog?

Teams typically explore Datadog alternatives due to cost predictability challenges at scale, concerns around vendor lock-in, and the desire for more control over telemetry pipelines. As environments grow, usage-based pricing and platform coupling can make forecasting and long-term planning harder, especially for organizations with strict governance or data residency requirements.

Are Datadog alternatives compatible with OpenTelemetry?

Most modern Datadog alternatives are built to support OpenTelemetry, either as a first-class design principle or as a core ingestion standard. This allows teams to use vendor-neutral instrumentation and reduces dependency on proprietary SDKs. OpenTelemetry compatibility has become a baseline requirement when evaluating observability platforms.

Can Datadog alternatives support large-scale or enterprise environments?

Yes. Many Datadog alternatives are designed specifically for large-scale, cloud-native, and enterprise environments. These platforms often emphasize unified metrics, logs, and traces correlation, cost governance features, and flexible deployment models to support growth without introducing operational or financial surprises.

How should teams evaluate Datadog alternatives?

Teams should evaluate Datadog alternatives based on pricing transparency, OpenTelemetry support, deployment flexibility, data control, and ease of migration. It is also important to consider how well the platform supports a long-term observability strategy, including cost management, portability, and alignment with evolving cloud architectures.