<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SigNoz</title>
    <description>The latest articles on DEV Community by SigNoz (@signoz).</description>
    <link>https://dev.to/signoz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3797%2F89dd97b3-3c46-4ea7-a3f8-b8933eaeee66.png</url>
      <title>DEV Community: SigNoz</title>
      <link>https://dev.to/signoz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/signoz"/>
    <language>en</language>
    <item>
      <title>AWS CloudWatch vs Azure Monitor: Features, Costs, and Best Fit</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Mon, 23 Feb 2026 08:51:56 +0000</pubDate>
      <link>https://dev.to/signoz/aws-cloudwatch-vs-azure-monitor-features-costs-and-best-fit-28nf</link>
      <guid>https://dev.to/signoz/aws-cloudwatch-vs-azure-monitor-features-costs-and-best-fit-28nf</guid>
      <description>&lt;p&gt;AWS CloudWatch and Azure Monitor are the default observability tools for their respective clouds. They cover metrics, logs, traces, alerting, and dashboards, and work well when your workloads live entirely within one provider. The real differences surface when you look at how each tool handles investigation workflows, cost behavior at scale, alerting hygiene, and portability to other environments.&lt;/p&gt;

&lt;p&gt;This comparison evaluates CloudWatch and Azure Monitor across six aspects that matter most during actual operations: ecosystem fit, telemetry coverage, query and investigation experience, alerting, OpenTelemetry support, and cost governance. The goal is to help you decide which tool fits your cloud footprint and operating model, not to declare a universal winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fit and Ecosystem
&lt;/h2&gt;

&lt;p&gt;The strongest argument for either tool is the same: each works best inside its own cloud. The question is how deep that integration goes, and where it stops.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch
&lt;/h3&gt;

&lt;p&gt;CloudWatch is built into the AWS control plane, and most AWS services (EC2, Lambda, RDS, ECS, S3, and others) publish metrics to CloudWatch automatically. You do not need to install agents or configure exporters to get baseline infrastructure visibility for managed services.&lt;/p&gt;

&lt;p&gt;The integrations go beyond metrics collection. CloudWatch alarms can trigger Auto Scaling policies, Lambda functions, EC2 actions, and SNS notifications directly. EventBridge routes CloudWatch alarm state changes into broader event-driven workflows. IAM controls who can view metrics, query logs, and modify alarms. CloudWatch supports cross-account observability (commonly used with AWS Organizations, but also available for individually linked accounts) so a monitoring account can view metrics, logs, and traces from linked source accounts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqqbyfnpvyag8j8zro0h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqqbyfnpvyag8j8zro0h.webp" alt="CloudWatch All Metrics page showing available AWS service namespaces" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch automatically collects metrics from AWS services. Each service appears as a namespace in the All Metrics view.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch is weaker outside its own ecosystem: it is not designed as a cloud-neutral control plane. If you run workloads across AWS and another provider, CloudWatch can ingest custom metrics from external sources, but you lose the zero-setup automatic collection that AWS services get. External metrics require you to publish data through the CloudWatch API or agent, configure custom namespaces, and manage the ingestion pipeline yourself. There are no pre-built dashboards, no automatic alarm recommendations, and no native service map integration for non-AWS resources. Teams with genuine multi-cloud footprints typically need an additional layer to unify their monitoring workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Monitor
&lt;/h3&gt;

&lt;p&gt;Azure Monitor is the observability platform for Azure resources, applications, and hybrid environments. It integrates tightly with the Azure resource model. Platform logs from Azure services are routed via diagnostic settings, while agent-based collection (for example, VM guest telemetry via Azure Monitor Agent) is governed by Data Collection Rules (DCRs).&lt;/p&gt;

&lt;p&gt;Azure Monitor respects Azure RBAC (Role-Based Access Control), subscription boundaries, and resource group structures. This makes it easier to enforce access policies, scope monitoring to specific teams, and align observability with existing Azure governance hierarchies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2cz6wmibot5yjatcyis.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2cz6wmibot5yjatcyis.webp" alt="Azure Monitor portal overview showing monitoring capabilities" width="800" height="526"&gt;&lt;/a&gt;&lt;em&gt;Azure Monitor surfaces observability data across Azure resources, with built-in governance alignment through RBAC and resource groups. (Source: Microsoft Azure Docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Azure Monitor can also monitor on-premises and other cloud workloads. In practice, the experience is deepest for Azure-native resources. Non-Azure workload monitoring is possible through agents, OpenTelemetry, and custom ingestion paths, but the setup effort and data fidelity vary depending on what you are onboarding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Telemetry and APM Coverage
&lt;/h2&gt;

&lt;p&gt;Both tools cover metrics, logs, and application traces. The differences are in how each organizes telemetry collection, what ships by default, and how much setup work remains after initial onboarding.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Metrics:&lt;/strong&gt; AWS services publish metrics to CloudWatch automatically under service-specific namespaces (AWS/EC2, AWS/Lambda, AWS/RDS, etc.). Custom metrics require explicit publishing through the CloudWatch API or the CloudWatch agent. Metric retention follows a tiered model by granularity: 1-minute data points are retained for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (about 15 months).&lt;/p&gt;
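&lt;p&gt;As a hedged sketch of what "explicit publishing" involves, the snippet below builds the kind of payload CloudWatch's PutMetricData API expects; in practice the dict would be handed to an SDK call such as boto3's put_metric_data. The namespace, metric name, and dimensions are illustrative, not taken from this article.&lt;/p&gt;

```python
# Sketch: assemble a PutMetricData-style payload for a custom metric.
# Names below (namespace, metric, dimensions) are illustrative only;
# in real use, pass this payload to boto3's cloudwatch.put_metric_data.

def build_metric_payload(namespace, name, value, unit, dimensions):
    """Build a custom-metric payload in CloudWatch's expected shape."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Value": value,
                "Unit": unit,
                "Dimensions": [
                    {"Name": k, "Value": v} for k, v in dimensions.items()
                ],
            }
        ],
    }

payload = build_metric_payload(
    "MyApp/Checkout",            # custom namespace (hypothetical)
    "OrderLatencyMs",
    182.0,
    "Milliseconds",
    {"Service": "checkout", "Region": "us-east-1"},
)
```

Every unique combination of dimension values counts as a separate billable metric, which is why cardinality discipline matters later in the cost section.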

&lt;p&gt;&lt;strong&gt;Logs:&lt;/strong&gt; CloudWatch Logs collects log data through log groups and log streams. Lambda function logs flow automatically. For EC2 instances and on-premises servers, you install the CloudWatch agent and configure which log files to ship. Log data can be routed via subscription filters to Firehose, Lambda, Kinesis Data Streams, or OpenSearch Service (and from there to S3 through Firehose or Lambda).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg49uhe3g51h1f3ul0yic.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg49uhe3g51h1f3ul0yic.webp" alt="CloudWatch log groups showing Lambda and EC2 log streams" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Logs organizes data into log groups per service and log streams per instance or container.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracing and APM:&lt;/strong&gt; CloudWatch Application Signals and AWS X-Ray provide application-level observability. Application Signals, a newer addition, surfaces service-level health dashboards built on trace and metric data. X-Ray collects distributed traces across AWS services. Both require instrumentation, either through the AWS Distro for OpenTelemetry (ADOT) or AWS-specific SDKs. The APM depth is solid for AWS-native workloads, but the quality of your trace data depends on instrumentation completeness. Note that CloudWatch handles operational monitoring (performance, errors, resource health), while AWS CloudTrail handles audit logging (who did what, when). If you are evaluating both, see &lt;a href="https://signoz.io/comparisons/cloudwatch-vs-cloudtrail/" rel="noopener noreferrer"&gt;CloudWatch vs CloudTrail&lt;/a&gt; for a detailed breakdown of where each fits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Monitor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Metrics:&lt;/strong&gt; Azure Monitor Metrics is a time-series database for Azure platform metrics and custom metrics. Platform metrics from Azure services (VMs, App Services, SQL Database, etc.) are collected automatically. Retention for platform metrics is 93 days. Custom metrics have a separate retention behavior and can be sent via the Azure Monitor ingestion API or OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs:&lt;/strong&gt; Azure Monitor Logs, powered by Log Analytics workspaces, handles diagnostic logs, activity logs, and custom log data. You configure what data to collect and where to send it through Data Collection Rules (DCRs), which give you granular control over ingestion. DCRs matter for both observability quality and cost control, since they determine what gets ingested into your workspace.&lt;/p&gt;
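&lt;p&gt;As a sketch of how that filtering looks, the DCR fragment below drops debug-severity syslog records before they reach the workspace. The destination name is a placeholder and this is only the dataFlows portion of a rule, not a complete DCR:&lt;/p&gt;

```json
{
  "dataFlows": [
    {
      "streams": ["Microsoft-Syslog"],
      "destinations": ["centralWorkspace"],
      "transformKql": "source | where SeverityLevel != 'debug'"
    }
  ]
}
```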

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zjbvic7fc5x3isjskrx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zjbvic7fc5x3isjskrx.webp" alt="Azure Monitor Metrics Explorer showing metric charts" width="800" height="429"&gt;&lt;/a&gt;&lt;em&gt;Azure Monitor Metrics Explorer lets you chart platform and custom metrics across Azure resources. (Source: Microsoft Azure Docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracing and APM:&lt;/strong&gt; Application Insights, part of Azure Monitor, handles request telemetry, dependency tracking, failure analysis, and performance monitoring. You can instrument applications using the Azure Monitor OpenTelemetry Distro (supporting .NET, Java, Node.js, and Python) or the classic Application Insights SDKs. Application Insights gives you an application-centric view with service maps, transaction search, and failure diagnostics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query and Investigation Experience
&lt;/h2&gt;

&lt;p&gt;During an incident, the speed at which you can move from an alert to a root cause depends on your query tools and how smoothly you can correlate across signals. CloudWatch and Azure Monitor take different approaches here.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch
&lt;/h3&gt;

&lt;p&gt;CloudWatch provides two distinct query interfaces for different data types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Logs Insights&lt;/strong&gt; is the primary tool for log investigation. It supports filtering, parsing, aggregation, and visualization of log data. You write queries using a purpose-built syntax with commands like &lt;code&gt;fields&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;stats&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, and &lt;code&gt;parse&lt;/code&gt;. It is effective for answering questions like "which Lambda function threw the most errors in the last hour" or "show me all log events matching a specific request ID."&lt;/p&gt;
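&lt;p&gt;For illustration, a query counting ERROR log events over time might look like the following; the match pattern is an assumption about your log format, and real queries depend on how your logs are structured:&lt;/p&gt;

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as error_count by bin(5m)
| sort error_count desc
```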

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdt0hruofbu98p1c4xlff.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdt0hruofbu98p1c4xlff.webp" alt="CloudWatch Logs Insights query interface showing a log query and results" width="800" height="575"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Logs Insights supports structured queries with filtering, aggregation, and time-series visualization of log data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Metrics Insights&lt;/strong&gt; supports a SQL-like syntax for querying metrics across namespaces. It is useful for &lt;a href="https://signoz.io/blog/high-cardinality-data/" rel="noopener noreferrer"&gt;high-cardinality&lt;/a&gt; metric analysis, like finding the top 10 EC2 instances by CPU utilization or comparing latency across Lambda functions.&lt;/p&gt;
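&lt;p&gt;A short example of that SQL-like syntax, sketching the "top 10 instances by CPU" question mentioned above:&lt;/p&gt;

```sql
SELECT AVG(CPUUtilization)
FROM SCHEMA("AWS/EC2", InstanceId)
GROUP BY InstanceId
ORDER BY AVG() DESC
LIMIT 10
```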

&lt;p&gt;The investigation gap is context switching. Metrics, logs, and traces live in separate views within the CloudWatch console. During incident triage, you may find yourself jumping between Logs Insights, the Metrics explorer, and X-Ray trace views to build a complete picture. Each tool works well individually, but correlating signals across them takes manual navigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Monitor
&lt;/h3&gt;

&lt;p&gt;Azure Monitor's investigation experience is centered on Kusto Query Language (KQL), a powerful query language used across Log Analytics workspaces.&lt;/p&gt;

&lt;p&gt;KQL supports rich operations: filtering, aggregation, joins, time-series analysis, rendering charts, and pattern detection. If your team already works with KQL (common in organizations using Microsoft Sentinel or Azure Data Explorer), the ramp-up time is minimal. If not, the learning curve is real, but KQL's expressiveness pays off once your team is comfortable with it.&lt;/p&gt;
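&lt;p&gt;To give a flavor of the language, a KQL query over Application Insights request telemetry might look like this (table and column names follow the classic Application Insights schema; workspace-based tables use slightly different names such as AppRequests):&lt;/p&gt;

```kusto
requests
| where timestamp > ago(1h)
| summarize failures = countif(success == false) by bin(timestamp, 5m), name
| order by failures desc
| render timechart
```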

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0fkx7i79wiw4fj15lra.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0fkx7i79wiw4fj15lra.webp" alt="Azure Monitor Log Analytics showing a KQL query with chart visualization" width="800" height="500"&gt;&lt;/a&gt;&lt;em&gt;KQL queries in Azure Monitor Logs support aggregation, time-series analysis, and inline chart rendering. (Source: Microsoft Azure Docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Azure Monitor can also run KQL queries across multiple Log Analytics workspaces and subscriptions. Cross-tenant querying is also possible when you have delegated access through &lt;a href="https://learn.microsoft.com/en-us/azure/lighthouse/overview" rel="noopener noreferrer"&gt;Azure Lighthouse&lt;/a&gt; and appropriate RBAC permissions. This is valuable for organizations running centralized operations teams or managed-service providers that need visibility across different business units or customer environments.&lt;/p&gt;

&lt;p&gt;Azure Monitor Workbooks add another layer, providing shareable visual reports that combine metrics, logs, and parameter-driven queries into a single view. They serve as investigation templates that teams can reuse during incidents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtexx8ol717vahszj0dg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtexx8ol717vahszj0dg.webp" alt="Azure Monitor Workbooks showing multiple visualization types" width="800" height="600"&gt;&lt;/a&gt;&lt;em&gt;Workbooks combine metrics, log queries, and visualizations into reusable investigation surfaces. (Source: Microsoft Azure Docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The tradeoff is that KQL's power comes with workspace design discipline. Query performance and cost depend on how you structure your workspaces, what data you ingest, and how you write your queries. Poorly designed workspace architectures can make investigation slower and more expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting and Incident Response
&lt;/h2&gt;

&lt;p&gt;Alerting is where monitoring turns into operations. Both tools provide mature alerting capabilities for their ecosystems, but the operational patterns differ.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch
&lt;/h3&gt;

&lt;p&gt;CloudWatch alarms evaluate metric conditions and trigger actions when thresholds are breached.&lt;/p&gt;

&lt;p&gt;The alarm types cover common use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metric alarms&lt;/strong&gt; fire when a metric crosses a threshold for a specified number of evaluation periods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite alarms&lt;/strong&gt; combine multiple alarm states using boolean logic (AND, OR, NOT), helping reduce noise by requiring multiple conditions before triggering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection alarms&lt;/strong&gt; use machine learning models to set dynamic baselines, so you do not need to manually define thresholds for metrics with variable patterns.&lt;/li&gt;
&lt;/ul&gt;
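&lt;p&gt;Composite alarms are defined by a rule expression over the states of other alarms. A minimal example, with illustrative alarm names, pages only when both underlying conditions fire at once:&lt;/p&gt;

```
ALARM("checkout-cpu-high") AND ALARM("checkout-error-rate-high")
```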

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvgwv8m024lszc9g3fmx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvgwv8m024lszc9g3fmx.webp" alt="CloudWatch alarm configuration showing threshold and evaluation settings" width="800" height="449"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch alarms support static thresholds, composite conditions, and anomaly detection baselines.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Alarm actions integrate directly with AWS services: SNS for notifications, Auto Scaling for capacity adjustments, EC2 actions (stop, terminate, reboot), and Lambda for custom remediation. This makes CloudWatch alarms a natural fit for automated response workflows within AWS.&lt;/p&gt;

&lt;p&gt;The operational challenge at scale is alarm hygiene. Teams that create alarms reactively, without regular cleanup, end up with stale or orphaned alarms that add noise and cost. Composite alarms help reduce noise, but they require upfront design work. Community discussions on Reddit frequently mention surprise alarm costs and the difficulty of tracking which alarms are still useful in large environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Monitor
&lt;/h3&gt;

&lt;p&gt;Azure Monitor Alerts provides a unified alerting surface across Azure signals, including platform metrics, log query results, activity log events, and Application Insights data.&lt;/p&gt;

&lt;p&gt;Alert rules define the condition, and action groups define what happens when the condition is met. Action groups can route notifications through email, SMS, webhook, Azure Functions, Logic Apps, ITSM connectors, and more. This separation between detection (alert rules) and response (action groups) makes it easier to reuse notification routing across different alert types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4dfc6sde6zanj0n6ati.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4dfc6sde6zanj0n6ati.webp" alt="Azure Monitor alerts interface showing alert rules and severity levels" width="800" height="356"&gt;&lt;/a&gt;&lt;em&gt;Azure Monitor separates alert detection (rules) from response routing (action groups) for centralized management. (Source: Microsoft Azure Docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Alert rules inherit Azure RBAC, so you can scope who creates and manages alerts by resource group or subscription. Alert processing rules let you suppress or redirect alerts during maintenance windows or based on specific conditions.&lt;/p&gt;

&lt;p&gt;Similar to CloudWatch, the operational challenge is tuning. Broad signal coverage means you can alert on almost anything, but without disciplined rule design and lifecycle management, alert noise grows. Teams running Azure Monitor at scale invest time in alert rule governance, suppression policies, and regular review of alert quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry and Portability
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry (OTel) is the CNCF standard for vendor-neutral telemetry collection, covering traces, metrics, and logs. Both CloudWatch and Azure Monitor support OpenTelemetry, but the portability outcome depends on how deeply your operational workflows depend on each platform's native constructs.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch
&lt;/h3&gt;

&lt;p&gt;AWS maintains the AWS Distro for OpenTelemetry (ADOT), an AWS-supported distribution of the OpenTelemetry Collector and SDKs. ADOT lets you instrument applications using standard OTel APIs and route telemetry to CloudWatch, X-Ray, or other backends.&lt;/p&gt;

&lt;p&gt;If your instrumentation stays OTel-native (using OTel SDKs and semantic conventions rather than AWS-specific SDK calls), your application code remains portable. You can switch backends by changing the collector configuration without re-instrumenting your services.&lt;/p&gt;
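&lt;p&gt;Concretely, the backend swap happens in the collector pipeline. The fragment below is a trimmed sketch of an OpenTelemetry Collector configuration, not a complete ADOT setup; the OTLP endpoint is a placeholder:&lt;/p&gt;

```yaml
exporters:
  awsxray: {}                 # send traces to X-Ray via ADOT
  otlp:
    endpoint: otel-backend.example.com:4317   # placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]    # change to [otlp] to switch backends
```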

&lt;p&gt;The lock-in risk lives in the operations layer. If your team builds runbooks, dashboards, alarm pipelines, and investigation workflows around CloudWatch-specific features (Logs Insights syntax, CloudWatch alarm actions, CloudWatch dashboards), those workflows do not transfer to another platform. The application telemetry may be portable, but the operational muscle memory is not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Monitor
&lt;/h3&gt;

&lt;p&gt;Azure Monitor offers the Azure Monitor OpenTelemetry Distro, which provides auto-instrumentation and manual instrumentation support for .NET, Java, Node.js, and Python applications. The distro routes OTel telemetry into Application Insights and Log Analytics.&lt;/p&gt;

&lt;p&gt;The portability story is similar to CloudWatch. OTel-native instrumentation keeps your application code vendor-neutral. But your KQL queries, Workbook templates, alert rule configurations, and workspace architecture are Azure-specific. If you decide to move to a different backend, you carry over the telemetry pipeline but rebuild the operational layer.&lt;/p&gt;

&lt;p&gt;For teams evaluating portability, the practical question is: how much of your total observability investment is in instrumentation versus operations? The more your team invests in platform-specific query patterns, dashboards, and alert workflows, the higher the switching cost, regardless of which cloud you are on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and Governance
&lt;/h2&gt;

&lt;p&gt;Cost predictability is one of the most common pain points in practitioner discussions about both tools. Neither platform has a simple pricing model, and both can produce surprise bills if you do not manage usage proactively.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch
&lt;/h3&gt;

&lt;p&gt;CloudWatch pricing is multi-dimensional. You are billed across several independent axes.&lt;/p&gt;

&lt;p&gt;The prices below are for US East (N. Virginia) as of February 2026. CloudWatch pricing is region-specific, so rates differ across AWS regions. Always verify against the &lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;official AWS CloudWatch pricing page&lt;/a&gt; for your region.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom metrics:&lt;/strong&gt; $0.30 per metric per month (first 10,000), decreasing at higher volumes. Most basic monitoring metrics sent by AWS services by default are free, but detailed monitoring and custom metrics are billed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs ingestion:&lt;/strong&gt; Starts at $0.50 per GB after the first 5 GB/month free (rates and tiering vary by log type and region).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs storage:&lt;/strong&gt; $0.03 per GB-month for CloudWatch Logs storage (pricing can vary by log class and region). With the default "Never expire" retention setting, stored logs accumulate costs indefinitely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs Insights queries:&lt;/strong&gt; $0.005 per GB of data scanned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alarms:&lt;/strong&gt; $0.10 per standard-resolution metric alarm per month, $0.50 per composite alarm. Anomaly detection alarms cost $0.30 per alarm (standard-resolution) or $0.90 (high-resolution) because each one creates three underlying metrics (the original plus upper and lower bounds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards:&lt;/strong&gt; $3.00 per dashboard per month beyond the first 3 free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API calls:&lt;/strong&gt; GetMetricData, PutMetricData, and other API calls have per-request pricing.&lt;/li&gt;
&lt;/ul&gt;
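&lt;p&gt;Because these axes bill independently, a back-of-envelope model helps. The sketch below multiplies the listed US East rates by an illustrative workload; it ignores volume tiering and free-tier nuances beyond the ones shown, so treat it as an estimate, not a bill:&lt;/p&gt;

```python
# Back-of-envelope CloudWatch bill using the US East rates listed above.
# Workload numbers are illustrative; real bills vary by region and usage.

RATES = {
    "custom_metric": 0.30,    # per metric/month (first 10,000 tier)
    "log_ingest_gb": 0.50,    # per GB after the first 5 GB free
    "log_store_gb": 0.03,     # per GB-month
    "insights_scan_gb": 0.005,
    "std_alarm": 0.10,
    "dashboard": 3.00,        # per dashboard beyond the first 3 free
}

def estimate_monthly_cost(metrics, ingest_gb, stored_gb, scanned_gb,
                          alarms, dashboards):
    cost = metrics * RATES["custom_metric"]
    cost += max(ingest_gb - 5, 0) * RATES["log_ingest_gb"]
    cost += stored_gb * RATES["log_store_gb"]
    cost += scanned_gb * RATES["insights_scan_gb"]
    cost += alarms * RATES["std_alarm"]
    cost += max(dashboards - 3, 0) * RATES["dashboard"]
    return round(cost, 2)

total = estimate_monthly_cost(
    metrics=200, ingest_gb=100, stored_gb=300,
    scanned_gb=500, alarms=50, dashboards=5,
)
```

Even this modest workload lands at roughly $130/month, with log ingestion and custom metrics dominating, which matches the governance advice below about watching each axis separately.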

&lt;p&gt;One thing to watch: enabling advanced features like Container Insights or CloudWatch Synthetics generates charges across multiple buckets simultaneously (logs, metrics, Lambda invocations, S3 storage). Estimating the cost of a feature without accounting for its downstream impact can materially undercount the real spend. For a detailed breakdown of every pricing axis and how they compound, see the &lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;complete CloudWatch pricing guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ay20coqeyqdg8jv45r4.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ay20coqeyqdg8jv45r4.webp" alt="AWS Cost Explorer showing CloudWatch usage type breakdown" width="800" height="311"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch costs spread across multiple usage dimensions, making it important to track each axis separately.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The common cost surprises at scale come from three areas. First, log groups without retention policies accumulate storage costs indefinitely. Second, high-cardinality custom metrics (many unique dimension combinations) multiply metric counts faster than expected. Third, heavy Logs Insights querying during incidents can scan large volumes and spike query costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuyh2n9nzwsc5h49d5zp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuyh2n9nzwsc5h49d5zp.webp" alt="CloudWatch log groups showing 'Never expire' retention setting" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;Log groups default to 'Never expire' retention, which accumulates storage cost over time. Setting explicit retention policies is a common governance step.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Governance practices that help: enforce retention policies on every log group, control metric cardinality through naming conventions and dimension limits, audit alarms regularly for staleness, and monitor CloudWatch spend as its own cost category in AWS Cost Explorer. For a practical walkthrough of cost-reduction strategies, see &lt;a href="https://signoz.io/guides/cloudwatch-cost-optimization-part-2/" rel="noopener noreferrer"&gt;CloudWatch cost optimization techniques&lt;/a&gt;.&lt;/p&gt;
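&lt;p&gt;The retention audit can be scripted. The sketch below filters log-group records shaped like the output of CloudWatch's DescribeLogGroups call, where a missing retentionInDays field means "Never expire"; the group names here are made up:&lt;/p&gt;

```python
# Sketch: flag log groups still on "Never expire" retention.
# Records mimic DescribeLogGroups output, where a log group with no
# retention policy simply omits the retentionInDays key.

def groups_without_retention(log_groups):
    """Return names of log groups that have no retention policy set."""
    return [g["logGroupName"] for g in log_groups
            if "retentionInDays" not in g]

groups = [
    {"logGroupName": "/aws/lambda/checkout", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/orders"},            # never expires
    {"logGroupName": "/ecs/payments", "retentionInDays": 14},
]
stale = groups_without_retention(groups)
```

In a real audit, each flagged group would then get a retention policy applied (for example via the PutRetentionPolicy API), closing the indefinite-storage gap described above.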

&lt;h3&gt;
  
  
  Azure Monitor
&lt;/h3&gt;

&lt;p&gt;Azure Monitor pricing centers on Log Analytics workspace usage, with additional costs for alerting and other platform features.&lt;/p&gt;

&lt;p&gt;Like CloudWatch, Azure Monitor pricing is region-specific. The prices below are approximate pay-as-you-go rates as of February 2026. Always verify against the &lt;a href="https://azure.microsoft.com/en-us/pricing/details/monitor/" rel="noopener noreferrer"&gt;official Azure Monitor pricing page&lt;/a&gt; and use the region filter for your workspace location.&lt;/p&gt;

&lt;p&gt;The major cost components are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log ingestion (Analytics Logs):&lt;/strong&gt; Billed per GB ingested, with the first 5 GB per billing account per month free. Per-GB rates vary by region, and commitment tiers (starting at 100 GB/day) lower the effective rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log retention:&lt;/strong&gt; Analytics Logs include 31 days of retention by default (up to 90 days included when Microsoft Sentinel is enabled, depending on plan and tier). Beyond that, interactive retention (up to 2 years) and long-term retention (up to 12 years) are both billed per GB-month at rates that vary by region. Check the &lt;a href="https://azure.microsoft.com/en-us/pricing/calculator/?service=monitor" rel="noopener noreferrer"&gt;Azure pricing calculator&lt;/a&gt; for current rates in your workspace region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log queries:&lt;/strong&gt; Query cost is included for Analytics Logs. Queries on Basic and Auxiliary Logs are billed per GB scanned, and querying older or archived data can involve additional restore and search-job charges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerts:&lt;/strong&gt; Metric alert rules are priced per monitored time-series. Log search alert rules are priced per evaluation frequency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data export:&lt;/strong&gt; Exporting data from Log Analytics to Storage or Event Hub has per-GB costs.&lt;/li&gt;
&lt;/ul&gt;
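&lt;p&gt;To weigh pay-as-you-go against a commitment tier, a rough model like the one below is enough to find the break-even volume. Both rates are placeholders, not current Azure prices; plug in the figures for your region from the pricing page:&lt;/p&gt;

```python
# Rough comparison of pay-as-you-go vs a 100 GB/day commitment tier for
# Log Analytics ingestion. Rates are PLACEHOLDERS, not current Azure
# prices; check the Azure Monitor pricing page for your region.

PAYG_PER_GB = 2.30            # placeholder pay-as-you-go rate, per GB
TIER_100GB_PER_DAY = 196.00   # placeholder daily price for the tier

def monthly_ingest_cost(gb_per_day, days=30):
    """Return (pay-as-you-go, commitment-tier) monthly ingestion cost."""
    payg = gb_per_day * days * PAYG_PER_GB
    committed = TIER_100GB_PER_DAY * days  # flat, covers up to 100 GB/day
    return round(payg, 2), round(committed, 2)

payg_cost, tier_cost = monthly_ingest_cost(100)
```

At the placeholder rates, a steady 100 GB/day workload is already cheaper on the commitment tier, which is the kind of baseline-volume analysis the paragraph above recommends doing before a large rollout.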

&lt;p&gt;Workspace architecture directly affects cost. A single centralized workspace simplifies querying but can concentrate ingestion costs. Multiple workspaces give you finer cost allocation but add query complexity when you need cross-workspace analysis. DCRs help by letting you control what data gets ingested, filtering out low-value logs before they reach the workspace.&lt;/p&gt;

&lt;p&gt;Teams running Azure Monitor at scale should baseline expected log volume and query patterns before large rollouts. Costs can vary significantly depending on how much data you ingest, how long you retain it, and how frequently you query it.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch vs Azure Monitor at a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;AWS CloudWatch&lt;/th&gt;
&lt;th&gt;Azure Monitor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS-first environments&lt;/td&gt;
&lt;td&gt;Azure-first environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-published for AWS services, custom via API/agent. Tiered retention (15d/63d/455d by granularity).&lt;/td&gt;
&lt;td&gt;Platform metrics auto-collected, 93-day retention. Custom metrics via ingestion API or OTel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CloudWatch Logs with log groups/streams. Logs Insights for querying.&lt;/td&gt;
&lt;td&gt;Log Analytics workspaces with KQL. DCRs for ingestion control.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;APM/Traces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Signals + X-Ray. Requires ADOT or AWS SDK instrumentation.&lt;/td&gt;
&lt;td&gt;Application Insights. Supports Azure Monitor OTel Distro or classic SDKs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logs Insights (custom syntax) + Metrics Insights (SQL-like)&lt;/td&gt;
&lt;td&gt;KQL (Kusto Query Language), cross-workspace capable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric alarms, composite alarms, anomaly detection. Actions via SNS/Lambda/Auto Scaling.&lt;/td&gt;
&lt;td&gt;Alert rules on metrics + logs. Action groups with email/webhook/Functions/Logic Apps routing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS Distro for OpenTelemetry (ADOT)&lt;/td&gt;
&lt;td&gt;Azure Monitor OpenTelemetry Distro (.NET, Java, Node.js, Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-axis: metrics, logs ingest/store, queries, alarms, dashboards, API calls&lt;/td&gt;
&lt;td&gt;Centered on Log Analytics ingestion/retention, with alerting and export costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-account/cross-resource&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-account observability for AWS Organizations&lt;/td&gt;
&lt;td&gt;Cross-workspace and cross-subscription KQL queries (cross-tenant via Azure Lighthouse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IAM-based access, per-resource alarm management&lt;/td&gt;
&lt;td&gt;Azure RBAC, DCR-based collection governance, alert processing rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Choose CloudWatch if&lt;/strong&gt; your primary workloads run on AWS, your team is already familiar with AWS operational patterns, and you need tight integration with AWS automation services (Auto Scaling, Lambda, EventBridge).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Azure Monitor if&lt;/strong&gt; your primary workloads run on Azure, your team is comfortable with KQL or willing to invest in learning it, and you need native governance alignment with Azure RBAC and subscription structures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your environment is genuinely multi-cloud,&lt;/strong&gt; standardize your application instrumentation on OpenTelemetry. Keep native cloud monitoring for infrastructure-level signals where it is strongest, and evaluate whether a unified observability layer can reduce the context switching and operational duplication of running two parallel monitoring stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SigNoz Fits (If Native Tooling Becomes a Bottleneck)
&lt;/h2&gt;

&lt;p&gt;If your team is AWS-first or Azure-first, native monitoring is usually the fastest starting point. But as environments grow, teams often hit friction points that native tools were not designed to solve.&lt;/p&gt;

&lt;p&gt;The three patterns that commonly push teams toward an additional observability layer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Investigation context switching.&lt;/strong&gt; During incidents, jumping between CloudWatch Logs Insights, Metrics Explorer, and X-Ray (or between Azure Monitor Logs, Metrics Explorer, and Application Insights) slows down root-cause analysis. A unified platform that correlates metrics, logs, and traces in a single view reduces the clicks between discovery and resolution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1bcrmq8wi4igdyktlaj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1bcrmq8wi4igdyktlaj.webp" alt="SigNoz trace flamegraph showing span tree across services with related logs and metrics links" width="800" height="459"&gt;&lt;/a&gt;&lt;em&gt;SigNoz trace flamegraph for a request across two services. The span details panel on the right includes direct links to correlated Logs and Metrics, reducing the context switching that happens when investigating across separate tools.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Portability needs.&lt;/strong&gt; If you operate across AWS and Azure (or plan to), maintaining parallel monitoring stacks with different query languages, alert configurations, and dashboards doubles your operational overhead. An OpenTelemetry-native backend lets you standardize instrumentation and route telemetry from any cloud to one place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-governance predictability.&lt;/strong&gt; Both CloudWatch and Azure Monitor have multi-dimensional pricing that can produce surprise bills at scale. Teams looking for simpler, usage-based pricing often evaluate alternatives that charge per GB of ingested data without separate axes for metrics, alarms, dashboards, and API calls.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://signoz.io/docs/introduction/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is an OpenTelemetry-native observability platform that unifies metrics, traces, and logs in a single application. It supports both cloud and self-hosted deployments.&lt;/p&gt;

&lt;p&gt;For AWS-first teams, SigNoz provides &lt;a href="https://signoz.io/docs/integrations/aws/one-click-aws-integrations/" rel="noopener noreferrer"&gt;one-click AWS integrations&lt;/a&gt; to collect CloudWatch metrics and logs. You can also use &lt;a href="https://signoz.io/docs/aws-monitoring/one-click-vs-manual/" rel="noopener noreferrer"&gt;manual OpenTelemetry collection paths&lt;/a&gt; for finer control over what gets forwarded. For Azure-first teams, SigNoz supports &lt;a href="https://signoz.io/docs/azure-monitoring/" rel="noopener noreferrer"&gt;Azure monitoring workflows&lt;/a&gt; including &lt;a href="https://signoz.io/docs/azure-monitoring/bootstrapping/data-ingestion/" rel="noopener noreferrer"&gt;centralized ingestion through Event Hub and OTel collectors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One important caveat: cloud-native forwarding paths (like one-click integrations) still incur provider-side charges for CloudWatch API calls, log delivery, and similar usage. Teams that optimize heavily for cost control usually evaluate the &lt;a href="https://signoz.io/docs/aws-monitoring/one-click-vs-manual/" rel="noopener noreferrer"&gt;manual collection approach&lt;/a&gt; alongside one-click setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started with SigNoz
&lt;/h2&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either the &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between CloudWatch and Azure Monitor is driven primarily by your cloud footprint. If your workloads are on AWS, CloudWatch gives you the fastest path to baseline visibility with the deepest native integration. If your workloads are on Azure, Azure Monitor delivers the same advantage within the Azure ecosystem, with added strength in KQL-based analysis and governance alignment.&lt;/p&gt;

&lt;p&gt;Where both tools require careful attention is cost governance and operational hygiene. Both can produce surprise bills without proactive retention policies, cardinality management, and alert lifecycle reviews. And both introduce switching costs through platform-specific queries, dashboards, and automation workflows, even when your application instrumentation uses OpenTelemetry.&lt;/p&gt;

&lt;p&gt;If you are evaluating both tools, run a time-boxed pilot (2-4 weeks) against real workloads. Measure three things: how quickly your team can go from alert to root cause, how predictable the costs are at your expected log and metric volume, and how much operational overhead the tool adds to your incident response process. Those three signals will tell you more than any feature comparison matrix.&lt;/p&gt;




&lt;p&gt;Hope we answered all your questions regarding AWS CloudWatch vs Azure Monitor. If you have more questions, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from observability nerds at SigNoz, and get open-source, OpenTelemetry, and devtool-building stories straight to your inbox.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>azure</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>AWS X-Ray vs CloudWatch Explained: Metrics, Logs, Traces, and When to Use Each</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Mon, 23 Feb 2026 08:51:24 +0000</pubDate>
      <link>https://dev.to/signoz/aws-x-ray-vs-cloudwatch-explained-metrics-logs-traces-and-when-to-use-each-4jff</link>
      <guid>https://dev.to/signoz/aws-x-ray-vs-cloudwatch-explained-metrics-logs-traces-and-when-to-use-each-4jff</guid>
      <description>&lt;p&gt;AWS CloudWatch and AWS X-Ray are both AWS-native observability tools, but they answer different questions. CloudWatch tells you &lt;em&gt;what&lt;/em&gt; is unhealthy in your system. X-Ray tells you &lt;em&gt;where&lt;/em&gt; in a request path the problem happened.&lt;/p&gt;

&lt;p&gt;They are complementary, not interchangeable. Many AWS teams run both: CloudWatch for metrics, logs, and alerting, and X-Ray for distributed tracing when they need to isolate latency or failures across service boundaries.&lt;/p&gt;

&lt;p&gt;Below, we cover how each tool works, where they overlap, how to use them together in practice, and when the AWS-native stack starts showing its limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon CloudWatch
&lt;/h2&gt;

&lt;p&gt;Amazon CloudWatch is AWS's monitoring and observability service for infrastructure and applications. It collects metrics, aggregates logs, triggers alarms, and powers dashboards across virtually every AWS service.&lt;/p&gt;

&lt;p&gt;CloudWatch answers operational questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this EC2 instance healthy right now?&lt;/li&gt;
&lt;li&gt;Which Lambda function is throwing errors?&lt;/li&gt;
&lt;li&gt;Did CPU utilization cross the threshold I set?&lt;/li&gt;
&lt;li&gt;What do the logs say for this service in the last 30 minutes?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How CloudWatch works
&lt;/h3&gt;

&lt;p&gt;Most AWS services send metrics to CloudWatch automatically. EC2 instances publish CPU, disk I/O, and network metrics out of the box; for memory and filesystem-level metrics, you need the CloudWatch Agent installed on the instance. Lambda functions publish invocation count, duration, and error rate. You do not need to install anything extra for this baseline data, which makes CloudWatch the default starting point for operational monitoring on AWS.&lt;/p&gt;
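&lt;p&gt;For application-level data beyond that baseline, the publish path is the &lt;code&gt;PutMetricData&lt;/code&gt; API. A sketch of the request shape via boto3, parameters only; the namespace, metric name, and dimension values are hypothetical:&lt;/p&gt;

```python
# Shape of a custom-metric publish via the CloudWatch PutMetricData
# API (boto3). Only the request parameters are built here; the actual
# call needs AWS credentials, so it is left commented out.
params = {
    "Namespace": "MyApp",  # hypothetical application namespace
    "MetricData": [
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Service", "Value": "payments"}],
            "Value": 142.0,
            "Unit": "Milliseconds",
        }
    ],
}

# import boto3
# boto3.client("cloudwatch").put_metric_data(**params)
print(params["MetricData"][0]["MetricName"])
```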

&lt;p&gt;For logs, Lambda writes to CloudWatch Log Groups by default. ECS and API Gateway can do the same, but require configuration: ECS needs the &lt;code&gt;awslogs&lt;/code&gt; log driver set in your task definition, and API Gateway needs logging enabled at the stage level.&lt;/p&gt;

&lt;p&gt;Each log group contains log streams (one per function instance, container, or source), and each stream holds individual log events with timestamps. The log event detail is rich: you get exact error messages, stack traces, and runtime context without any additional instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx26appxnq95u1o2udqup.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx26appxnq95u1o2udqup.webp" alt="CloudWatch log events showing runtime errors and structured log data" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch log events from a Lambda function. Each event includes a full timestamp and message payload.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch also includes Logs Insights, a query engine for searching and aggregating log data across log groups. This is where CloudWatch becomes operationally useful for incidents, because raw log streams at high volume are difficult to navigate manually. With Logs Insights, you can write queries in a purpose-built, pipe-based syntax to filter events, extract fields, and visualize patterns across time.&lt;/p&gt;
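&lt;p&gt;As an illustration, a Logs Insights query that counts error events in 5-minute bins might look like the following. The ERROR pattern and the log group named in the comment are hypothetical for your application:&lt;/p&gt;

```python
# An example CloudWatch Logs Insights query (pipe-based syntax):
# count ERROR-level events in 5-minute bins. The /ERROR/ pattern is
# a hypothetical match for your application's log format.
query = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as error_count by bin(5m)
"""

# With boto3 this would be submitted via logs.start_query(
#     logGroupName="/aws/lambda/my-function",
#     queryString=query, startTime=..., endTime=...)
print(query.strip())
```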

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9u6akku0tlc2lebeevl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9u6akku0tlc2lebeevl.webp" alt="CloudWatch Logs Insights query editor with results visualization" width="800" height="575"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Logs Insights lets you query and visualize log data with a purpose-built query syntax, useful for aggregating patterns across high-volume log groups.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Alarm-based workflows are another core strength. You can set threshold-based or anomaly detection alarms on any CloudWatch metric, then trigger SNS notifications, auto-scaling actions, or Lambda-based remediation. For teams that primarily need infrastructure health monitoring and log retention, CloudWatch covers a lot of ground with minimal setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where CloudWatch falls short
&lt;/h3&gt;

&lt;p&gt;The biggest gap shows up when you need to debug distributed latency. CloudWatch logs can tell you &lt;em&gt;that&lt;/em&gt; a request took 500ms, but not &lt;em&gt;which downstream call&lt;/em&gt; added the delay. If your application calls three other services before responding, the logs from each service live in separate log groups with no built-in way to stitch them into a single request timeline.&lt;/p&gt;

&lt;p&gt;Investigation during incidents also involves context switching. Logs, metrics, alarms, and traces (if you use X-Ray) live in separate sections of the CloudWatch console. Each section has its own query interface and navigation. During an outage, rebuilding your mental model as you jump between these views adds time to resolution.&lt;/p&gt;

&lt;p&gt;This is a well-documented frustration among AWS engineers. In &lt;a href="https://www.reddit.com/r/aws/comments/1fsyvhr/cloudwatch_logs_are_almost_useless_how_to_get/" rel="noopener noreferrer"&gt;this r/aws thread&lt;/a&gt;, the core complaint is not about missing data but about investigation ergonomics at scale. Teams describe jumping across many log groups and streams during incidents, struggling to piece together what happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt76xkv2miuzbuyy09nl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt76xkv2miuzbuyy09nl.webp" alt="Reddit thread about CloudWatch Logs usability challenges" width="800" height="212"&gt;&lt;/a&gt;&lt;em&gt;A Reddit thread where an AWS engineer describes the friction of investigating issues across CloudWatch Logs at scale.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch cost model
&lt;/h3&gt;

&lt;p&gt;CloudWatch uses pay-as-you-go pricing with no upfront commitment. The main cost levers are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom metrics&lt;/strong&gt;: Each unique namespace + metric name + dimension combination counts as a separate metric. High-cardinality tags can increase costs quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log ingestion and storage&lt;/strong&gt;: Charged per GB ingested. Log groups retain data indefinitely by default, and storage costs apply for the entire retained volume. Setting shorter retention policies is a common cost optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs Insights queries&lt;/strong&gt;: Charged per GB of data scanned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alarms&lt;/strong&gt;: Charged per alarm metric per month.&lt;/li&gt;
&lt;/ul&gt;
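&lt;p&gt;The custom-metrics line item deserves a quick illustration: because each unique combination of namespace, metric name, and dimension values is billed as a separate metric, dimension cardinality multiplies. A sketch with hypothetical dimension values:&lt;/p&gt;

```python
# Every unique (namespace, metric name, dimension-value combination)
# counts as a separate billable custom metric. This sketch shows how
# many metrics ONE metric name fans out into when tagged with
# high-cardinality dimensions (the values below are hypothetical).
from itertools import product

services = ["checkout", "payments", "search"]
regions = ["us-east-1", "eu-west-1"]
status_codes = ["200", "400", "500"]

combos = list(product(services, regions, status_codes))
print(len(combos))  # 3 * 2 * 3 = 18 billable metrics
```

&lt;p&gt;Adding one more dimension with ten values would multiply that to 180, which is why high-cardinality tags increase costs quickly.&lt;/p&gt;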

&lt;p&gt;The free tier (per month) includes basic monitoring metrics from AWS services, 10 metrics (custom metrics + detailed monitoring metrics), 5 GB of log data (ingestion + archive storage + Logs Insights scanned), 10 alarm metrics, and 3 dashboards.&lt;/p&gt;

&lt;p&gt;For a detailed breakdown and optimization techniques, see the &lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;CloudWatch pricing guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS X-Ray
&lt;/h2&gt;

&lt;p&gt;AWS X-Ray is AWS's distributed tracing service. It captures request-level data as transactions flow through your application, recording the path, timing, and status of each segment along the way. A segment represents a service's work on a request, and can contain subsegments for downstream calls like HTTP requests, database queries, or AWS SDK calls within that service.&lt;/p&gt;

&lt;p&gt;X-Ray answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which part of this request was slow?&lt;/li&gt;
&lt;li&gt;Which downstream service call failed?&lt;/li&gt;
&lt;li&gt;How are my services connected in this transaction?&lt;/li&gt;
&lt;li&gt;What percentage of requests to &lt;code&gt;/enrich&lt;/code&gt; take longer than 200ms?&lt;/li&gt;
&lt;/ul&gt;
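&lt;p&gt;The last question above can be answered directly from sampled trace durations. A minimal sketch with hypothetical numbers:&lt;/p&gt;

```python
# Answering "what percentage of /enrich requests take longer than
# 200ms?" from a set of sampled trace durations (values hypothetical).
# In the X-Ray console, the equivalent filter expression would be
# along the lines of: service("internal-service") AND responsetime > 0.2
durations_ms = [95, 110, 150, 210, 230, 180, 320, 140, 95, 205]

slow = [d for d in durations_ms if d > 200]
pct_slow = 100 * len(slow) / len(durations_ms)
print(f"{pct_slow:.0f}% of sampled requests exceeded 200ms")
```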

&lt;h3&gt;
  
  
  How X-Ray works
&lt;/h3&gt;

&lt;p&gt;X-Ray collects trace data through instrumentation. Each incoming request gets a trace ID, and as the request passes through services (Lambda functions, API Gateway, EC2 instances, DynamoDB calls), each service adds a segment to the trace. Within a segment, subsegments capture downstream calls (HTTP requests, database queries, AWS SDK operations) with their own timing and metadata.&lt;/p&gt;
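&lt;p&gt;The trace ID itself has a small documented structure, propagated between services in the &lt;code&gt;X-Amzn-Trace-Id&lt;/code&gt; header. A sketch of generating one, for illustration only; instrumentation libraries do this for you:&lt;/p&gt;

```python
# Sketch of the X-Ray trace ID format: a version ("1"), the epoch
# time of the original request as 8 hex digits, and 96 random bits
# as 24 hex digits. Downstream calls carry it in X-Amzn-Trace-Id.
import secrets
import time

def new_trace_id():
    epoch_hex = format(int(time.time()), "08x")
    return f"1-{epoch_hex}-{secrets.token_hex(12)}"

def trace_header(trace_id, parent_segment_id, sampled=True):
    flag = "1" if sampled else "0"
    return f"Root={trace_id};Parent={parent_segment_id};Sampled={flag}"

tid = new_trace_id()
print(trace_header(tid, secrets.token_hex(8)))
```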

&lt;p&gt;You can instrument with the X-Ray SDK, the AWS Distro for OpenTelemetry (ADOT), or Powertools for AWS Lambda. For Lambda, enabling Active Tracing is a single toggle. For EC2 or ECS, you run the X-Ray daemon or the ADOT collector as a sidecar.&lt;/p&gt;

&lt;p&gt;The trace data flows into the X-Ray console (accessible within the CloudWatch Traces section), where you get a trace map for service dependency visualization, a filterable trace list, and a trace detail view with a waterfall timeline for each segment. The segment metadata panel is particularly useful: it exposes OTel resource attributes, HTTP route info, runtime details, and status codes for each segment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7x7xp7f5853jd2qigkr.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7x7xp7f5853jd2qigkr.webp" alt="X-Ray trace detail showing segment timeline and metadata panel" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;X-Ray trace detail for a request through internal-service. The Segments Timeline shows /validate (70ms) and /enrich (170ms) as separate segments. The metadata panel on the right reveals OTel resource attributes, runtime version, and HTTP route information for the selected segment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This kind of per-segment breakdown is what makes X-Ray valuable for latency investigations. Instead of guessing which service added the delay, you can see the exact time distribution across the request path. The service dependency visualization through the trace map helps teams understand how services connect and where bottlenecks form, and segment-level metadata gives debugging context (status codes, response sizes, annotations) without needing to search through logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where X-Ray falls short
&lt;/h3&gt;

&lt;p&gt;Traces alone do not replace logs. If your investigation requires the exact error message, stack trace, or payload detail, you still need log data. X-Ray shows you &lt;em&gt;where&lt;/em&gt; in the request chain something went wrong, but the &lt;em&gt;what&lt;/em&gt; (the specific error text or exception) usually lives in CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;Trace-to-log correlation is not automatic in the default setup. In the above X-Ray trace detail, the Logs panel can show "No logs to display for these resources" unless log association is explicitly configured. CloudWatch Application Signals can enable automatic trace-to-log correlation for supported runtimes, but it requires enabling Application Signals and using compatible instrumentation. Outside of that, the two tools are complementary by design but require setup work to link their data during investigations.&lt;/p&gt;

&lt;p&gt;Missing spans are another real operational issue. If exporters, IAM permissions, or trace propagation are misconfigured, a service may appear in CloudWatch Logs but be invisible in X-Ray. The &lt;code&gt;OTLPExporterError: Not Found&lt;/code&gt; pattern observed in this workload's logs is one example: the export pipeline is broken, so traces are incomplete even though log data flows normally.&lt;/p&gt;

&lt;h3&gt;
  
  
  X-Ray cost model
&lt;/h3&gt;

&lt;p&gt;X-Ray pricing is based on traces recorded and traces retrieved/scanned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free tier&lt;/strong&gt;: First 100,000 traces recorded and first 1,000,000 traces retrieved or scanned per month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beyond free tier&lt;/strong&gt;: Charged per million traces recorded and per million traces retrieved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, cost control comes down to &lt;strong&gt;sampling&lt;/strong&gt;. X-Ray supports configurable sampling rules so you do not record every single request in high-traffic production environments. A common pattern from AWS engineers is to use a higher sampling rate in dev/staging and a lower rate in production, adjusting based on traffic volume.&lt;/p&gt;
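&lt;p&gt;A quick sketch shows how strongly the sampling rate drives the bill. The per-million price below is a hypothetical placeholder; check current X-Ray pricing for real numbers:&lt;/p&gt;

```python
# How sampling rate drives X-Ray cost: traces recorded per month at a
# given request volume, minus the free tier, times a HYPOTHETICAL
# per-million price (substitute current X-Ray pricing).
FREE_TRACES = 100_000
PRICE_PER_MILLION = 5.00  # hypothetical USD

def monthly_trace_cost(requests_per_second, sampling_rate):
    recorded = requests_per_second * 86_400 * 30 * sampling_rate
    billable = max(0, recorded - FREE_TRACES)
    return round(billable / 1_000_000 * PRICE_PER_MILLION, 2)

# 50 req/s at 10% sampling vs 100% sampling
print(monthly_trace_cost(50, 0.10))
print(monthly_trace_cost(50, 1.00))
```

&lt;p&gt;At these placeholder rates, dropping from 100% to 10% sampling cuts the trace-recording cost by an order of magnitude, which is why sampling rules are the main cost lever.&lt;/p&gt;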

&lt;h2&gt;
  
  
  AWS X-Ray vs CloudWatch: Key Differences
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Amazon CloudWatch&lt;/th&gt;
&lt;th&gt;AWS X-Ray&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary purpose&lt;/td&gt;
&lt;td&gt;Monitor metrics, logs, alarms, dashboards&lt;/td&gt;
&lt;td&gt;Trace request paths and segment-level latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best question answered&lt;/td&gt;
&lt;td&gt;"What is unhealthy right now?"&lt;/td&gt;
&lt;td&gt;"Where exactly did this request slow down or fail?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data type&lt;/td&gt;
&lt;td&gt;Infrastructure + application metrics, logs, events&lt;/td&gt;
&lt;td&gt;Distributed trace data (segments, subsegments, annotations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Breadth vs depth&lt;/td&gt;
&lt;td&gt;Broad, cross-service monitoring&lt;/td&gt;
&lt;td&gt;Deep, per-request causality analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production pattern&lt;/td&gt;
&lt;td&gt;Always-on baseline monitoring&lt;/td&gt;
&lt;td&gt;Sampled tracing with focused investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup required&lt;/td&gt;
&lt;td&gt;Minimal for AWS services (metrics/logs sent by default)&lt;/td&gt;
&lt;td&gt;Instrumentation required (SDK, ADOT, or Powertools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost drivers&lt;/td&gt;
&lt;td&gt;Custom metrics, log ingestion/storage, Logs Insights queries, alarms&lt;/td&gt;
&lt;td&gt;Traces recorded and retrieved, driven by sampling rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When to use alone&lt;/td&gt;
&lt;td&gt;Infra monitoring, alerting, log retention, compliance&lt;/td&gt;
&lt;td&gt;Root cause analysis for latency/failure in distributed systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CloudWatch covers the "monitor everything" layer. X-Ray covers the "debug this specific request" layer. They solve different problems, and many teams benefit from running both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using CloudWatch and X-Ray Together
&lt;/h2&gt;

&lt;p&gt;In practice, CloudWatch and X-Ray form a natural workflow for understanding application behavior. Here is how they complement each other, based on a hands-on exploration of an active AWS workload running two services (&lt;code&gt;public-api&lt;/code&gt; on EC2 and &lt;code&gt;internal-service&lt;/code&gt; on Lambda).&lt;/p&gt;

&lt;h3&gt;
  
  
  Start with logs to understand application behavior
&lt;/h3&gt;

&lt;p&gt;CloudWatch Logs is typically where you begin. Whether you are investigating an error, verifying a deployment, or understanding request patterns, the log events give you the raw detail of what your application is doing.&lt;/p&gt;

&lt;p&gt;In this workload, filtering log events for "Enrichment complete" in the Lambda log group surfaces structured log entries with &lt;code&gt;request_id&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, and &lt;code&gt;duration_ms&lt;/code&gt; fields. This is useful for spotting slow requests or understanding throughput patterns at the individual event level.&lt;/p&gt;
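&lt;p&gt;Because the events are structured JSON, the same filtering is easy to reproduce outside the console. A sketch with hypothetical sample events shaped like the ones described above:&lt;/p&gt;

```python
# Working with structured log lines like the "Enrichment complete"
# events described above: parse the JSON payload and surface slow
# requests by duration_ms (the sample events are hypothetical).
import json

raw_events = [
    '{"message": "Enrichment complete", "request_id": "a1", "path": "/enrich", "duration_ms": 142}',
    '{"message": "Enrichment complete", "request_id": "b2", "path": "/enrich", "duration_ms": 487}',
    '{"message": "Enrichment complete", "request_id": "c3", "path": "/enrich", "duration_ms": 98}',
]

events = [json.loads(line) for line in raw_events]
slow = [e["request_id"] for e in events if e["duration_ms"] > 300]
print(slow)
```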

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftgubvxn6s6prvbampr5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftgubvxn6s6prvbampr5.webp" alt="CloudWatch log events filtered for Enrichment complete, showing structured fields" width="800" height="451"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Log events filtered for "Enrichment complete." The expanded log entry shows structured fields including request_id, path, and duration_ms, making it possible to identify specific slow requests.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Switch to traces for request-path context
&lt;/h3&gt;

&lt;p&gt;Logs tell you what happened in a single service, but they do not show you how a request flowed across services. X-Ray traces add that cross-service context.&lt;/p&gt;

&lt;p&gt;From the CloudWatch Traces section, you can query by service name. In this workload, filtering for &lt;code&gt;internal-service&lt;/code&gt; returns 304 traces in a 30-minute window, with an average latency of 118ms and ~20.47 requests per minute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfrvdju6ymfcpgsfwyly.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfrvdju6ymfcpgsfwyly.webp" alt="CloudWatch Traces filtered to internal-service showing 304 traces" width="800" height="451"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Traces filtered to internal-service. The query returned 304 traces with summary statistics: average latency 118ms, 20.47 requests/min, and zero faults. The response time distribution histogram is visible below.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Drill into a trace to see segment-level timing
&lt;/h3&gt;

&lt;p&gt;Clicking into an individual trace reveals the segment timeline. For this 231ms trace, the breakdown shows &lt;code&gt;internal-service&lt;/code&gt; handling &lt;code&gt;/validate&lt;/code&gt; in 50ms and &lt;code&gt;/enrich&lt;/code&gt; in 150ms. The metadata panel on the right confirms the route and handler type.&lt;/p&gt;

&lt;p&gt;This is the kind of information that logs alone cannot provide. You can see exactly where time was spent, without guessing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6qrlbr8bwbkyql6tkam.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6qrlbr8bwbkyql6tkam.webp" alt="Trace detail showing segment timeline with /validate and /enrich handler timings" width="800" height="451"&gt;&lt;/a&gt;&lt;em&gt;Trace detail for a 231ms request. The Segments Timeline shows /validate took 50ms and the /enrich request handler took 150ms. The metadata panel confirms http.route: "/enrich" and express.type: "request_handler".&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  View the full cross-service request path
&lt;/h3&gt;

&lt;p&gt;For a complete picture, traces show the full path from &lt;code&gt;public-api&lt;/code&gt; through to &lt;code&gt;internal-service&lt;/code&gt;, including DNS lookups and TLS connections between services. This cross-service view is where X-Ray's value becomes clearest: you can see the entire request lifecycle across service boundaries in a single waterfall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw176gz7lw9rnfaufuvwn.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw176gz7lw9rnfaufuvwn.webp" alt="Full cross-service trace showing public-api and internal-service with DNS and TLS segments" width="800" height="526"&gt;&lt;/a&gt;&lt;em&gt;Full cross-service trace showing public-api (351ms total) calling internal-service (91ms). Individual segments for DNS lookups, TLS connections, and remote HTTP calls are visible in the waterfall, giving a complete picture of where time was spent across the service boundary.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SigNoz Fits If You Already Use CloudWatch and X-Ray
&lt;/h2&gt;

&lt;p&gt;CloudWatch and X-Ray are strong AWS-native building blocks. But the friction of switching between logs, traces, and metrics in separate workflows adds up during incidents. A unified observability platform can reduce that friction.&lt;/p&gt;

&lt;p&gt;SigNoz is an OpenTelemetry-native observability platform that brings metrics, traces, and logs into a single interface. If you are already running CloudWatch and X-Ray, SigNoz reduces the number of context switches during incident triage without requiring you to replace your existing setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  What changes with SigNoz
&lt;/h3&gt;

&lt;p&gt;Instead of jumping from CloudWatch Logs to X-Ray Traces to CloudWatch Logs again, you start from the SigNoz traces explorer. The trace table shows service name, span operation, duration, and status in one queryable view. You can sort, filter, and pivot by span attributes without leaving the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhr1k2yzclnvjxal234e.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhr1k2yzclnvjxal234e.webp" alt="SigNoz traces explorer showing service spans with duration and status" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;SigNoz traces explorer. Root requests from public-api (~308ms) and downstream spans from internal-service (POST /enrich at ~189ms) are visible in the same table with sortable columns.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Clicking into a trace opens the flamegraph view with the full span tree. The span details panel on the right includes a Related Signals section that links directly to correlated Logs and Metrics for that span. That is the context switch that takes multiple clicks and manual correlation in CloudWatch, reduced to a single link in SigNoz.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvhe6pa900sb67d3m0jp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvhe6pa900sb67d3m0jp.webp" alt="SigNoz trace flamegraph showing public-api calling internal-service with span details" width="800" height="459"&gt;&lt;/a&gt;&lt;em&gt;SigNoz trace flamegraph for a POST /process request through public-api (306ms). The span tree shows middleware, request handler, TLS connect, and the downstream POST /enrich call to internal-service (151ms). The span details panel exposes host, HTTP, and runtime attributes alongside direct links to related Logs and Metrics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For infrastructure context, SigNoz also provides pre-built dashboards for AWS services. The EC2 Overview dashboard shows CPU utilization, CPU credits, and EBS read/write metrics per instance, filterable by account and region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlm8v7h1elkpwl68sncf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlm8v7h1elkpwl68sncf.webp" alt="SigNoz EC2 Overview dashboard showing CPU and EBS metrics per instance" width="800" height="425"&gt;&lt;/a&gt;&lt;em&gt;SigNoz EC2 Overview dashboard for ap-south-1. CPU utilization, CPU credits, and EBS I/O metrics are visible per instance, with account and region filters at the top.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Two practical adoption paths
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Path A: Gradual overlay (lowest risk)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep your existing CloudWatch + X-Ray setup.&lt;/li&gt;
&lt;li&gt;Send a copy of AWS telemetry to SigNoz for unified analysis using &lt;a href="https://signoz.io/docs/integrations/aws/one-click-aws-integrations/" rel="noopener noreferrer"&gt;SigNoz's AWS integrations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Shift dashboards and alerts progressively as you validate the workflow.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Path B: OpenTelemetry-first for application telemetry&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instrument your applications with &lt;a href="https://signoz.io/blog/what-is-opentelemetry/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; SDKs.&lt;/li&gt;
&lt;li&gt;Send OTLP data directly to SigNoz.&lt;/li&gt;
&lt;li&gt;Keep CloudWatch for AWS-managed service metrics and log retention where needed.&lt;/li&gt;
&lt;/ol&gt;
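
&lt;p&gt;At the Collector level, Path B is a small pipeline: receive OTLP from your instrumented apps and export it to SigNoz. A minimal sketch of that pipeline (the region, endpoint, and header key are placeholders; check the SigNoz ingestion docs for the exact values for your account):&lt;/p&gt;

```yaml
# Minimal OpenTelemetry Collector config: OTLP in, SigNoz out.
# YOUR_REGION and YOUR_INGESTION_KEY are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: "ingest.YOUR_REGION.signoz.cloud:443"
    headers:
      signoz-ingestion-key: "YOUR_INGESTION_KEY"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

&lt;p&gt;The same pattern extends to &lt;code&gt;metrics&lt;/code&gt; and &lt;code&gt;logs&lt;/code&gt; pipelines once traces are flowing.&lt;/p&gt;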

&lt;h3&gt;
  
  
  Cost caveat
&lt;/h3&gt;

&lt;p&gt;If you use one-click integration paths that route data through CloudWatch pipelines, AWS-side ingestion and storage charges still apply. For tighter cost control, use manual OpenTelemetry-based collection that sends data directly to SigNoz, bypassing CloudWatch for application telemetry. See &lt;a href="https://signoz.io/docs/aws-monitoring/one-click-vs-manual/" rel="noopener noreferrer"&gt;one-click vs manual AWS collection&lt;/a&gt; for details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why OpenTelemetry matters here
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry is the open-source standard for telemetry instrumentation, and when you instrument with OTel SDKs instead of the X-Ray SDK, your tracing code becomes portable. You can send the same trace data to SigNoz, to X-Ray via ADOT, or to any OTLP-compatible backend without changing application code.&lt;/p&gt;
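
&lt;p&gt;In practice, that portability is a deployment-time switch rather than a code change, because OTel SDKs read standard &lt;code&gt;OTEL_*&lt;/code&gt; environment variables. A minimal sketch (the endpoints below are illustrative placeholders):&lt;/p&gt;

```python
import os

# The same OTel-instrumented app can target any OTLP backend;
# only standard environment variables change per deployment.
backends = {
    "signoz": "https://ingest.us.signoz.cloud:443",  # illustrative cloud endpoint
    "local-collector": "http://localhost:4317",      # e.g. ADOT or OTel Collector
}

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = backends["signoz"]
os.environ["OTEL_SERVICE_NAME"] = "public-api"
```

&lt;p&gt;Redeploying against X-Ray via ADOT or any other OTLP backend means changing the endpoint, not re-instrumenting the code.&lt;/p&gt;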

&lt;p&gt;SigNoz is OpenTelemetry-native, so features like &lt;a href="https://signoz.io/blog/tracing-funnels-observability-distributed-systems/" rel="noopener noreferrer"&gt;trace funnels&lt;/a&gt;, &lt;a href="https://signoz.io/docs/application-monitoring/api-monitoring/" rel="noopener noreferrer"&gt;external API monitoring&lt;/a&gt;, and &lt;a href="https://signoz.io/blog/opentelemetry-powered-kafka-celery-monitoring/" rel="noopener noreferrer"&gt;messaging queue monitoring&lt;/a&gt; work directly with OTel semantic conventions. SigNoz pricing is usage-based (per GB for logs/traces and per million metric samples). There is no separate "per-attribute" fee, but any attribute, standard or custom, that increases cardinality can increase time-series count and total samples, and therefore cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is X-Ray part of CloudWatch?
&lt;/h3&gt;

&lt;p&gt;X-Ray is a separate AWS service, but AWS has been integrating it more tightly into the CloudWatch console. The X-Ray trace map and traces list are now accessible from within CloudWatch under the "Traces" section, and AWS merged the X-Ray service map and CloudWatch ServiceLens map into a unified trace map. Architecturally, they still handle different data types (traces vs. metrics/logs), and X-Ray trace costs are metered separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can CloudWatch replace X-Ray?
&lt;/h3&gt;

&lt;p&gt;No. CloudWatch handles metrics, logs, alarms, and dashboards. It does not provide distributed tracing. CloudWatch Logs can capture timestamps and error messages, but cannot show you the request-path breakdown across services that X-Ray provides. You need both for complete observability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are X-Ray traces just structured logs?
&lt;/h3&gt;

&lt;p&gt;No. Traces capture the causal chain of a request across services, with timing for each segment. Structured logs capture individual events with key-value fields. A trace tells you "this request spent 50ms in &lt;code&gt;/validate&lt;/code&gt; and 150ms in &lt;code&gt;/enrich&lt;/code&gt;." A log tells you "at 16:09:13, the &lt;code&gt;/enrich&lt;/code&gt; handler returned status 200 with duration_ms 150." They solve different parts of the debugging puzzle.&lt;/p&gt;
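
&lt;p&gt;The difference is easy to see in the shape of the data itself. A rough sketch, with made-up identifiers:&lt;/p&gt;

```python
# The same 150ms /enrich call, represented two ways (illustrative values).
span = {
    "trace_id": "abc123",       # shared by every span in the same request
    "parent_span": "validate",  # encodes the causal chain across handlers
    "name": "POST /enrich",
    "duration_ms": 150,
}
log_event = {
    "timestamp": "16:09:13",
    "route": "/enrich",
    "status": 200,
    "duration_ms": 150,         # one isolated event, no links to other events
}
# Only the span can answer "what called this, and what did it call next?"
# Only the log carries event-level detail like the status code.
```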

&lt;h3&gt;
  
  
  Is X-Ray cost-effective for production?
&lt;/h3&gt;

&lt;p&gt;It depends on your traffic volume and sampling strategy. The free tier covers 100,000 traces per month. Beyond that, costs scale with trace volume. Teams commonly use sampling rates between 1% and 10% in production to keep costs manageable while maintaining enough trace coverage for incident investigation. Low-traffic services can use higher sampling rates without significant cost impact.&lt;/p&gt;
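
&lt;p&gt;The back-of-the-envelope math is worth doing before picking a sampling rate. A sketch using an illustrative per-million price (check current X-Ray pricing for real numbers; the 100,000-trace free tier comes from above):&lt;/p&gt;

```python
# Rough monthly X-Ray cost under sampling.
# PRICE_PER_MILLION is an illustrative placeholder, not current AWS pricing.
FREE_TRACES = 100_000
PRICE_PER_MILLION = 5.00  # USD per million recorded traces (illustrative)

def monthly_trace_cost(requests_per_month, sampling_rate):
    recorded = requests_per_month * sampling_rate
    billable = max(0.0, recorded - FREE_TRACES)
    return billable / 1_000_000 * PRICE_PER_MILLION

# 50M requests/month at 5% sampling records 2.5M traces, 2.4M billable.
high_traffic = monthly_trace_cost(50_000_000, 0.05)

# A low-traffic service at 100% sampling can still sit inside the free tier.
low_traffic = monthly_trace_cost(80_000, 1.0)
```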

&lt;h3&gt;
  
  
  CloudWatch vs X-Ray vs CloudTrail: what is the difference?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt;: Operational monitoring. Metrics, logs, alarms, dashboards for service health.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X-Ray&lt;/strong&gt;: Distributed tracing. Request-path analysis for latency and failure isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail&lt;/strong&gt;: Audit logging. Records API calls made to AWS services for security, compliance, and governance. CloudTrail answers "who did what and when" at the AWS API level, not at the application request level.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started with SigNoz
&lt;/h2&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either the &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CloudWatch answers "what is unhealthy." X-Ray answers "where in the request path it became unhealthy." Using both together is the standard approach on AWS: they cover different layers of observability, and neither can replace the other.&lt;/p&gt;

&lt;p&gt;For many AWS teams, the recommended workflow is CloudWatch for detection and log detail, X-Ray for request-path causality and segment-level timing. If the context switching between these tools slows your team down, adding an OpenTelemetry-native platform like SigNoz can reduce investigation time without requiring you to rip out your existing AWS-native setup.&lt;/p&gt;




&lt;p&gt;Hope we answered all your questions regarding AWS X-Ray vs CloudWatch. If you have more questions, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from observability nerds at SigNoz: open source, OpenTelemetry, and devtool-building stories straight to your inbox.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>CloudWatch vs Prometheus: Setup, Cost, and Tradeoffs</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Mon, 23 Feb 2026 08:51:23 +0000</pubDate>
      <link>https://dev.to/signoz/cloudwatch-vs-prometheus-setup-cost-and-tradeoffs-bpm</link>
      <guid>https://dev.to/signoz/cloudwatch-vs-prometheus-setup-cost-and-tradeoffs-bpm</guid>
      <description>&lt;p&gt;CloudWatch and Prometheus solve overlapping problems, but they approach monitoring from very different starting points. CloudWatch provides metrics and logs for AWS resources out of the box (and for distributed tracing, provides X-Ray integration). Prometheus is an open-source metrics engine you run and configure yourself, built for flexibility and portability across any environment.&lt;/p&gt;

&lt;p&gt;If your stack is mostly AWS-native services like Lambda and managed databases, CloudWatch gets you to baseline visibility fastest. If you run Kubernetes, operate across multiple clouds, or need deep custom metric analysis with PromQL, Prometheus gives you more control.&lt;/p&gt;

&lt;p&gt;This article compares CloudWatch and Prometheus through hands-on setup, feature coverage, cost patterns, and real practitioner feedback, so you can decide which fits your operating model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up CloudWatch
&lt;/h2&gt;

&lt;p&gt;CloudWatch setup effort varies depending on which AWS service you are monitoring and what level of visibility you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics: Built-in for AWS Services
&lt;/h3&gt;

&lt;p&gt;Most AWS services publish metrics to CloudWatch automatically. When a Lambda function receives its first invocation, CloudWatch starts collecting invocation count, duration, error count, and throttle metrics. EC2 instances publish CPU utilization, network traffic, and disk I/O operations by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futl5olu70d07lhjvuajx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futl5olu70d07lhjvuajx.webp" alt="CloudWatch Lambda metrics showing invocations, duration, and error count" width="800" height="455"&gt;&lt;/a&gt;&lt;em&gt;Lambda metrics appear in CloudWatch automatically once the function receives traffic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These default metrics cover infrastructure-level visibility, but they have gaps. For example, EC2 does not report memory utilization or disk space usage without additional setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing the CloudWatch Agent for Detailed Monitoring
&lt;/h3&gt;

&lt;p&gt;To collect memory, disk space, custom application metrics, and logs, you install the CloudWatch agent on your EC2 instances. The recommended approach uses AWS Systems Manager, which runs the agent installation and configuration through a wizard.&lt;/p&gt;

&lt;p&gt;The wizard walks you through what to collect (CPU, memory, disk, network, etc.), collection intervals, and which log files to ship.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fato08tnzz6vj8ha05ke2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fato08tnzz6vj8ha05ke2.webp" alt="CloudWatch agent configuration wizard showing log file path selection" width="800" height="425"&gt;&lt;/a&gt;&lt;em&gt;The CloudWatch agent wizard guides you through metric and log collection setup.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You still need an IAM role with permissions for the agent to publish metrics and logs (and SSM permissions if using Systems Manager), but the wizard handles the rest of the configuration. You select which log files to collect (application logs, system logs, custom paths), and the agent starts shipping them to CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;Once the agent runs, custom metrics appear under the CWAgent namespace in CloudWatch. Memory usage, disk space percentage, and log files become queryable.&lt;/p&gt;

&lt;p&gt;CloudWatch Logs organizes log data into log groups (per service or application) and log streams (per instance or container). Lambda logs flow automatically. For EC2, the CloudWatch agent ships application and system logs once configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b65hbl208m4f6ke4tc0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b65hbl208m4f6ke4tc0.webp" alt="CloudWatch Logs showing Lambda function log events" width="800" height="455"&gt;&lt;/a&gt;&lt;em&gt;Lambda logs flow to CloudWatch Logs automatically, organized by log group and stream.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Prometheus (Instrumented App)
&lt;/h2&gt;

&lt;p&gt;Prometheus takes a different approach. It exposes its own internal metrics and supports service discovery mechanisms (like Kubernetes SD), but your application metrics require explicit instrumentation. You expose a &lt;code&gt;/metrics&lt;/code&gt; endpoint, configure Prometheus to scrape it, and then query the data using PromQL.&lt;/p&gt;

&lt;p&gt;To create a comparable test, we built a minimal Python HTTP service that exposes Prometheus metrics and ran it alongside Prometheus using Docker Compose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Instrument the Application
&lt;/h3&gt;

&lt;p&gt;The sample app uses the &lt;code&gt;prometheus_client&lt;/code&gt; library to define two metrics, a counter for total HTTP requests and a histogram for request duration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_latest&lt;/span&gt;

&lt;span class="n"&gt;REQUEST_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_http_requests_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total HTTP requests handled by sample app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_http_request_duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request latency in seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The counter increments on every request, tracking the HTTP method, path, and status code as labels. The histogram records how long each request took. Both metrics are exposed at the &lt;code&gt;/metrics&lt;/code&gt; endpoint, which Prometheus will scrape.&lt;/p&gt;
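
&lt;p&gt;Under the hood, a Prometheus histogram is a set of cumulative bucket counters plus a running sum and count: each bucket counts every observation at or below its bound, and the +Inf bucket always equals the total count. A minimal pure-Python sketch of the recording logic (the bucket bounds here are illustrative, not the library defaults):&lt;/p&gt;

```python
import bisect
import math

# Illustrative bucket bounds in seconds; +Inf catches everything.
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, math.inf]

class MiniHistogram:
    """Cumulative buckets, as Prometheus exposes them at /metrics."""

    def __init__(self):
        self.buckets = [0] * len(BOUNDS)
        self.total = 0.0   # corresponds to the _sum series
        self.count = 0     # corresponds to the _count series

    def observe(self, seconds):
        self.count += 1
        self.total += seconds
        # Cumulative semantics: every bucket whose bound is at or above
        # the observed value counts it.
        first = bisect.bisect_left(BOUNDS, seconds)
        for i in range(first, len(BOUNDS)):
            self.buckets[i] += 1

h = MiniHistogram()
for latency in [0.03, 0.12, 0.4, 0.9]:
    h.observe(latency)
# Three of the four observations took 0.5s or less.
```

&lt;p&gt;This cumulative layout is what lets PromQL estimate quantiles from bucket counts at query time with &lt;code&gt;histogram_quantile&lt;/code&gt;.&lt;/p&gt;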

&lt;h3&gt;
  
  
  Step 2: Configure Prometheus to Scrape Targets
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;prometheus.yml&lt;/code&gt; configuration tells Prometheus where to find metrics and how often to pull them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
  &lt;span class="na"&gt;evaluation_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;

&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus:9090"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample-app"&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample-app:8000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration scrapes both Prometheus itself and our sample app every 5 seconds. The &lt;code&gt;static_configs&lt;/code&gt; block lists the hostname and port for each target.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Run the Stack with Docker Compose
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;docker-compose.yml&lt;/code&gt; file brings up both services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sample-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./app&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cw-vs-prom-sample-app&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;

  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus:v3.9.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cw-vs-prom-prometheus&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--config.file=/etc/prometheus/prometheus.yml&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--web.enable-lifecycle&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9091:9090"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./prometheus.yml:/etc/prometheus/prometheus.yml:ro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;docker compose up --build&lt;/code&gt; starts both containers. Prometheus begins scraping the sample app immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Verify Targets and Query Metrics
&lt;/h3&gt;

&lt;p&gt;After startup, the Prometheus Targets page at &lt;code&gt;http://localhost:9091/targets&lt;/code&gt; shows both scrape jobs with an "UP" state and recent scrape timestamps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrgd6ccoza1c3g8eaj4z.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrgd6ccoza1c3g8eaj4z.webp" alt="Prometheus targets page showing both prometheus and sample-app jobs in UP state" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;Prometheus targets page confirming both scrape jobs are healthy and scraping every few seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With some traffic generated against the sample app (a simple curl loop hitting &lt;code&gt;/work&lt;/code&gt;), we can query request rates using PromQL. The query &lt;code&gt;rate(sample_http_requests_total[1m])&lt;/code&gt; shows the per-second request rate broken down by path and status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m09nnsgguao01wg1fme.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m09nnsgguao01wg1fme.webp" alt="Prometheus query results showing rate of sample_http_requests_total broken down by path and status labels" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;PromQL query results in table mode, showing per-second request rate for each path and status combination.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Switching to graph mode with a 5-minute window reveals the traffic pattern clearly. The &lt;code&gt;/work&lt;/code&gt; endpoint shows burst activity while the &lt;code&gt;/metrics&lt;/code&gt; endpoint (scraped by Prometheus itself) stays at a steady baseline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfbmbgfvwe1j25tp5gin.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfbmbgfvwe1j25tp5gin.webp" alt="Prometheus graph showing request rate over a 5-minute window with traffic bursts on the /work endpoint" width="800" height="519"&gt;&lt;/a&gt;&lt;em&gt;Prometheus graph view with a 5-minute window. The spike and plateau correspond to our curl traffic generation.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Effort Comparison
&lt;/h3&gt;

&lt;p&gt;CloudWatch gave us visibility into AWS services with zero setup for default metrics and a one-time agent installation for detailed monitoring. Prometheus required us to write application instrumentation, configure scrape targets, run the infrastructure ourselves, generate traffic, and compose queries.&lt;/p&gt;

&lt;p&gt;That said, the Prometheus approach gives you full control over what gets measured and how it gets queried. The PromQL query &lt;code&gt;rate(sample_http_requests_total[1m])&lt;/code&gt; returned results in tens of milliseconds on our test setup, with exactly the label dimensions we defined, nothing more, nothing less.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Coverage: Metrics, Logs, and Traces
&lt;/h2&gt;

&lt;p&gt;CloudWatch and Prometheus differ significantly in how many observability signals they cover natively.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch Covers Three Signals
&lt;/h3&gt;

&lt;p&gt;CloudWatch is not just a metrics tool. It is an observability platform that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Native AWS service metrics (EC2, Lambda, RDS, ELB, etc.) plus custom metrics via the agent or API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: CloudWatch Logs with log groups, streams, Logs Insights for querying, and live tail for real-time viewing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt;: AWS X-Ray integration for distributed tracing across Lambda, API Gateway, and other AWS services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an AWS-native stack, this means you can go from a Lambda error spike in CloudWatch Metrics, to the error logs in CloudWatch Logs, to the trace in X-Ray without leaving the AWS console.&lt;/p&gt;

&lt;p&gt;The limitation is that CloudWatch's strength drops sharply outside AWS. Monitoring a Kubernetes cluster on GCP or an on-prem database through CloudWatch is technically possible but awkward and expensive compared to purpose-built tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus Covers One Signal Well
&lt;/h3&gt;

&lt;p&gt;Prometheus is a metrics-first system. It does one thing with depth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Pull-based collection with a rich data model (counters, gauges, histograms, summaries) and PromQL for analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: Not supported. You need a companion system like Loki, Elasticsearch, or a centralized logging platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt;: Not supported. You need Jaeger, Zipkin, Tempo, or another tracing backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This single-signal focus is both a strength and a limitation. Prometheus excels at metric collection and analysis. PromQL lets you slice, aggregate, and compare time series in ways that &lt;a href="https://docs.aws.amazon.com/efs/latest/ug/monitoring-metric-math.html" rel="noopener noreferrer"&gt;CloudWatch's metric math&lt;/a&gt; cannot match. But during an incident, you will inevitably need to context-switch to other tools for logs and traces.&lt;/p&gt;

&lt;p&gt;The broader Prometheus ecosystem fills these gaps through integrations. Grafana for visualization, Loki for logs, Tempo for traces, and Alertmanager for alert routing. The tradeoff is that you assemble and operate each of these components yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying: Metric Math vs PromQL
&lt;/h3&gt;

&lt;p&gt;CloudWatch provides metric math for combining and transforming metrics using arithmetic expressions, and Logs Insights for querying log data with its purpose-built, pipe-based query language. Logs Insights is particularly useful during incidents, letting you filter by field, aggregate counts, calculate percentiles, and visualize results across millions of log events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnac27r6ufsromtxiyqir.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnac27r6ufsromtxiyqir.webp" alt="CloudWatch Logs Insights query editor with results visualization" width="800" height="575"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Logs Insights lets you query and visualize log data with a SQL-like syntax.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prometheus's PromQL is purpose-built for time-series analysis. It supports label-based filtering, aggregation across dimensions, rate calculations, histogram quantiles, and subqueries. For teams doing complex metric analysis, PromQL is significantly more expressive.&lt;/p&gt;

&lt;p&gt;For example, calculating the 95th percentile request latency by endpoint in Prometheus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95,
  rate(sample_http_request_duration_seconds_bucket[5m])
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Achieving the same in CloudWatch requires using metric math with percentile statistics, which is less flexible for custom application metrics.&lt;/p&gt;
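&lt;p&gt;To see what &lt;code&gt;histogram_quantile&lt;/code&gt; actually computes, here is a simplified Python sketch of the same bucket interpolation. Real PromQL operates on per-second rates and includes a &lt;code&gt;+Inf&lt;/code&gt; bucket; the bucket counts below are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Approximate a quantile from cumulative histogram buckets the way
# PromQL's histogram_quantile does: find the bucket containing the
# target rank, then interpolate linearly inside it.

def histogram_quantile(q, buckets):
    # buckets: list of (upper_bound, cumulative_count), sorted ascending
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count &gt;= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical: 60 requests under 0.1s, 90 under 0.25s, all 100 under 0.5s
buckets = [(0.1, 60), (0.25, 90), (0.5, 100)]
print(histogram_quantile(0.95, buckets))   # 0.375
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
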

&lt;h3&gt;
  
  
  Alerting: CloudWatch Alarms vs Alertmanager
&lt;/h3&gt;

&lt;p&gt;CloudWatch metric alarms can evaluate a single metric or a metric math expression that combines multiple metrics. You configure the evaluation period, threshold, and comparison operator, then attach actions like SNS notifications, Auto Scaling policies, or EC2 instance operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzsd214x97cil55y63bz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzsd214x97cil55y63bz.webp" alt="CloudWatch alarm configuration showing metric selection and threshold conditions" width="800" height="449"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Alarms let you define thresholds per metric and trigger SNS, Auto Scaling, or EC2 actions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For composite conditions (CPU high AND memory high), you create a composite alarm that references multiple individual alarms. The model is simple but becomes repetitive at scale since each alarm is a standalone resource.&lt;/p&gt;

&lt;p&gt;Prometheus handles alerting differently. You define alerting rules in configuration files that evaluate PromQL expressions against incoming metrics. When a rule fires, Prometheus sends the alert to Alertmanager, a separate component that handles deduplication, grouping related alerts, silencing during maintenance windows, and routing notifications to channels like Slack, PagerDuty, or email.&lt;/p&gt;

&lt;p&gt;This separation gives Prometheus teams more control over alert routing and noise reduction, but it adds another component to operate and configure.&lt;/p&gt;
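&lt;p&gt;For reference, the alerting rules that feed Alertmanager are plain YAML evaluated by Prometheus. This minimal example reuses the histogram metric from the earlier query; the threshold, duration, and labels are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: latency-alerts
    rules:
      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95,
            rate(sample_http_request_duration_seconds_bucket[5m])) &gt; 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency above 500ms for 10 minutes"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
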

&lt;h2&gt;
  
  
  Cost and Operational Overhead
&lt;/h2&gt;

&lt;p&gt;Cost is one of the most debated aspects of CloudWatch vs Prometheus, and users on Reddit consistently flag it as a deciding factor.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch: Usage-Based, Can Creep Up
&lt;/h3&gt;

&lt;p&gt;CloudWatch pricing is usage-based across multiple dimensions and differs across regions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Basic monitoring metrics from AWS services are free. Custom and detailed monitoring metrics are billed per metric-month with volume tiers (e.g., first 10,000 at $0.30, then lower tiers). Each unique dimension combination counts as a separate metric&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API requests&lt;/strong&gt;: $0.01 per 1,000 requests after the free tier (1M requests/month). GetMetricData is always charged and billed by the number of metrics requested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log ingestion&lt;/strong&gt;: Starts at $0.50 per GB in US East (varies by region and log type; vended logs from services like Lambda use volume-based tiers). See the &lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;CloudWatch pricing guide&lt;/a&gt; for a full breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log storage&lt;/strong&gt;: $0.03 per GB per month archived&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt;: First 3 dashboards (up to 50 metrics each) are free, then $3.00 per dashboard per month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alarms&lt;/strong&gt;: $0.10 per standard-resolution alarm metric-month. Composite alarms are $0.50/alarm-month. Anomaly detection alarms bill 3 metrics each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common cost surprises come from log ingestion and custom metrics. Custom metric cardinality is multiplicative, so adding a single high-cardinality tag like &lt;code&gt;customer_id&lt;/code&gt; with 100 unique values can turn 90 metric combinations into 9,000, each billed separately. Log retention defaults to "never expire," meaning storage costs accumulate silently unless you set an explicit policy.&lt;/p&gt;
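&lt;p&gt;The multiplication is easy to sanity-check before you ship a new dimension. A rough sketch, using the first-tier rate from the list above (the starting cardinalities are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Each unique dimension-value combination bills as a separate custom metric.
# Illustrative starting point: 3 endpoints x 6 status codes x 5 regions.
base_combinations = 3 * 6 * 5                 # 90 billable metrics
with_customer_id = base_combinations * 100    # add a 100-value tag: 9,000

rate = 0.30   # dollars per metric-month, first 10,000 tier (US East)
print(round(base_combinations * rate, 2))     # 27.0 dollars/month
print(round(with_customer_id * rate, 2))      # 2700.0 dollars/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
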

&lt;p&gt;A user on &lt;a href="https://www.reddit.com/r/devops/comments/1r2so5d/" rel="noopener noreferrer"&gt;r/devops&lt;/a&gt; shared that serverless log ingestion in CloudWatch accounted for 50% of their total monthly bill:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jgaz06qs3znirhrd6kv.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jgaz06qs3znirhrd6kv.webp" alt="Reddit comment: Serverless Log ingestion in CloudWatch. 50% of total bill per month" width="800" height="178"&gt;&lt;/a&gt;&lt;em&gt;Source: &lt;a href="https://www.reddit.com/r/devops/comments/1r2so5d/comment/o4zedai/" rel="noopener noreferrer"&gt;r/devops&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For a full breakdown of every pricing tier, free tier limits, and optimization strategies, see our &lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;complete CloudWatch pricing guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus: "Free" Has Hidden Costs
&lt;/h3&gt;

&lt;p&gt;Prometheus itself is open-source and has no licensing fee. But the total cost of ownership includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute and storage&lt;/strong&gt;: Prometheus memory scales with active series, label cardinality, scrape interval, and query/rule load. Sizing is empirical, so load-test with your expected series count and retention before committing to hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term storage&lt;/strong&gt;: Prometheus's local TSDB can retain long periods of data, but it is a single-node design. For HA and large-scale retention, teams typically add Thanos, Cortex, or Mimir, each with its own infrastructure cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations time&lt;/strong&gt;: Upgrades, capacity planning, federation setup, and debugging scrape failures are ongoing responsibilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High cardinality risk&lt;/strong&gt;: Adding labels with many unique values (like user IDs or request IDs) causes series count to explode, degrading query performance and increasing storage costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managed Prometheus services (Amazon Managed Service for Prometheus, Grafana Cloud) eliminate some operational overhead but introduce their own pricing based on active series and ingestion volume.&lt;/p&gt;
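&lt;p&gt;For an initial capacity guess before load-testing, a back-of-envelope estimate helps frame the infrastructure side of the cost. The per-series memory figure below is an assumption (real usage varies widely with churn, label sizes, and Prometheus version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-envelope Prometheus memory estimate. The bytes-per-series figure
# is a rough assumption; load-test with your own workload as noted above.
active_series = 1_000_000
bytes_per_series = 3_000     # assumption: a few KB per active series
headroom = 1.5               # query and rule-evaluation overhead

estimated_gb = active_series * bytes_per_series * headroom / 1e9
print(estimated_gb)          # 4.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
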

&lt;h2&gt;
  
  
  CloudWatch vs Prometheus: Side-by-Side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;CloudWatch&lt;/th&gt;
&lt;th&gt;Prometheus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;AWS-native workloads (Lambda, EC2, RDS, managed services)&lt;/td&gt;
&lt;td&gt;Kubernetes, multi-cloud, hybrid environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup effort&lt;/td&gt;
&lt;td&gt;Low for AWS services (defaults); medium for detailed monitoring (agent)&lt;/td&gt;
&lt;td&gt;High (instrumentation + scrape config + infrastructure)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;AWS service metrics automatic; custom metrics via agent or API&lt;/td&gt;
&lt;td&gt;Custom app metrics via instrumentation, PromQL query depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;Native log groups/streams, Logs Insights SQL-like queries&lt;/td&gt;
&lt;td&gt;Not a log system, needs Loki or external solution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traces&lt;/td&gt;
&lt;td&gt;AWS X-Ray integration for Lambda and other AWS services&lt;/td&gt;
&lt;td&gt;Not a tracing backend, needs Jaeger/Tempo/Zipkin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Querying&lt;/td&gt;
&lt;td&gt;Console-driven, metric math, Logs Insights&lt;/td&gt;
&lt;td&gt;PromQL with label-based filtering and rich aggregation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost model&lt;/td&gt;
&lt;td&gt;Usage-based (metrics, logs, queries, custom metrics charged separately)&lt;/td&gt;
&lt;td&gt;Infrastructure + storage + operations time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portability&lt;/td&gt;
&lt;td&gt;AWS-locked&lt;/td&gt;
&lt;td&gt;Vendor-neutral, runs anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;CloudWatch Alarms (per-metric thresholds, composite alarms)&lt;/td&gt;
&lt;td&gt;Alertmanager (flexible routing, grouping, silencing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecosystem&lt;/td&gt;
&lt;td&gt;Tight AWS integration, weaker outside AWS&lt;/td&gt;
&lt;td&gt;Grafana, exporters, OpenTelemetry, broad community&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Hybrid Patterns: Running Both
&lt;/h2&gt;

&lt;p&gt;Some teams with mixed infrastructure (AWS and non-AWS) end up running CloudWatch and Prometheus together.&lt;/p&gt;

&lt;p&gt;A typical hybrid pattern looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; handles AWS-native signals: Lambda metrics/logs, RDS performance insights, ELB access logs, and S3 event monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; handles application and platform metrics: custom app counters, Kubernetes pod metrics, service mesh telemetry, and infrastructure running outside AWS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A visualization layer&lt;/strong&gt; (Grafana or a unified platform) brings both data sources into one set of dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works because CloudWatch and Prometheus have complementary strengths. CloudWatch's zero-setup Lambda monitoring does not conflict with Prometheus's deep PromQL-driven metric analysis for your Kubernetes services.&lt;/p&gt;

&lt;p&gt;The challenge with this pattern is correlation. When an incident involves both an AWS-native service (monitored by CloudWatch) and an application service (monitored by Prometheus), you end up switching between tools to piece together the full picture. This context-switching slows down root cause analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unified Observability and Deep Correlations with SigNoz
&lt;/h2&gt;

&lt;p&gt;The hybrid CloudWatch + Prometheus pattern works, but investigating incidents with it means jumping between separate consoles: CloudWatch Logs Insights for log queries, the CloudWatch Metrics console for dashboards, X-Ray for traces, Prometheus's Graph UI for PromQL analysis, and Grafana for combined visualization. Each tool uses different query syntax, and context does not carry over when you switch between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is an all-in-one observability platform built natively on &lt;a href="https://signoz.io/opentelemetry/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; that eliminates this context-switching. Metrics, traces, and logs live in a single interface, so you can jump from a latency spike to the responsible trace to the related logs without leaving one screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtjjh80sec09jjz86k29.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtjjh80sec09jjz86k29.webp" alt="SigNoz's distributed trace flamegraph showing POST /process latency dominated by a downstream POST /enrich call with logs and metrics correlation" width="800" height="455"&gt;&lt;/a&gt;&lt;em&gt;SigNoz distributed trace flamegraph showing POST /process latency dominated by a downstream POST /enrich call with logs and metrics correlation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch's billing adds friction on top of the fragmented experience: log ingestion charges that vary by log type and region, custom metrics at $0.30/metric/month per dimension combination (in US East), Logs Insights charges of $0.005/GB scanned per query, and additional &lt;code&gt;GetMetricData&lt;/code&gt; API charges when third-party tools poll metrics. SigNoz uses simple, usage-based pricing ($0.3/GB for logs and traces, $0.1/million metric samples) with no per-host charges or per-query fees.&lt;/p&gt;
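&lt;p&gt;To make the rate difference concrete, here is a directional sketch at illustrative monthly volumes, using only the rates quoted above. It ignores CloudWatch storage, alarm, and API line items, so treat it as a shape comparison rather than a quote:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative volumes; rates as quoted in this article (US East).
log_gb = 200                 # monthly log volume
custom_metrics = 2_000       # active custom metric series
samples_millions = custom_metrics * 43_200 / 1e6   # one sample/minute, 30 days

cw_logs = log_gb * 0.50              # CloudWatch ingestion, $/GB
cw_metrics = custom_metrics * 0.30   # CloudWatch custom metrics, $/metric-month
signoz_logs = log_gb * 0.3           # SigNoz, $/GB
signoz_metrics = samples_millions * 0.1   # SigNoz, $/million samples

print(round(cw_logs + cw_metrics, 2))
print(round(signoz_logs + signoz_metrics, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
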

&lt;h3&gt;
  
  
  Getting AWS and Prometheus Data into SigNoz
&lt;/h3&gt;

&lt;p&gt;The most common adoption path teams follow is to keep CloudWatch as a telemetry source for AWS-managed services while making SigNoz the primary investigation and alerting layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Metric Streams&lt;/strong&gt; push AWS infrastructure metrics to SigNoz via the OTel Collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Log subscriptions&lt;/strong&gt; forward logs to SigNoz.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application-level OTel instrumentation&lt;/strong&gt; sends traces and metrics directly to SigNoz, replacing both Prometheus scraping and X-Ray collection with a single pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SigNoz offers a &lt;a href="https://signoz.io/docs/integrations/aws/one-click-aws-integrations/" rel="noopener noreferrer"&gt;one-click AWS integration&lt;/a&gt; that uses CloudFormation and Firehose to set up the data pipeline automatically. For specific services, there are dedicated guides for &lt;a href="https://signoz.io/docs/aws-monitoring/ec2/" rel="noopener noreferrer"&gt;EC2&lt;/a&gt;, &lt;a href="https://signoz.io/docs/aws-monitoring/lambda/lambda-metrics/" rel="noopener noreferrer"&gt;Lambda metrics&lt;/a&gt;, &lt;a href="https://signoz.io/docs/aws-monitoring/lambda/lambda-logs/" rel="noopener noreferrer"&gt;Lambda logs&lt;/a&gt;, and &lt;a href="https://signoz.io/docs/aws-monitoring/lambda/lambda-traces/" rel="noopener noreferrer"&gt;Lambda traces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since SigNoz is &lt;a href="https://signoz.io/docs/introduction/" rel="noopener noreferrer"&gt;built on OpenTelemetry&lt;/a&gt;, if you are already using OTel collectors or SDKs, the instrumentation stays the same. You point your exporter to SigNoz instead of CloudWatch or Prometheus. This also means no vendor lock-in in your instrumentation code: if you decide to switch backends later, the application-side code does not change.&lt;/p&gt;

&lt;p&gt;Start by sending a copy of your data to SigNoz while keeping CloudWatch and Prometheus active. Verify that SigNoz covers your observability needs, then gradually shift your primary monitoring over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started with SigNoz
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz Cloud&lt;/a&gt; is the easiest way to get started. It includes a 30-day free trial with access to all features, so you can test it against your real AWS workloads before committing.&lt;/p&gt;

&lt;p&gt;If you have data privacy or residency requirements and cannot send telemetry outside your infrastructure, check out the &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For teams that want to start with a free, self-managed option, the &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;SigNoz community edition&lt;/a&gt; is available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CloudWatch and Prometheus are not direct substitutes; they are complementary tools with different strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; = fastest path to AWS-native observability, with metrics, logs, and X-Ray traces available through a managed experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; = flexible, portable metrics analysis with PromQL depth that CloudWatch cannot match, ideal for Kubernetes and multi-cloud environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SigNoz&lt;/strong&gt; = unified observability workspace that brings metrics, traces, and logs together across your stack with an OpenTelemetry-native approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your decision depends on your operating model. If your stack is mostly AWS and you need speed, start with CloudWatch. If you run multi-environment infrastructure and need PromQL depth, go with Prometheus. If you are running both and want to stop switching between fragmented consoles during incidents, SigNoz handles the observability side while letting you keep your existing telemetry sources.&lt;/p&gt;




&lt;p&gt;Hope we answered all your questions regarding CloudWatch vs Prometheus. If you have more questions, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from observability nerds at SigNoz: open source, OpenTelemetry, and devtool-building stories straight to your inbox.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>CloudWatch Cost Optimization Playbook [Part 2]</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Mon, 23 Feb 2026 08:50:00 +0000</pubDate>
      <link>https://dev.to/signoz/cloudwatch-cost-optimization-playbook-part-2-1cf9</link>
      <guid>https://dev.to/signoz/cloudwatch-cost-optimization-playbook-part-2-1cf9</guid>
      <description>&lt;p&gt;CloudWatch Pricing &amp;amp; Cost Optimization Series&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;Previous article: CloudWatch Pricing Without the Confusion [Part 1]&lt;/a&gt; Part 2 of 2&lt;/p&gt;

&lt;p&gt;CloudWatch cost optimization is not a one-time cleanup. It is an ongoing loop where you find the driver, make a change, verify the impact, and repeat.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, we explored every CloudWatch billing bucket — metrics, logs, alarms, dashboards, API requests, vended logs, and advanced features. This guide assumes you understand those mechanics and are ready to do something about them.&lt;/p&gt;

&lt;p&gt;This guide will help you establish a baseline, apply optimizations bucket by bucket, understand why some optimizations fail, and evaluate when CloudWatch's pricing structure itself becomes the obstacle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 - Establish Your Baseline
&lt;/h2&gt;

&lt;p&gt;You cannot reduce what you have not measured. Before making any changes, you need a clear picture of where your CloudWatch spend is going.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using AWS Cost Explorer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html" rel="noopener noreferrer"&gt;AWS Cost Explorer&lt;/a&gt; is the fastest way to see your CloudWatch spend by service and usage type.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Cost Explorer&lt;/strong&gt; in the AWS Console.&lt;/li&gt;
&lt;li&gt;Filter by &lt;strong&gt;Service = CloudWatch&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Group by &lt;strong&gt;Usage Type&lt;/strong&gt; to break down costs across different billing axes (metrics, logs, alarms, API requests, etc.).&lt;/li&gt;
&lt;li&gt;Set the time range to the last 3 months to identify trends.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Usage Type dimension is the critical grouping. It separates &lt;code&gt;CW:MetricMonitorUsage&lt;/code&gt; and &lt;code&gt;CW:AlarmMonitorUsage&lt;/code&gt; from log-related usage types like &lt;code&gt;DataProcessing-Bytes&lt;/code&gt; and &lt;code&gt;DataProcessingIA-Bytes&lt;/code&gt; (typically region-prefixed in Cost Explorer, e.g., &lt;code&gt;USE1-DataProcessing-Bytes&lt;/code&gt;), giving you the bucket-level breakdown that the invoice summary hides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xrl4m2ceprow6l35xe6.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xrl4m2ceprow6l35xe6.webp" alt="AWS Cost Explorer filtered to CloudWatch, grouped by Usage Type — showing how costs break down across metrics, logs, and other billing axes over time" width="800" height="311"&gt;&lt;/a&gt;&lt;em&gt;Cost Explorer grouped by Usage Type reveals which CloudWatch billing buckets are driving your spend. (credits: AWS Docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If Cost Explorer shows that 70% of your CloudWatch spend is log ingestion, you do not need to optimize alarms. Go directly to the log ingestion section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using CloudWatch Usage Metrics
&lt;/h3&gt;

&lt;p&gt;CloudWatch publishes its own usage metrics in the &lt;code&gt;AWS/Usage&lt;/code&gt; namespace. These give you near-real-time visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CallCount&lt;/code&gt;&lt;/strong&gt; for API operations (how many &lt;code&gt;PutMetricData&lt;/code&gt;, &lt;code&gt;GetMetricData&lt;/code&gt; calls you are making).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ResourceCount&lt;/code&gt;&lt;/strong&gt; for metrics and alarms (how many custom metrics or alarms exist right now).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can set alarms on these usage metrics to catch runaway growth before it hits your invoice. For example, alarm on &lt;code&gt;ResourceCount&lt;/code&gt; for custom metrics with a threshold that represents 120% of your expected count.&lt;/p&gt;
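&lt;p&gt;A sketch of what that alarm's parameters could look like with boto3. The dimension values and names below are illustrative — copy the exact dimensions from the &lt;code&gt;AWS/Usage&lt;/code&gt; namespace in your console, then create the alarm with &lt;code&gt;boto3.client("cloudwatch").put_metric_alarm(**params)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alarm parameters for the AWS/Usage ResourceCount metric, firing when
# the custom metric count passes 120% of an expected baseline.
# Dimension values are illustrative; verify them in your console.
expected_metric_count = 5_000

params = {
    "AlarmName": "custom-metric-count-runaway",
    "Namespace": "AWS/Usage",
    "MetricName": "ResourceCount",
    "Dimensions": [
        {"Name": "Service", "Value": "CloudWatch"},
        {"Name": "Type", "Value": "Resource"},
        {"Name": "Resource", "Value": "Metrics"},
        {"Name": "Class", "Value": "None"},
    ],
    "Statistic": "Maximum",
    "Period": 3600,
    "EvaluationPeriods": 1,
    "Threshold": expected_metric_count * 12 // 10,   # 120% of baseline
    "ComparisonOperator": "GreaterThanThreshold",
}
print(params["Threshold"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
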

&lt;h2&gt;
  
  
  Step 2 - The Optimization Playbook, Bucket by Bucket
&lt;/h2&gt;

&lt;p&gt;With your baseline in hand, you know which buckets are costing you the most. The sections below cover each one — what drives the cost, how to bring it down, and how to confirm it actually dropped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics: The Dimension Problem
&lt;/h3&gt;

&lt;p&gt;Most metric cost comes from one place: the number of unique metric time series, which is driven by dimensions. As we covered in Part 1, each unique combination of metric name, namespace, and dimension values counts as a separate billable metric.&lt;/p&gt;

&lt;p&gt;The first thing to check is whether you are paying for metrics nobody uses. Open &lt;strong&gt;Metrics → All metrics&lt;/strong&gt; in the CloudWatch console and look for namespaces with metrics that have no associated alarms or dashboards. You can use the &lt;code&gt;ListMetrics&lt;/code&gt; API to enumerate active metrics, then cross-reference with &lt;code&gt;DescribeAlarms&lt;/code&gt; and &lt;code&gt;GetDashboard&lt;/code&gt; to find metrics not attached to any alarm or dashboard. To check whether a metric is actually being read by your tooling, review &lt;a href="https://signoz.io/comparisons/cloudwatch-vs-cloudtrail/" rel="noopener noreferrer"&gt;CloudTrail&lt;/a&gt; for &lt;code&gt;GetMetricData&lt;/code&gt; and &lt;code&gt;GetMetricStatistics&lt;/code&gt; calls over the past 30 days. Note that &lt;code&gt;ListMetrics&lt;/code&gt; only returns metrics that have received data in the past two weeks, so treat it as a view of currently active metrics, not a complete usage-history report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lnzvziuckizqntrgk3h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lnzvziuckizqntrgk3h.webp" alt="CloudWatch All Metrics page showing 612 metrics across custom and AWS namespaces like EC2, EBS, ECS, and CWAgent" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;The All Metrics view shows every namespace contributing to your metric count — each one is a potential cost driver.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, review your &lt;code&gt;PutMetricData&lt;/code&gt; calls for dimension cardinality. If any dimension has more than ~50 unique values (status codes, endpoint names, customer tiers), then evaluate whether you can aggregate at the application level instead. Publishing &lt;code&gt;api.latency&lt;/code&gt; with dimensions &lt;code&gt;{endpoint, status_code, region, customer_tier}&lt;/code&gt; creates far more billable metrics than aggregating down to &lt;code&gt;{endpoint, status_code}&lt;/code&gt; and tracking the rest in your application logs.&lt;/p&gt;
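&lt;p&gt;A quick way to evaluate a proposed dimension set is to multiply the cardinalities before publishing anything. The cardinalities below are illustrative for a mid-size API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Billable series for api.latency under two dimension sets.
endpoints, status_codes, regions, customer_tiers = 40, 6, 4, 5

full_set = endpoints * status_codes * regions * customer_tiers
reduced_set = endpoints * status_codes

print(full_set)      # 4800 billable metrics
print(reduced_set)   # 240 billable metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
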

&lt;p&gt;Finally, check whether you are paying for resolution you do not need. For example, &lt;a href="https://signoz.io/guides/ec2-monitoring/" rel="noopener noreferrer"&gt;EC2 detailed monitoring&lt;/a&gt; publishes metrics every 60 seconds and charges custom metric rates. Basic monitoring publishes every 5 minutes for free. For non-critical instances, switching to basic monitoring eliminates those charges entirely.&lt;/p&gt;

&lt;p&gt;To verify: compare &lt;code&gt;CW:MetricMonitorUsage&lt;/code&gt; in Cost Explorer against your baseline after making changes. Since metrics are prorated hourly, reductions show up within the same billing cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs: Cut the Ingestion, Not the Visibility
&lt;/h3&gt;

&lt;p&gt;Log ingestion is almost always the largest line item on a CloudWatch bill. At $0.50 per GB, it does not take much volume to generate meaningful spend. The goal is to control what you ingest, how long you keep it, and where it ends up.&lt;/p&gt;

&lt;p&gt;CloudWatch log groups default to "Never expire" retention, and most teams never change it. Storage at $0.03/GB-month is cheap enough that nobody notices until a year of logs has accumulated and the storage line item keeps growing with no upper bound.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3bn7qr053jaj2i07xbj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3bn7qr053jaj2i07xbj.webp" alt="CloudWatch Log Groups page showing 5 log groups all with retention set to 'Never expire'" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;Every log group here shows 'Never expire' — a common default that silently accumulates storage costs.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List log groups with no retention set&lt;/span&gt;
aws logs describe-log-groups &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[?!retentionInDays].logGroupName'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set retention to the minimum your compliance requires: 7 days for debug logs, 30 days for application logs, 90 days for audit logs. If you need long-term storage, export to S3 first as S3 storage rates are significantly cheaper than CloudWatch log storage.&lt;/p&gt;
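&lt;p&gt;The gap is easy to quantify. The CloudWatch rate is from this article; the S3 rates used below ($0.023/GB-month for Standard, $0.00099/GB-month for Glacier Deep Archive, US East) are assumptions to confirm against current AWS pricing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Monthly storage cost for 2 TB of archived logs (illustrative volume).
archived_gb = 2_000

cloudwatch = archived_gb * 0.03
s3_standard = archived_gb * 0.023
deep_archive = archived_gb * 0.00099

print(round(cloudwatch, 2))    # 60.0
print(round(s3_standard, 2))   # 46.0
print(round(deep_archive, 2))  # 1.98
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
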

&lt;p&gt;If you are sending logs to both CloudWatch and another destination, ask whether CloudWatch needs to retain them at all. Use subscription filters to forward logs directly to the final destination and set CloudWatch retention to 1 day. You pay for ingestion either way, but you stop paying for storage.&lt;/p&gt;

&lt;p&gt;For log groups that are primarily archival (things you rarely query but need to keep), switch to the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatch_Logs_Log_Classes.html" rel="noopener noreferrer"&gt;Infrequent Access log class&lt;/a&gt; at $0.25/GB, half the standard rate. Logs Insights queries still work on this class. What you lose is Live Tail, metric filters, alarming from logs, and data protection, so only use this for log groups where you were never using those features anyway.&lt;/p&gt;

&lt;p&gt;The highest-leverage move is filtering at the source, because the cheapest log is the one you never ingest. Check whether DEBUG-level logs are enabled in production, whether health check requests are being logged on every hit, and whether duplicate logs are being sent from multiple agents or sidecars — all common sources of unnecessary ingestion volume.&lt;/p&gt;

&lt;p&gt;To verify: monitor ingestion volume via CloudWatch's own &lt;code&gt;IncomingBytes&lt;/code&gt; metric on each log group, or check Cost Explorer's &lt;code&gt;CW:DataProcessing-Bytes&lt;/code&gt; usage type. Changes should be visible within 24-48 hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alarms: Audit the Orphans, Downgrade the Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Alarm costs are driven by two things: how many you have, and what type they are. The first problem is usually orphaned alarms, the ones targeting metrics or resources that no longer exist.&lt;/p&gt;

&lt;p&gt;Alarms stuck in &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; for weeks are almost certainly pointing at deleted resources, costing money every month for zero value — delete them.&lt;/p&gt;

&lt;p&gt;The second problem is anomaly detection alarms used where static thresholds would work fine. As we covered in Part 1, each anomaly detection alarm uses 3 metrics internally (the metric plus upper and lower bounds), costing $0.30/month at standard resolution (3× the standard rate) or $0.90/month at high resolution. If a metric has stable, well-understood behavior (CPU above 80%, disk above 90%), a static threshold does the same job at a fraction of the cost. Reserve anomaly detection for metrics with genuinely seasonal or unpredictable baselines.&lt;/p&gt;
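&lt;p&gt;The savings from this conversion are small per alarm but add up across a fleet. A quick sketch using the standard-resolution rates above (the alarm count is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Savings from converting anomaly detection alarms on stable metrics
# to static thresholds.
converted = 40            # illustrative number of alarms converted

anomaly_rate = 0.10 * 3   # 3 metrics billed per anomaly detection alarm
static_rate = 0.10

monthly_savings = converted * (anomaly_rate - static_rate)
print(round(monthly_savings, 2))   # 8.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
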

&lt;p&gt;If you have groups of alarms that a human always checks together, consider consolidating them into a single &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html" rel="noopener noreferrer"&gt;composite alarm&lt;/a&gt; ($0.50/month). The underlying metric alarms still exist, but the composite alarm can replace redundant notification alarms and reduce operational noise.&lt;/p&gt;

&lt;p&gt;To verify: count your alarms before and after using &lt;code&gt;describe-alarms&lt;/code&gt; and compare by type. Anomaly detection replacements show up in the next billing cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashboards: Consolidate or Delete
&lt;/h3&gt;

&lt;p&gt;Dashboards are usually a small line item ($3.00 each beyond the 3 free), but they accumulate in teams that create one per microservice per environment. If you have 50 dashboards, that is $141/month.&lt;/p&gt;

&lt;p&gt;Check whether multiple dashboards can be merged; a single dashboard monitoring 3 related services is cheaper than 3 separate ones. Any dashboard nobody has viewed in 60+ days is probably safe to delete.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Requests: Watch for Polling Loops
&lt;/h3&gt;

&lt;p&gt;API request costs are low for most teams, but they spike when external tools poll CloudWatch at high frequency. Third-party monitoring tools or custom scripts that call &lt;code&gt;GetMetricData&lt;/code&gt; or &lt;code&gt;GetMetricStatistics&lt;/code&gt; every few seconds can generate millions of requests per month.&lt;/p&gt;

&lt;p&gt;If you have tools scraping CloudWatch to feed a separate system, consider switching to Metric Streams (push-based) instead of polling (pull-based), though note that Metric Streams have their own per-update cost. If polling is the only option, check whether you can reduce the frequency. Changing a 10-second interval to 60 seconds is a 6× reduction in API calls.&lt;/p&gt;
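
&lt;p&gt;A quick model of how the polling interval drives API request costs, at $0.01 per 1,000 requests with 1M free per month (the 20 calls per polling pass is an assumed figure):&lt;/p&gt;

```python
# API request cost at a given polling interval. "Calls per poll" is however
# many GetMetricData-style requests one polling pass makes; the value used
# below is illustrative.

def monthly_api_cost(interval_seconds, calls_per_poll, days=30, free_tier=1_000_000):
    calls = (days * 86_400 / interval_seconds) * calls_per_poll
    billable = max(0, calls - free_tier)
    return calls, billable / 1_000 * 0.01

calls_10s, cost_10s = monthly_api_cost(10, 20)
calls_60s, cost_60s = monthly_api_cost(60, 20)
print(int(calls_10s), round(cost_10s, 2))  # 5184000 41.84
print(int(calls_60s), round(cost_60s, 2))  # 864000 0.0
```

Note how the 60-second interval not only cuts call volume 6× but drops it back under the free tier entirely.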

&lt;h3&gt;
  
  
  Vended Logs: Filter Before Delivery
&lt;/h3&gt;

&lt;p&gt;VPC Flow Logs, Route 53 resolver query logs, and API Gateway access logs can generate massive volumes. Vended logs are billed per GB delivered, and the effective cost varies by destination because pricing tiers and rates can differ by destination type and region, and destinations like S3 or Firehose add their own service charges. Regardless of destination, you can reduce what gets delivered in the first place.&lt;/p&gt;

&lt;p&gt;Consider using &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html" rel="noopener noreferrer"&gt;flow log filtering&lt;/a&gt; to capture only rejected traffic or specific subnets, and adjusting the aggregation interval based on your actual analysis needs. Every GB you do not generate is a GB you do not pay to deliver or store.&lt;/p&gt;

&lt;p&gt;If you primarily need long-term storage and occasional batch analysis rather than real-time querying, consider routing vended logs to S3 instead of CloudWatch Logs. You will still incur vended-log delivery charges, and S3 adds its own storage and request costs — but &lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;S3 Standard&lt;/a&gt; storage runs ~$0.023/GB-month compared to CloudWatch's $0.03/GB-month, and S3 lifecycle policies can transition old logs to Glacier or Deep Archive for even cheaper archival. Compare totals for your destination and region before committing.&lt;/p&gt;
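
&lt;p&gt;As a back-of-the-envelope comparison of the storage component alone (delivery charges and S3 request costs are excluded, so treat this as a floor, not full TCO):&lt;/p&gt;

```python
# Cumulative 12-month storage cost for logs that are kept but rarely
# queried: CloudWatch Logs at $0.03/GB-month vs S3 Standard at
# ~$0.023/GB-month. Assumes a steady 100 GB/month landing in storage.

def cumulative_storage_cost(gb_per_month, rate, months):
    # Each month's data keeps being billed every month after it lands.
    return sum(gb_per_month * rate * (months - m) for m in range(months))

cw = cumulative_storage_cost(100, 0.030, 12)
s3 = cumulative_storage_cost(100, 0.023, 12)
print(round(cw, 2), round(s3, 2))  # 234.0 179.4
```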

&lt;h2&gt;
  
  
  Step 3 - Why Optimization Sometimes Fails
&lt;/h2&gt;

&lt;p&gt;You run through the playbook, cut metrics, set retention policies, remove orphan alarms — and then three months later, the bill is back where it started (or higher). This is frustrating, and it happens because some cost growth is structural rather than operational.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your infrastructure is growing
&lt;/h3&gt;

&lt;p&gt;More services, more instances, more log volume. If your application fleet grows 20% quarter-over-quarter, your CloudWatch costs grow at least 20% even if your per-unit costs stay the same. Optimization buys you headroom, but it does not change the growth curve.&lt;/p&gt;

&lt;h3&gt;
  
  
  New features get enabled without cost awareness
&lt;/h3&gt;

&lt;p&gt;A teammate enables Container Insights on a new cluster, or a deployment pipeline starts publishing detailed monitoring on all instances. Each of these decisions is individually reasonable, but they compound. Without a process that reviews CloudWatch cost impact before enabling features, optimization becomes a recurring cleanup job rather than a permanent fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cardinality creeps back in
&lt;/h3&gt;

&lt;p&gt;You clean up metric dimensions today, but a new service deployment next month adds a high-cardinality tag. Without guardrails (linting &lt;code&gt;PutMetricData&lt;/code&gt; calls, enforcing allowlists for dimensions), the problem returns.&lt;/p&gt;

&lt;h3&gt;
  
  
  The optimization ceiling exists
&lt;/h3&gt;

&lt;p&gt;There is a floor below which CloudWatch costs cannot go for a given workload. If you run 50 services that each need metrics, logs, and alarms, the base cost of monitoring those 50 services in CloudWatch is fixed by the pricing structure. You can optimize the usage, but you cannot change the rates.&lt;/p&gt;

&lt;p&gt;You hit this ceiling once you have already removed waste, consolidated alarms, filtered logs, and reduced dimensions; the remaining cost is the structural price of using CloudWatch as your observability platform. At this point, the question changes from "how do I optimize my CloudWatch usage?" to "is CloudWatch the right pricing model for my workload?"&lt;/p&gt;

&lt;h2&gt;
  
  
  When Optimization Is Not Enough - Evaluating the Platform
&lt;/h2&gt;

&lt;p&gt;If you have done the work in this playbook and your CloudWatch costs are still above your budget, the issue may not be your usage patterns. CloudWatch's pricing model bills across 7+ independent axes, each with its own unit, rate, and free tier, and the total grows with every new service you monitor.&lt;/p&gt;

&lt;p&gt;This is where evaluating an alternative platform becomes a practical decision. Teams that have gone through this process typically report three pain points that optimization alone cannot fix:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing unpredictability:&lt;/strong&gt; CloudWatch's multi-dimensional billing makes it hard to forecast costs. DEBUG-level logging left enabled in production can produce large daily ingest bills. Adding a single high-cardinality dimension to custom metrics can multiply costs overnight. Teams report spending more time understanding and predicting their CloudWatch bill than actually reducing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident-time UX friction:&lt;/strong&gt; During an outage, you need logs, metrics, traces, and service health in one place. CloudWatch spreads these across separate consoles with different query languages and navigation patterns. Trace-to-log correlation often requires manually copying IDs between interfaces. The context-switching tax is highest exactly when speed matters most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-query billing during investigations:&lt;/strong&gt; Logs Insights charges per GB scanned, and API polling adds incremental costs. During incidents, when engineers run dozens of exploratory queries against large log groups, the billing model works against the urgency. Teams report hesitating before running broad queries because of cost awareness, which slows down debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  SigNoz as a practical alternative
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is a unified observability platform built natively on OpenTelemetry that combines metrics, traces, and logs in a single interface with usage-based pricing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs and traces:&lt;/strong&gt; $0.3 per GB ingested. Pricing is ingestion-based with configurable hot/cold retention — there is no per-query scan charge like CloudWatch Logs Insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; $0.1 per million samples. Dashboards and alerts are included, not billed as additional line items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-host or per-user pricing:&lt;/strong&gt; Your bill scales with data volume, not infrastructure size or team headcount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since SigNoz is &lt;a href="https://signoz.io/docs/introduction/" rel="noopener noreferrer"&gt;built on OpenTelemetry&lt;/a&gt;, if you are already using OTel collectors or SDKs, the instrumentation stays the same. You point your exporter to SigNoz instead of CloudWatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting AWS data into SigNoz
&lt;/h3&gt;

&lt;p&gt;The most common adoption path teams follow is to keep CloudWatch as a telemetry source for AWS-managed services while making SigNoz the primary investigation and alerting layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Metric Streams&lt;/strong&gt; push AWS infrastructure metrics to SigNoz via the OTel Collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Log subscriptions&lt;/strong&gt; forward logs to SigNoz.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application-level OTel instrumentation&lt;/strong&gt; sends traces and metrics directly to SigNoz, bypassing CloudWatch entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SigNoz offers a &lt;a href="https://signoz.io/docs/integrations/aws/one-click-aws-integrations/" rel="noopener noreferrer"&gt;one-click AWS integration&lt;/a&gt; that uses CloudFormation and Firehose to set up the data pipeline automatically. For teams that want more control over costs, a &lt;a href="https://signoz.io/docs/aws-monitoring/one-click-vs-manual/" rel="noopener noreferrer"&gt;manual OTel Collector setup&lt;/a&gt; gives you finer-grained control over what gets collected and forwarded.&lt;/p&gt;

&lt;p&gt;Start by sending a copy of your data to SigNoz while keeping CloudWatch active. Verify that SigNoz covers your observability needs, then gradually shift your primary monitoring over. This avoids a risky all-at-once migration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS-side cost caveat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exporting data via Metric Streams or log subscriptions adds AWS charges. For Metric Streams, you pay the per-update fee ($0.003/1K updates). For log forwarding via subscription filters, you still pay CloudWatch Logs ingestion ($0.50/GB) plus whatever the downstream destination charges. These AWS-side costs go away only when you stop routing data through CloudWatch and instrument directly with OTel. The &lt;a href="https://signoz.io/docs/aws-monitoring/one-click-vs-manual/" rel="noopener noreferrer"&gt;one-click integration path incurs additional AWS charges&lt;/a&gt; compared to the manual OTel Collector setup, so factor this into your migration plan.&lt;/p&gt;
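
&lt;p&gt;A rough sketch of the dual-ship phase's AWS-side cost, using the rates above (the metric count and log volume are illustrative):&lt;/p&gt;

```python
# Ballpark AWS-side cost of dual-shipping: Metric Streams at $0.003 per
# 1,000 updates, plus log forwarding that still pays CloudWatch ingestion
# at $0.50/GB. The figures plugged in below are illustrative.

def metric_stream_cost(metrics, updates_per_metric_per_min, days=30):
    updates = metrics * updates_per_metric_per_min * 60 * 24 * days
    return updates / 1_000 * 0.003

def forwarded_log_cost(gb_per_month):
    return gb_per_month * 0.50  # CloudWatch ingestion still applies

# 500 streamed metrics updating once a minute, plus 200 GB of forwarded logs
total = metric_stream_cost(500, 1) + forwarded_log_cost(200)
print(round(total, 2))  # 164.8
```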

&lt;h2&gt;
  
  
  Get Started with SigNoz
&lt;/h2&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either the &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CloudWatch cost optimization follows a loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measure&lt;/strong&gt; — Use Cost Explorer and Usage Metrics to find the biggest buckets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce&lt;/strong&gt; — Apply bucket-specific moves: dimension cleanup, retention policies, alarm audit, log filtering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; — Check that costs actually dropped using the same measurement tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt; — New deploys, new features, and infrastructure growth reintroduce costs. Optimization is ongoing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the loop stops producing results because you have cut the waste and the remaining cost is the structural price of CloudWatch's billing model, it is time to evaluate whether a platform with a different pricing structure is the more sustainable path.&lt;/p&gt;

&lt;h3&gt;
  
  
  You've reached the end of the series!
&lt;/h3&gt;

&lt;p&gt;Congratulations on completing the "CloudWatch Pricing &amp;amp; Cost Optimization Series".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;Previous&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: &lt;em&gt;All prices are for the US East (N. Virginia) region as of February 2026. Verify against official pricing pages for current rates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>CloudWatch Pricing Without the Confusion [Part 1]</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Mon, 23 Feb 2026 08:49:55 +0000</pubDate>
      <link>https://dev.to/signoz/cloudwatch-pricing-without-the-confusion-part-1-3n1d</link>
      <guid>https://dev.to/signoz/cloudwatch-pricing-without-the-confusion-part-1-3n1d</guid>
      <description>&lt;p&gt;CloudWatch Pricing &amp;amp; Cost Optimization Series&lt;/p&gt;

&lt;p&gt;Part 1 of 2 &lt;a href="https://signoz.io/guides/cloudwatch-cost-optimization/" rel="noopener noreferrer"&gt;Next article: CloudWatch Cost Optimization Playbook [Part 2]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your CloudWatch bill is a combination of at least seven different feature buckets, each with its own billing unit, free-tier boundary, and per-region scope. Some features bill per GB, others per metric-hour, others per alarm, and a few per API request, making it difficult to plan and predict observability costs.&lt;/p&gt;

&lt;p&gt;In this guide, we unravel CloudWatch pricing: first the main billing caveats that make it hard to predict, then a simplified breakdown of every pricing bucket with rates and free tiers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/guides/cloudwatch-cost-optimization/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; covers what to do about it including optimization moves, verification steps, and when CloudWatch alone stops being cost-effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch Pricing Details You Should Be Aware Of
&lt;/h2&gt;

&lt;p&gt;CloudWatch started as a simple metrics and alarms service. Over time, AWS added Logs, Insights queries, Container Insights, Application Signals, X-Ray, RUM, Synthetics, and more. Each feature came with its own pricing model, and none of them were retroactively unified.&lt;/p&gt;

&lt;p&gt;The result is a pricing page that reads like a legal document, because structurally it is one. Before getting into the rates, let's discuss the AWS CloudWatch billing behaviors that catch teams off guard.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Dimensions Multiply Your Metric Costs
&lt;/h3&gt;

&lt;p&gt;CloudWatch bills &lt;a href="https://signoz.io/blog/cloudwatch-metrics/" rel="noopener noreferrer"&gt;custom metrics&lt;/a&gt; based on the number of unique metric time series you publish. Each unique combination of metric name, namespace, and dimension values counts as a separate billable metric, at $0.30/month for the first 10,000.&lt;/p&gt;

&lt;p&gt;The problem is that dimensions are multiplicative, and adding a single high-cardinality tag can turn one metric name into thousands of billable metrics.&lt;/p&gt;

&lt;p&gt;Say you want to monitor API latency. You publish a metric called &lt;code&gt;api.latency&lt;/code&gt; for 10 endpoints, tagged by &lt;code&gt;status_code&lt;/code&gt; (3 values) and &lt;code&gt;region&lt;/code&gt; (3 values):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;10 endpoints × 3 status codes × 3 regions = &lt;strong&gt;90 unique metrics = $27/month&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is for one metric name. Now add a &lt;code&gt;customer_id&lt;/code&gt; dimension with 100 unique values:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;90 × 100 = &lt;strong&gt;9,000 unique metrics&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your bill:&lt;/strong&gt; $2,700/month for a single metric name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your expectation:&lt;/strong&gt; "I added one tag"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the single most common cause of &lt;a href="https://signoz.io/blog/6-silent-traps-inside-cloudWatch-that-can-hurt-your-observability/" rel="noopener noreferrer"&gt;CloudWatch bill surprises&lt;/a&gt;. Teams running services with &lt;code&gt;request_id&lt;/code&gt; or &lt;code&gt;user_id&lt;/code&gt; as metric dimensions have seen this generate six-figure monthly charges.&lt;/p&gt;
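
&lt;p&gt;The arithmetic above can be sketched as a small estimator; the tier boundaries come from the rate card later in this guide:&lt;/p&gt;

```python
# Billable custom-metric count is the product of dimension cardinalities,
# priced against CloudWatch's tiered rates ($0.30 first 10K, $0.10 next
# 240K, $0.05 next 750K). A sketch, not an official billing calculator.

import math

def metric_count(cardinalities):
    return math.prod(cardinalities)

def monthly_cost(n):
    tiers = [(10_000, 0.30), (240_000, 0.10), (750_000, 0.05)]
    cost, remaining = 0.0, n
    for size, rate in tiers:
        take = min(remaining, size)
        cost += take * rate
        remaining -= take
        if remaining == 0:
            break
    return cost

base = metric_count([10, 3, 3])           # endpoints x status_code x region
with_customer = metric_count([10, 3, 3, 100])  # add customer_id (100 values)
print(base, round(monthly_cost(base), 2))                  # 90 27.0
print(with_customer, round(monthly_cost(with_customer), 2))  # 9000 2700.0
```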

&lt;h3&gt;
  
  
  You Pay Three Times for Logs
&lt;/h3&gt;

&lt;p&gt;CloudWatch Logs pricing is a three-part tariff. You pay separately to collect logs, store them, and then search them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingest:&lt;/strong&gt; $0.50 per GB to collect your logs into CloudWatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store:&lt;/strong&gt; $0.03 per GB-month to keep them in compressed archival storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query:&lt;/strong&gt; $0.005 per GB scanned every time you run a Logs Insights query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means operational debugging has a direct, per-query cost that scales with your total log volume, not with how much useful data the query returns. If your query scans 500 GB (for example, by searching a broad time range) and matches only 12 lines, you still pay for scanning all 500 GB: &lt;code&gt;500 × $0.005 = $2.50&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;During incidents, this works against you. An engineer debugging a production outage runs dozens of exploratory queries against large log groups, each one scanning hundreds of GB. A 4-hour incident with heavy querying can rack up more Logs Insights charges than the entire previous week of normal usage.&lt;/p&gt;

&lt;p&gt;The pricing model penalizes teams for using the tool at the exact moment they need it most. And as your team grows — more engineers running more queries over more log data — query costs scale along both axes. Teams end up fighting against adoption internally, which is the opposite of why you introduce centralized logging in the first place.&lt;/p&gt;

&lt;p&gt;The problem gets worse if you query a single monolithic log group with months of accumulated data. If that log group holds 500 GB and your team runs 20 queries a day, &lt;code&gt;500 × $0.005 × 20 × 30 = $1,500/month&lt;/code&gt; in query costs alone, regardless of how many results those queries return.&lt;/p&gt;
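
&lt;p&gt;The same arithmetic as a quick calculator; the scoped-query example assumes scanned GB scales with the queried time range, which holds when queries are limited to a narrower window:&lt;/p&gt;

```python
# Logs Insights charges per GB scanned, regardless of rows returned.
# Quick model of monthly query spend for a team against one log group.

def query_spend(gb_scanned_per_query, queries_per_day, days=30):
    return gb_scanned_per_query * 0.005 * queries_per_day * days

# 20 queries/day, each scanning the full 500 GB monolithic log group
print(round(query_spend(500, 20), 2))   # 1500.0
# The same queries scoped to roughly a 7-day slice (~19.4 GB, illustrative)
print(round(query_spend(19.4, 20), 2))  # 58.2
```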

&lt;p&gt;And storage itself is a slow-building trap. CloudWatch Logs default to &lt;strong&gt;indefinite retention&lt;/strong&gt; — "Never expire." The storage rate of $0.03/GB-month is low enough that nobody notices it at first. But if your application ingests 100 GB/month and you never set a retention policy, after 12 months you've ingested 1.2 TB of logs. Your billed archived storage will typically be lower due to compression, but costs still grow monotonically unless you set retention.&lt;/p&gt;
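
&lt;p&gt;A sketch of how the month-12 storage bill compares with and without a retention policy, assuming an illustrative 5:1 compression ratio (actual compression varies by log content):&lt;/p&gt;

```python
# Storage bill in a given month with N months of logs retained, at
# $0.03/GB-month on compressed (archived) bytes. The compression ratio
# here is an assumption for illustration.

def storage_bill_at_month(months_retained, gb_per_month=100, compression=5, rate=0.03):
    stored_gb = gb_per_month * months_retained / compression
    return stored_gb * rate

print(round(storage_bill_at_month(12), 2))  # 7.2  (never-expire, month 12)
print(round(storage_bill_at_month(1), 2))   # 0.6  (30-day retention)
```

The absolute numbers stay small, which is exactly why the growth goes unnoticed until the retained volume is large.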

&lt;h3&gt;
  
  
  The 3× Anomaly Detection Alarm Markup
&lt;/h3&gt;

&lt;p&gt;CloudWatch offers four types of alarms, and the price difference between them is significant:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alarm Type&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard resolution (60-sec)&lt;/td&gt;
&lt;td&gt;$0.10/alarm metric/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High resolution (10-sec)&lt;/td&gt;
&lt;td&gt;$0.30/alarm metric/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Composite&lt;/td&gt;
&lt;td&gt;$0.50/alarm/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anomaly detection (standard-res)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.30/alarm/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anomaly detection (high-res)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.90/alarm/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pricing works out this way because each anomaly detection alarm generates three metrics internally: the original metric plus the upper and lower bounds of the expected band. So a standard-resolution anomaly detection alarm costs &lt;code&gt;3 × $0.10 = $0.30/month&lt;/code&gt;, which is 3× a single standard alarm.&lt;/p&gt;

&lt;p&gt;That multiplier compounds when teams switch all their alarms to anomaly detection "for safety" without realizing the cost difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20 standard alarms:&lt;/strong&gt; $2/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20 standard-res anomaly detection alarms:&lt;/strong&gt; $6/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20 high-res anomaly detection alarms:&lt;/strong&gt; $18/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unless you genuinely need dynamic thresholds that adapt to seasonal patterns, the cost premium buys you very little.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enabling One Feature Charges You Across Multiple Buckets
&lt;/h3&gt;

&lt;p&gt;When you turn on Container Insights for an EKS cluster, it starts publishing CloudWatch metrics (billed under the Metrics bucket) and ingesting performance logs with 700 extra bytes of metadata per log line (billed under Logs). A 10-node EKS cluster in standard mode generates roughly 320 custom metrics — that alone is $96/month before any log costs.&lt;/p&gt;

&lt;p&gt;Synthetics is worse. Each canary run costs $0.0012, but the real bill comes from the services each run triggers. Five canaries running every 5 minutes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Canary run charges:&lt;/strong&gt; $53.57/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda execution charges:&lt;/strong&gt; $14.89/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Logs charges:&lt;/strong&gt; $3.55/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 storage charges:&lt;/strong&gt; $1.25/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom metric charges:&lt;/strong&gt; $0.90/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alarm charges:&lt;/strong&gt; $0.50/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: $74.66/month&lt;/strong&gt; — of which $54.07 is the Synthetics canary + alarm portion. You will also see CloudWatch Logs ($3.55) and CloudWatch Metrics ($0.90) charges on your CloudWatch bill, plus Lambda ($14.89) and S3 ($1.25) charges under those respective services. The real cost is spread across multiple billing categories, making it hard to attribute back to Synthetics.&lt;/p&gt;
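
&lt;p&gt;Reconstructing that total as line items makes the cross-bucket spread explicit (the $0.50 entry is the standard alarm implied by the $54.07 canary-plus-alarm subtotal):&lt;/p&gt;

```python
# The five-canary Synthetics example, broken out by the billing bucket
# each charge lands in. Values are the figures from the example above.

line_items = {
    "synthetics_canary_runs": 53.57,
    "cloudwatch_alarm": 0.50,
    "lambda_execution": 14.89,
    "cloudwatch_logs": 3.55,
    "s3_storage": 1.25,
    "custom_metrics": 0.90,
}

total = round(sum(line_items.values()), 2)
canary_plus_alarm = line_items["synthetics_canary_runs"] + line_items["cloudwatch_alarm"]
print(total, round(canary_plus_alarm, 2))  # 74.66 54.07
```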

&lt;h3&gt;
  
  
  Free Tiers That Don't Pool
&lt;/h3&gt;

&lt;p&gt;Core CloudWatch free-tier quotas (metrics, logs, dashboards, alarms) apply independently per feature bucket and are commonly scoped per account per region. Your 10 free custom metrics and your 5 GB free log data are completely separate allowances. If you use 0 custom metrics and 15 GB of logs, you still pay for 10 GB of log ingestion — the unused metric free tier does not help.&lt;/p&gt;

&lt;p&gt;A few application-observability free tiers are scoped differently: Application Signals offers 3 months of free usage per account (with limits), and RUM includes a first-time trial of 1 million web events per account. These do not follow the per-region pattern.&lt;/p&gt;

&lt;p&gt;Multi-region deployments multiply the core free tiers. If you operate in 3 regions, you do not get 15 GB of free log data total; instead, you get 5 GB per region. If all your logs land in one region, the other two regions' free tiers go to waste.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simplified Breakdown of CloudWatch Pricing
&lt;/h2&gt;

&lt;p&gt;With the caveats out of the way, here is a simplified rate card for the most common billing buckets. All prices are for US East (N. Virginia) as of February 2026 — always verify against the &lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;official AWS CloudWatch pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Billing Unit&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Free Tier (per region)&lt;/th&gt;
&lt;th&gt;Key Gotcha&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per metric/month (prorated hourly)&lt;/td&gt;
&lt;td&gt;$0.30 (first 10K), $0.10 (next 240K), $0.05 (next 750K)&lt;/td&gt;
&lt;td&gt;10 custom metrics and detailed monitoring metrics&lt;/td&gt;
&lt;td&gt;Each added dimension multiplies your billable metric count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per GB&lt;/td&gt;
&lt;td&gt;$0.50 (standard), $0.25 (Infrequent Access)&lt;/td&gt;
&lt;td&gt;Shared 5 GB/month&lt;/td&gt;
&lt;td&gt;Largest cost driver for most teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per GB-month (compressed)&lt;/td&gt;
&lt;td&gt;$0.03&lt;/td&gt;
&lt;td&gt;Shared 5 GB/month&lt;/td&gt;
&lt;td&gt;Accumulates indefinitely if no retention policy is set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per GB scanned (Insights), per minute (Live Tail)&lt;/td&gt;
&lt;td&gt;$0.005/GB scanned, $0.01/minute&lt;/td&gt;
&lt;td&gt;Shared 5 GB data; 1,800 Live Tail minutes&lt;/td&gt;
&lt;td&gt;You pay per GB scanned, not per result returned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alarms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per alarm metric/month (prorated hourly)&lt;/td&gt;
&lt;td&gt;$0.10 (standard), $0.30 (high-res), $0.50 (composite)&lt;/td&gt;
&lt;td&gt;10 standard alarms&lt;/td&gt;
&lt;td&gt;Anomaly detection uses 3 metrics per alarm (3× standard cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dashboards&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per dashboard/month&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;3 dashboards (up to 50 metrics each)&lt;/td&gt;
&lt;td&gt;Adds up when teams create one per microservice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Requests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per 1,000 requests&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;td&gt;1,000,000 requests&lt;/td&gt;
&lt;td&gt;Usually small; spikes when external tools poll frequently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vended Logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per GB delivered (tiered)&lt;/td&gt;
&lt;td&gt;$0.50/GB (first 10 TB) → $0.05/GB (50+ TB)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Also incurs downstream storage and query charges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metric Streams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per 1,000 metric updates&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Bills per update, not per unique metric — volume grows fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container Insights&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per million observations (enhanced) or standard metric/log rates&lt;/td&gt;
&lt;td&gt;$0.21/million (EKS enhanced)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Generates metrics &lt;em&gt;and&lt;/em&gt; logs billed at standard rates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application Signals&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per million signals&lt;/td&gt;
&lt;td&gt;$1.50 (first 100M) → $0.75 (next 900M) → $0.30 (1B+)&lt;/td&gt;
&lt;td&gt;3-month free trial per account (limits apply)&lt;/td&gt;
&lt;td&gt;Enables X-Ray tracing by default, adding trace charges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Synthetics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per canary run&lt;/td&gt;
&lt;td&gt;$0.0012&lt;/td&gt;
&lt;td&gt;100 canary runs/month&lt;/td&gt;
&lt;td&gt;Real cost is Lambda + S3 + Logs + metrics per run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per 100K events (web) or per GB (mobile)&lt;/td&gt;
&lt;td&gt;$1.00 / 100K events or $0.35 / GB&lt;/td&gt;
&lt;td&gt;First-time trial: 1M web RUM events/account&lt;/td&gt;
&lt;td&gt;Scales directly with user traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Ray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per trace stored/scanned&lt;/td&gt;
&lt;td&gt;$0.000005/trace stored&lt;/td&gt;
&lt;td&gt;100K traces stored; 1M traces scanned&lt;/td&gt;
&lt;td&gt;Often enabled silently by Application Signals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  A Few Things Worth Noting About This Table
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prorating works in your favor — sometimes.&lt;/strong&gt; CloudWatch prorates metrics and alarms on an hourly basis. You're billed for the fraction of hours the resource exists in the month: &lt;code&gt;(hours active ÷ hours in month) × monthly rate&lt;/code&gt;. But prorating also makes estimation harder, because your bill depends on exactly when things were created or deleted.&lt;/p&gt;
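
&lt;p&gt;As a sketch of that formula (730 is an assumed average month length in hours; AWS bills against the actual hours in the month):&lt;/p&gt;

```python
# Hourly proration: a metric or alarm that exists for part of the month
# is billed for that fraction of hours, per the formula in the text.

def prorated(monthly_rate, hours_active, hours_in_month=730):
    return monthly_rate * hours_active / hours_in_month

# A $0.30/month custom metric created for a 48-hour load test
print(round(prorated(0.30, 48), 4))  # 0.0197
```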

&lt;p&gt;&lt;strong&gt;Infrequent Access log class cuts ingestion cost in half.&lt;/strong&gt; At $0.25/GB instead of $0.50/GB, it is a meaningful savings for log groups you do not need full features on. Logs Insights queries still work on Infrequent Access log groups. The trade-off is losing Live Tail, metric extraction and filters, alarming from logs, and data protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vended logs have hidden downstream costs.&lt;/strong&gt; The delivery rate is only part of the bill. VPC Flow Logs delivered to CloudWatch also incur standard log storage charges ($0.03/GB-month) and query charges if you run Insights against them. Routing vended logs to S3 instead avoids CloudWatch storage but still incurs S3 costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced features always generate charges in the core buckets.&lt;/strong&gt; Every advanced feature — Container Insights, Application Signals, Synthetics, RUM — produces metrics, logs, or API calls billed at core-bucket rates on top of its own charges. Estimating the cost of an advanced feature without accounting for its downstream impact on logs and metrics will undercount the real spend by 20-40%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;CloudWatch pricing is not a single rate card. It is a collection of independent billing systems — 7+ feature buckets, per-region free tiers that do not pool, cross-bucket cost generation, and non-linear multipliers from dimensions and alarm types.&lt;/p&gt;

&lt;p&gt;Understanding the mechanics is step one. Step two is doing something about it — finding your biggest cost drivers, applying targeted optimizations, verifying the savings, and deciding whether CloudWatch's pricing model is structurally compatible with your monitoring needs.&lt;/p&gt;

&lt;p&gt;That is what &lt;a href="https://signoz.io/guides/cloudwatch-cost-optimization/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; covers: a practical cost optimization playbook with bucket-specific moves, verification loops, and a framework for evaluating whether a unified observability platform like &lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; — with simpler, usage-based pricing — is the better path forward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/guides/cloudwatch-cost-optimization/" rel="noopener noreferrer"&gt;Next in "CloudWatch Pricing &amp;amp; Cost Optimization Series" (Part 2 of 2)\&lt;br&gt;
\&lt;br&gt;
&lt;strong&gt;CloudWatch Cost Optimization Playbook [Part 2]&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudcomputing</category>
      <category>monitoring</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>CloudWatch vs CloudTrail - What's the Difference and When to Use Each</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Mon, 23 Feb 2026 08:41:28 +0000</pubDate>
      <link>https://dev.to/signoz/cloudwatch-vs-cloudtrail-whats-the-difference-and-when-to-use-each-4n20</link>
      <guid>https://dev.to/signoz/cloudwatch-vs-cloudtrail-whats-the-difference-and-when-to-use-each-4n20</guid>
      <description>&lt;p&gt;CloudWatch and CloudTrail are two AWS services that sound similar but solve completely different problems. CloudWatch monitors the health and performance of your AWS resources, while CloudTrail records who did what in your AWS account and when.&lt;/p&gt;

&lt;p&gt;If your Lambda function starts throwing errors, CloudWatch tells you about it. If someone changed your S3 bucket's encryption settings, CloudTrail tells you who made that change, from which IP address, and at what time.&lt;/p&gt;

&lt;p&gt;This guide breaks down what each service does, shows you real console walkthroughs, and explains when to use one, the other, or both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon CloudWatch?
&lt;/h2&gt;

&lt;p&gt;Amazon CloudWatch is a monitoring and observability service for AWS resources and the applications running on them. It collects metrics, aggregates logs, and triggers alarms based on thresholds you define.&lt;/p&gt;

&lt;p&gt;CloudWatch answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is my EC2 instance running out of memory?&lt;/li&gt;
&lt;li&gt;Why is my Lambda function timing out?&lt;/li&gt;
&lt;li&gt;How many 5xx errors did my API Gateway return in the last hour?&lt;/li&gt;
&lt;li&gt;Should my auto-scaling group add more instances?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resource Monitoring with CloudWatch Metrics
&lt;/h3&gt;

&lt;p&gt;CloudWatch automatically collects performance metrics from AWS services: EC2 instances report CPU utilization, network traffic, and disk I/O; Lambda functions surface invocation count, duration, and error rate; and RDS databases contribute connection count, read/write latency, and storage usage to the same metrics pipeline.&lt;/p&gt;

&lt;p&gt;You can also push &lt;a href="https://signoz.io/blog/cloudwatch-metrics/" rel="noopener noreferrer"&gt;custom metrics&lt;/a&gt; from your applications for business-specific measurements like order processing time or queue depth.&lt;/p&gt;
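
&lt;p&gt;As a rough sketch of what pushing such a metric looks like, the helper below builds a payload for boto3's &lt;code&gt;put_metric_data&lt;/code&gt;. The namespace and metric name are invented for illustration, and the actual publish call is left commented out because it needs AWS credentials:&lt;/p&gt;

```python
from datetime import datetime, timezone

def order_processing_metric(seconds):
    """Build a PutMetricData payload for a business metric.
    The namespace and metric name are invented examples."""
    return {
        "Namespace": "MyApp/Orders",
        "MetricData": [{
            "MetricName": "OrderProcessingTime",
            "Value": seconds,
            "Unit": "Seconds",
            "Timestamp": datetime.now(timezone.utc),
        }],
    }

payload = order_processing_metric(1.8)
# To actually publish (needs AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**payload)
```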

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ukk4tl2yevc6zezvvy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ukk4tl2yevc6zezvvy.webp" alt="EC2 instance Monitoring tab in the AWS Console showing default CloudWatch metrics like CPU utilization and network traffic" width="800" height="492"&gt;&lt;/a&gt;&lt;em&gt;EC2 instance Monitoring tab showing default CloudWatch metrics including CPU utilization, network I/O, and credit usage&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized Logging with CloudWatch Logs
&lt;/h3&gt;

&lt;p&gt;Beyond metrics, CloudWatch Logs centralizes log output from your workloads. EC2 instances, Lambda functions, ECS containers, and API Gateway all route logs into CloudWatch Log Groups, where each group contains log streams organized by source.&lt;/p&gt;

&lt;p&gt;For example, a Lambda function creates a new log stream for each execution environment. An EC2 instance running the CloudWatch agent sends application logs to a log group you configure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9ug7ibdszqm0700482u.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9ug7ibdszqm0700482u.webp" alt="CloudWatch Log Groups page in AWS Console showing log groups from Lambda and EC2" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;CloudWatch Log Groups showing log data flowing in from Lambda functions and EC2 instances&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Inside each log group, you can search and filter log events, set up metric filters to extract numeric values from log lines, and create subscription filters to stream logs to other destinations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvwckxzvwyramozn2tmh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvwckxzvwyramozn2tmh.webp" alt="CloudWatch Logs page showing log events for a selected log group" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;Viewing individual log events inside a CloudWatch log group&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Threshold-Based Alerting with CloudWatch Alarms
&lt;/h3&gt;

&lt;p&gt;With metrics and logs flowing in, the next step is acting on them. CloudWatch Alarms let you set thresholds on any metric and trigger actions when those thresholds are breached, whether that is sending a notification through SNS, triggering an Auto Scaling action, or invoking a Lambda function.&lt;/p&gt;

&lt;p&gt;Each alarm has three states (OK, ALARM, and INSUFFICIENT_DATA), and you configure the evaluation period, comparison operator, and threshold value. For a full breakdown of alarm types and their costs, see the &lt;a href="https://signoz.io/guides/cloudwatch-pricing/" rel="noopener noreferrer"&gt;CloudWatch pricing guide&lt;/a&gt;. For example, you might create an alarm that fires when EC2 CPU utilization exceeds 80% for three consecutive 5-minute periods, sending a message to your on-call channel when it transitions to ALARM state.&lt;/p&gt;
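
&lt;p&gt;Expressed as API parameters, that example alarm looks roughly like this. The alarm name, instance ID, and SNS topic ARN are placeholders, and the boto3 call is commented out since it needs credentials:&lt;/p&gt;

```python
def cpu_alarm_params(instance_id, sns_topic_arn):
    """Parameters for the alarm described above: EC2 CPU over 80%
    for three consecutive 5-minute periods. Names are illustrative."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # 5-minute evaluation periods
        "EvaluationPeriods": 3,    # three consecutive breaches
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = cpu_alarm_params("i-0123456789abcdef0",
                          "arn:aws:sns:us-east-1:123456789012:on-call")
# To create it (needs AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```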

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa69yc108z82olrnbai0x.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa69yc108z82olrnbai0x.webp" alt="CloudWatch Create Alarm page showing metric selection, threshold configuration, and alarm conditions for EC2 CPU utilization" width="800" height="449"&gt;&lt;/a&gt;&lt;em&gt;Creating a CloudWatch Alarm with metric selection, threshold configuration, and evaluation conditions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The overall flow in CloudWatch follows a pattern: metrics arrive from your resources, you visualize them in dashboards, set alarms on critical thresholds, and respond when something degrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS CloudTrail?
&lt;/h2&gt;

&lt;p&gt;AWS CloudTrail is a governance and auditing service that records API activity across your AWS account. Nearly every action taken through the AWS Console, CLI, SDKs, or another AWS service generates an event in CloudTrail.&lt;/p&gt;

&lt;p&gt;CloudTrail answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who deleted that S3 bucket?&lt;/li&gt;
&lt;li&gt;When were this IAM role's permissions changed?&lt;/li&gt;
&lt;li&gt;Which IP address was used to modify the security group?&lt;/li&gt;
&lt;li&gt;Did anyone access this KMS key in the last 30 days?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auditing Account Activity with CloudTrail Event History
&lt;/h3&gt;

&lt;p&gt;By default, CloudTrail retains 90 days of management event history at no cost. Management events are control-plane operations like creating an EC2 instance, modifying an S3 bucket policy, or changing an IAM role. The event history table surfaces the fields that matter for investigations, including event name, timestamp, user name, event source, and affected resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgtsyiygi6usa5y6jdp2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgtsyiygi6usa5y6jdp2.webp" alt="CloudTrail Event History table showing event name, time, user, source, and resources" width="800" height="453"&gt;&lt;/a&gt;&lt;em&gt;CloudTrail Event History showing recent API activity with columns for actor attribution: who did what, when, and to which resource&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyzing CloudTrail Event Details
&lt;/h3&gt;

&lt;p&gt;Drilling into any event reveals a JSON record with detailed attribution fields that map directly to the forensic questions you ask during an investigation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;CloudTrail Field&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Who&lt;/strong&gt; did it?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;userIdentity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;arn:aws:iam::123456789012:user/admin&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;What&lt;/strong&gt; action?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;eventName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PutBucketEncryption&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;When&lt;/strong&gt; did it happen?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;eventTime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-02-11T09:15:32Z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;From where?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sourceIPAddress&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;203.0.113.45&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Which service?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;eventSource&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3.amazonaws.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Which region?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;awsRegion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us-east-1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
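
&lt;p&gt;Because every event is a JSON record with these fields, pulling the attribution out programmatically is straightforward. A minimal sketch, where the sample record reuses the example values from the table (real records carry many more fields):&lt;/p&gt;

```python
def attribution(event):
    """Extract the forensic fields from a CloudTrail event record (dict).
    Field names follow the CloudTrail event schema."""
    return {
        "who": event.get("userIdentity", {}).get("arn"),
        "what": event.get("eventName"),
        "when": event.get("eventTime"),
        "from_where": event.get("sourceIPAddress"),
        "service": event.get("eventSource"),
        "region": event.get("awsRegion"),
    }

sample = {  # values mirror the table above
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/admin"},
    "eventName": "PutBucketEncryption",
    "eventTime": "2026-02-11T09:15:32Z",
    "sourceIPAddress": "203.0.113.45",
    "eventSource": "s3.amazonaws.com",
    "awsRegion": "us-east-1",
}
print(attribution(sample)["what"])  # → PutBucketEncryption
```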

&lt;p&gt;Here is a real event detail page from the AWS Console, showing a &lt;code&gt;PutBucketEncryption&lt;/code&gt; action with the full event record:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6qilggxcmui1dg1jh9f.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6qilggxcmui1dg1jh9f.webp" alt="CloudTrail event details for a PutBucketEncryption API call showing user identity, source IP, and full event JSON" width="800" height="434"&gt;&lt;/a&gt;&lt;em&gt;CloudTrail event detail for a PutBucketEncryption action, showing identity, source IP, region, and event JSON&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudTrail Trails and Long-Term Retention
&lt;/h3&gt;

&lt;p&gt;The 90-day event history covers many investigation needs, but compliance and long-term auditing often require more. For retention beyond that window, you create a Trail that delivers events to an S3 bucket (and optionally to CloudWatch Logs), or use CloudTrail Lake, whose one-year extendable retention option can keep event data for up to 10 years (the seven-year retention option keeps it for 7).&lt;/p&gt;

&lt;p&gt;There are four types of events you can capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Management events&lt;/strong&gt;: Control-plane operations (creating, deleting, or modifying resources). Enabled by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data events&lt;/strong&gt;: Data-plane operations (S3 object-level actions like &lt;code&gt;GetObject&lt;/code&gt; and &lt;code&gt;PutObject&lt;/code&gt;, Lambda invocations). Must be enabled explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network activity events&lt;/strong&gt;: VPC-related network actions. Also require explicit configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insights events&lt;/strong&gt;: Unusual API call rates or error rates detected by CloudTrail Insights. Must be enabled on a trail.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CloudWatch vs CloudTrail: Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Amazon CloudWatch&lt;/th&gt;
&lt;th&gt;AWS CloudTrail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring and observability&lt;/td&gt;
&lt;td&gt;Governance, compliance, and auditing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics, logs, alarms, events&lt;/td&gt;
&lt;td&gt;API event records (management, data, network)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key question answered&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What is failing or degrading?"&lt;/td&gt;
&lt;td&gt;"Who changed what, when, and from where?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary users&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DevOps, SRE, platform teams&lt;/td&gt;
&lt;td&gt;Security, compliance, governance teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical workflows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dashboards, alerting, log search, troubleshooting&lt;/td&gt;
&lt;td&gt;Event history, trail investigation, audit evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics: 15 months (varies by resolution). Logs: indefinite (you pay for storage)&lt;/td&gt;
&lt;td&gt;90-day event history (free). Trails/Lake for longer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, real-time or near real-time&lt;/td&gt;
&lt;td&gt;Near real-time: often within minutes (average ~5 minutes), but not guaranteed and can take longer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in dashboards and graphs&lt;/td&gt;
&lt;td&gt;Limited, often paired with Athena or third-party tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native alarms with SNS, Auto Scaling, Lambda actions&lt;/td&gt;
&lt;td&gt;Not primarily for alerting, but can trigger via EventBridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Based on metrics, logs volume, alarms, and dashboards&lt;/td&gt;
&lt;td&gt;Based on event volume, trail storage, and Lake queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coverage scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Individual resource performance&lt;/td&gt;
&lt;td&gt;Account-wide API activity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Common integrations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SNS, Auto Scaling, Lambda, EventBridge&lt;/td&gt;
&lt;td&gt;S3, CloudWatch Logs, IAM, AWS Config, EventBridge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How CloudWatch and CloudTrail Work Together: A Practical Scenario
&lt;/h2&gt;

&lt;p&gt;Consider this scenario: your team gets paged because a Lambda function's error rate spiked at 9:15 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: CloudWatch tells you something is wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch Alarms fire because the Lambda error metric exceeded the configured threshold. You open CloudWatch Logs and see that the function is failing with &lt;code&gt;AccessDeniedException&lt;/code&gt; when trying to read from an S3 bucket.&lt;/p&gt;

&lt;p&gt;At this point, you know the symptom: the function lost access to S3. But CloudWatch cannot tell you why the permissions changed or who changed them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: CloudTrail tells you what happened.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You open CloudTrail Event History and filter by the S3 bucket name. At 9:10 AM, five minutes before the errors started, you see a &lt;code&gt;PutBucketPolicy&lt;/code&gt; event. The &lt;code&gt;userIdentity&lt;/code&gt; field shows it was a team member's IAM user. The &lt;code&gt;sourceIPAddress&lt;/code&gt; shows the request came from the office VPN.&lt;/p&gt;

&lt;p&gt;Now you have the full picture: a bucket policy change at 9:10 AM removed the Lambda function's read access, causing the errors that CloudWatch detected at 9:15 AM.&lt;/p&gt;
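
&lt;p&gt;This narrowing step can also be scripted once events are fetched (for example via CloudTrail's &lt;code&gt;LookupEvents&lt;/code&gt; API). Below is a hand-rolled filter over plain dicts, assuming the record shape shown; the bucket name and timestamps are taken from the scenario above:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def change_events_before(events, resource, incident_time, lookback_minutes=30):
    """Filter already-fetched CloudTrail event dicts to those that
    touched a given resource in the window before an incident."""
    start = incident_time - timedelta(minutes=lookback_minutes)
    out = []
    for ev in events:
        when = datetime.fromisoformat(ev["eventTime"].replace("Z", "+00:00"))
        touched = any(r.get("resourceName") == resource
                      for r in ev.get("resources", []))
        if touched and start <= when <= incident_time:
            out.append(ev["eventName"])
    return out

events = [  # shape mirrors CloudTrail records; values are from the scenario
    {"eventName": "PutBucketPolicy",
     "eventTime": "2026-02-23T09:10:00Z",
     "resources": [{"resourceName": "orders-bucket"}]},
    {"eventName": "RunInstances",
     "eventTime": "2026-02-23T08:00:00Z",
     "resources": []},
]
incident = datetime(2026, 2, 23, 9, 15, tzinfo=timezone.utc)
print(change_events_before(events, "orders-bucket", incident))
# → ['PutBucketPolicy']
```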

&lt;p&gt;This is why most teams need both services. CloudWatch catches the operational impact. CloudTrail provides the accountability trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SigNoz Fits in Your Observability Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5pgyt8oj39m0op93o7l.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5pgyt8oj39m0op93o7l.webp" alt="SigNoz Log Management Dashboard" width="800" height="494"&gt;&lt;/a&gt;&lt;em&gt;SigNoz Log Management Dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch works well for collecting AWS telemetry, but investigating incidents with it means &lt;a href="https://signoz.io/blog/6-silent-traps-inside-cloudWatch-that-can-hurt-your-observability/" rel="noopener noreferrer"&gt;jumping between four separate consoles&lt;/a&gt;: Logs Insights for log queries, the Metrics console for dashboards, X-Ray for traces, and Application Signals for service health. Each uses different query syntax, and clicking from metrics to logs resets your time filters. Finding logs for a trace requires copying trace IDs from X-Ray and running a separate search in CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is an all-in-one observability platform built natively on &lt;a href="https://signoz.io/opentelemetry/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; that eliminates this context-switching. Metrics, traces, and logs live in a single interface, so you can jump from a latency spike to the responsible trace to the related logs without leaving one screen.&lt;/p&gt;

&lt;p&gt;CloudWatch's billing adds another friction point: log ingestion at $0.50/GB, custom metrics at $0.30/metric/month per dimension combination, Logs Insights charges of $0.005/GB scanned per query, and additional &lt;code&gt;GetMetricData&lt;/code&gt; API charges when third-party tools poll metrics. SigNoz uses simple, usage-based pricing ($0.3/GB for logs and traces, $0.1/million metric samples) with no per-host charges or per-query fees.&lt;/p&gt;
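
&lt;p&gt;A quick back-of-envelope comparison using those list rates. This deliberately ignores free tiers, storage, and retention charges, so treat the numbers as a floor rather than a bill:&lt;/p&gt;

```python
CW_LOG_INGEST = 0.50   # CloudWatch log ingestion, $/GB
CW_QUERY_SCAN = 0.005  # Logs Insights, $/GB scanned per query
SIGNOZ_INGEST = 0.30   # SigNoz logs/traces, $/GB, no per-query fee

def cloudwatch_logs_monthly(gb_ingested, gb_scanned):
    """Ingestion plus query-scan charges for one month."""
    return gb_ingested * CW_LOG_INGEST + gb_scanned * CW_QUERY_SCAN

def signoz_logs_monthly(gb_ingested):
    """Flat usage-based ingestion, no scan fees."""
    return gb_ingested * SIGNOZ_INGEST

# 500 GB/month ingested, 2 TB scanned by investigation queries
print(cloudwatch_logs_monthly(500, 2000))  # → 260.0
print(signoz_logs_monthly(500))            # → 150.0
```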

&lt;p&gt;For AWS-specific integrations, you can &lt;a href="https://signoz.io/docs/userguide/send-cloudwatch-logs-to-signoz/" rel="noopener noreferrer"&gt;send CloudWatch logs to SigNoz&lt;/a&gt;, use &lt;a href="https://signoz.io/docs/integrations/aws/one-click-aws-integrations/" rel="noopener noreferrer"&gt;one-click AWS integrations&lt;/a&gt;, and set up dedicated monitoring for &lt;a href="https://signoz.io/docs/aws-monitoring/ec2/" rel="noopener noreferrer"&gt;EC2&lt;/a&gt;, &lt;a href="https://signoz.io/docs/aws-monitoring/lambda/lambda-metrics/" rel="noopener noreferrer"&gt;Lambda metrics&lt;/a&gt;, &lt;a href="https://signoz.io/docs/aws-monitoring/lambda/lambda-logs/" rel="noopener noreferrer"&gt;Lambda logs&lt;/a&gt;, and &lt;a href="https://signoz.io/docs/aws-monitoring/lambda/lambda-traces/" rel="noopener noreferrer"&gt;Lambda traces&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What SigNoz Does Not Replace
&lt;/h3&gt;

&lt;p&gt;SigNoz does not replace CloudTrail, which remains the authoritative record for AWS account and API activity required for governance and compliance. No external observability platform substitutes that audit trail.&lt;/p&gt;

&lt;p&gt;When a Lambda error spike appears in SigNoz, engineers verify runtime impact there, then check CloudTrail to confirm whether a recent IAM, S3, or API configuration change triggered it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started with SigNoz
&lt;/h2&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either the &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is CloudTrail needed if we already use CloudWatch?
&lt;/h3&gt;

&lt;p&gt;Yes. CloudWatch tracks resource performance and application logs. CloudTrail records who made API calls and what changes they made. If someone deletes an S3 bucket or modifies an IAM role, CloudWatch will not have a record of who did it. You need CloudTrail for that accountability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can CloudWatch tell us who changed IAM or S3 settings?
&lt;/h3&gt;

&lt;p&gt;No. CloudWatch monitors metrics and logs from your running workloads. It does not record control-plane API activity like IAM policy changes or S3 bucket configuration updates. That information is only available in CloudTrail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does sending CloudWatch data to SigNoz remove the need for CloudTrail?
&lt;/h3&gt;

&lt;p&gt;No. Forwarding CloudWatch metrics and logs to SigNoz gives you a unified observability workflow for operational telemetry. But CloudTrail's audit trail of API actions is a separate concern entirely. SigNoz complements CloudWatch for debugging and investigation, while CloudTrail remains your AWS-native record for governance and compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CloudWatch and CloudTrail serve different purposes that become clear once you use both in a real incident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; = runtime behavior, resource health, alerting, and operational logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail&lt;/strong&gt; = API activity history, identity attribution, audit evidence, and compliance records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SigNoz&lt;/strong&gt; = unified observability workspace that brings metrics, traces, and logs together across your stack with an OpenTelemetry-native approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams looking to move beyond CloudWatch's fragmented console experience and unpredictable billing, SigNoz handles the observability side while CloudTrail continues to serve as your AWS-native audit trail.&lt;/p&gt;




&lt;p&gt;Hope we answered all your questions regarding CloudWatch vs CloudTrail. If you have more questions, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from observability nerds at SigNoz and get open source, OpenTelemetry, and devtool-building stories straight to your inbox.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding OpenTelemetry - Trace ID vs. Span ID</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Wed, 11 Feb 2026 05:45:26 +0000</pubDate>
      <link>https://dev.to/signoz/understanding-opentelemetry-trace-id-vs-span-id-54dh</link>
      <guid>https://dev.to/signoz/understanding-opentelemetry-trace-id-vs-span-id-54dh</guid>
      <description>&lt;p&gt;Distributed systems pose unique challenges for developers and operations teams. How do you track a request as it travels through multiple services? Enter OpenTelemetry — a powerful observability framework that's changing the game. At its core, OpenTelemetry uses two key concepts: Trace ID and &lt;a href="https://signoz.io/comparisons/opentelemetry-trace-id-vs-span-id/" rel="noopener noreferrer"&gt;Span ID&lt;/a&gt;. But what exactly are these, and how do they differ?&lt;/p&gt;

&lt;p&gt;In this article, you'll learn about the fundamentals of tracing in OpenTelemetry, including a deep dive into Trace IDs and Span IDs, how they help in tracking requests across distributed services, and how to implement them effectively. Additionally, we'll cover best practices for setting up and managing tracing within your application, including strategies to minimize performance overhead while maximizing observability insights. Whether you're new to OpenTelemetry or looking to refine your observability practices, this guide provides a comprehensive roadmap to successful tracing implementation in distributed systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is OpenTelemetry and Why It Matters
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is an open-source observability framework designed to provide visibility into modern distributed systems. It standardizes how data is collected, processed, and exported to monitor application performance.&lt;/p&gt;

&lt;p&gt;In today's world, applications are becoming more complex, often relying on microservices and cloud-based architectures. Monitoring such distributed systems requires capturing detailed information about their operations. OpenTelemetry offers a unified approach to collecting observability data, making it essential for teams managing these complex systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance of OpenTelemetry in Modern Distributed Systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Standardization: OpenTelemetry consistently gathers telemetry data across various services, languages, and environments.&lt;/li&gt;
&lt;li&gt;Interoperability: It integrates seamlessly with different observability backends such as Prometheus and Jaeger.&lt;/li&gt;
&lt;li&gt;End-to-End Visibility: OpenTelemetry helps track requests across microservices, providing a clear understanding of system behavior.&lt;/li&gt;
&lt;li&gt;Vendor-Neutral: By being an open standard, it prevents lock-in with specific monitoring tools, offering flexibility in how observability data is consumed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Components of OpenTelemetry
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Traces: These capture the lifecycle of a request, showing how it moves across multiple services. For example, a user request starting from the web frontend and moving through several microservices can be traced end-to-end.&lt;/li&gt;
&lt;li&gt;Metrics: Metrics help measure application performance by providing numerical data like request counts, error rates, and resource utilization.&lt;/li&gt;
&lt;li&gt;Logs: Logs offer detailed records of events occurring within an application, helping diagnose errors or unusual behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How OpenTelemetry Enhances Application Performance Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Detailed Insights: Traces, metrics, and logs provide comprehensive insights into an application’s performance, helping identify bottlenecks, failures, or resource constraints.&lt;/li&gt;
&lt;li&gt;Reduced Downtime: With clear observability, issues can be detected and resolved quickly, minimizing application downtime.&lt;/li&gt;
&lt;li&gt;Performance Optimization: OpenTelemetry helps monitor system performance in real-time, ensuring that applications run smoothly and efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, consider an e-commerce platform that relies on multiple microservices. If one service slows down, it could affect the overall user experience. By using OpenTelemetry, this issue can be quickly identified through traces, metrics, and logs, allowing developers to resolve the problem before it escalates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demystifying Traces in OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;A trace is a critical part of observability in distributed systems. It helps track and monitor the flow of requests across various services. Distributed systems often involve multiple components communicating with each other, and understanding the performance of these interactions is essential. Traces capture this by recording the path that a request takes through a system.&lt;/p&gt;

&lt;p&gt;Here’s how traces represent the journey of a request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracking a request: A trace comprises multiple spans, each representing a unit of work performed within the system. The trace follows the request from its starting point through different services and components, highlighting its journey.&lt;/li&gt;
&lt;li&gt;Identifying bottlenecks: Traces provide visibility into how long each part of the system takes to process the request. This helps identify bottlenecks or delays at any point in the request’s journey.&lt;/li&gt;
&lt;li&gt;Visualizing dependencies: Traces link various services involved in fulfilling a request, making it easier to visualize dependencies between those services and how they interact during the request lifecycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A trace typically consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Root span: This represents the initial request that triggers the series of operations.&lt;/li&gt;
&lt;li&gt;Child spans: These represent the subsequent operations or processes triggered by the root request. Each child span logs details, such as duration, metadata, and errors if any.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Example
&lt;/h3&gt;

&lt;p&gt;In an e-commerce application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A customer places an order by interacting with the front-end service.&lt;/li&gt;
&lt;li&gt;The frontend service makes a call to the order-processing service (root span).&lt;/li&gt;
&lt;li&gt;The order-processing service calls the inventory and payment services (child spans).&lt;/li&gt;
&lt;li&gt;Each service logs its operation as part of the trace, giving a complete view of the request's path through the application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example, a trace helps developers or operators monitor the time taken at each step, allowing for performance optimization and issue resolution.&lt;/p&gt;
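&lt;p&gt;The timing analysis described above boils down to comparing span durations. As a plain-data sketch (the span names and timings below are invented for illustration):&lt;/p&gt;

```javascript
// Plain-data sketch of the e-commerce trace above: find which downstream
// call dominates the request time. Names and timings are invented.
const childSpans = [
  { name: 'inventory', startMs: 10, endMs: 40 },
  { name: 'payment',   startMs: 45, endMs: 170 },
];

// Return the span with the longest duration (endMs minus startMs).
function slowest(spans) {
  return spans.reduce((worst, s) =>
    (s.endMs - s.startMs) > (worst.endMs - worst.startMs) ? s : worst);
}

console.log(slowest(childSpans).name); // "payment" (125 ms vs 30 ms)
```

&lt;p&gt;A trace backend performs this comparison visually: the longest bars in a trace waterfall are the first places to look.&lt;/p&gt;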

&lt;h2&gt;
  
  
  Understanding the Trace ID in OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;A Trace ID is a key element in &lt;a href="https://signoz.io/blog/distributed-tracing/" rel="noopener noreferrer"&gt;distributed tracing&lt;/a&gt; within the OpenTelemetry framework, serving as a unique identifier for a single request or transaction as it propagates through various services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Trace ID is a 16-byte value, usually rendered as a 32-character hexadecimal string, that identifies and links all the spans (units of work) a request generates.&lt;/li&gt;
&lt;li&gt;Its purpose is to allow easy tracing and monitoring of a request as it flows through multiple microservices or distributed systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Trace IDs are generated and their uniqueness:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trace IDs are automatically generated when a new trace is initiated, ensuring they are globally unique.&lt;/li&gt;
&lt;li&gt;Uniqueness is achieved through strong random number generation; with 128 bits of randomness, the chance of two traces colliding is negligible. Some frameworks also incorporate timestamps into ID generation to further avoid collisions.&lt;/li&gt;
&lt;li&gt;Every span within a trace is linked to the same Trace ID, allowing for correlation across services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The role of Trace ID in correlating spans:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each service processing a request contributes a span, but all spans share the same Trace ID. This allows you to see the entire lifecycle of a request across services.&lt;/li&gt;
&lt;li&gt;For example, in a &lt;a href="https://signoz.io/blog/distributed-tracing-in-microservices/" rel="noopener noreferrer"&gt;distributed microservices architecture&lt;/a&gt;, a request might pass through multiple services like authentication, order processing, and payment. Each of these services generates a span, and by using the Trace ID, it's easy to correlate these spans and understand the flow of the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best practices for working with Trace IDs:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Propagate Trace IDs across all services involved in processing a request, ensuring no service operates in isolation.&lt;/li&gt;
&lt;li&gt;Log Trace IDs as part of your observability strategy. Including the Trace ID in logs, metrics, and errors makes it easier to trace issues in distributed systems.&lt;/li&gt;
&lt;li&gt;Monitor Trace ID length and structure to ensure compliance with OpenTelemetry standards. Misformatted IDs can result in failed trace correlations.&lt;/li&gt;
&lt;li&gt;Avoid manually generating or manipulating Trace IDs. Let the tracing library handle the generation to prevent errors and inconsistencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, OpenTelemetry libraries and instrumentation handle much of the complexity of generating and managing Trace IDs. However, ensuring consistent propagation and inclusion in logs can greatly enhance observability across distributed systems. For example, when working with frameworks like Spring Boot or Express.js, the Trace ID can be automatically injected into logging outputs, allowing developers to trace a request even if an error occurs deep within the system.&lt;/p&gt;
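&lt;p&gt;The log-injection idea can be sketched as a small structured-logging helper. In a real service the ID would come from the active OpenTelemetry span; here it is passed in explicitly to keep the sketch self-contained, and the field names are illustrative:&lt;/p&gt;

```javascript
// Illustrative pattern: attach the current trace ID to every log line so
// logs and traces can be correlated later. Field names are illustrative.
function logWithTrace(traceId, level, message) {
  // A structured line like this lets you filter all logs for one request.
  return JSON.stringify({ trace_id: traceId, level: level, message: message });
}

console.log(logWithTrace('4bf92f3577b34da6a3ce929d0e0e4736', 'error', 'payment failed'));
```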

&lt;h2&gt;
  
  
  Deep Dive into Spans and Span IDs
&lt;/h2&gt;

&lt;p&gt;A span is the fundamental unit of work in distributed tracing. Each span represents a specific operation or task performed within a larger process, and it captures crucial information such as start time, end time, and the operations executed within that task. Multiple spans are combined to create traces, which represent the complete workflow of a request as it moves through different services or systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anatomy of a Span:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A span records essential details of an operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start time: Marks the time when the operation begins.&lt;/li&gt;
&lt;li&gt;End time: Captures when the operation is completed.&lt;/li&gt;
&lt;li&gt;Operation name: Describes the task being performed within the span.&lt;/li&gt;
&lt;li&gt;Tags and metadata: Additional information can be added, such as &lt;a href="https://signoz.io/guides/error-log/" rel="noopener noreferrer"&gt;error logs&lt;/a&gt; or other relevant data for context.&lt;/li&gt;
&lt;li&gt;Parent-child relationship: Spans may have parent spans, representing hierarchical relationships in complex workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;Span IDs and Trace IDs:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each span has a unique identifier known as a Span ID. These Span IDs are essential for distinguishing between operations within the same trace. Trace IDs tie multiple spans together, indicating they are part of the same overarching request. Span IDs are typically generated as random 64-bit integers, ensuring that each span can be uniquely identified.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relationship Between Span IDs and Trace IDs:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Span IDs are linked to a common Trace ID during the same request or transaction. The Trace ID is a shared identifier that ties together all spans generated by that request, showing the entire journey of the request across services. Span IDs are individual pieces within that larger trace, representing specific operations.&lt;/p&gt;
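&lt;p&gt;That relationship can be pictured as plain data (a simplified model for illustration, not the OpenTelemetry wire format):&lt;/p&gt;

```javascript
// Simplified model of one trace: every span belongs to the same traceId,
// but each span has its own spanId and an optional parentSpanId.
const exampleTrace = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spans: [
    { spanId: '00f067aa0b1a5b9a', parentSpanId: null,               name: 'checkout' },
    { spanId: 'a1b2c3d4e5f60718', parentSpanId: '00f067aa0b1a5b9a', name: 'inventory' },
    { spanId: '1122334455667788', parentSpanId: '00f067aa0b1a5b9a', name: 'payment' },
  ],
};

// Exactly one span has no parent: the root span of the trace.
const roots = exampleTrace.spans.filter(s => s.parentSpanId === null);
console.log(roots.length, 'root span:', roots[0].name);
```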

&lt;h3&gt;
  
  
  Practical Examples
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Web Request to a Microservice:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A span might represent the time to process a user request to a web service. The span could start when the request reaches the web server and end when the response is sent back to the user. During that time, the span would capture the operations performed, such as database queries or interactions with other microservices.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database Query Operation:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider a scenario where a web application makes a call to a database. A span would represent this database call, capturing the start and end time of the query execution. In this case, the span might contain metadata about the query, such as the SQL statement or details about any errors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service-to-Service Communication:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a microservices architecture, when one service calls another, a span can represent the duration of that request. This allows tracking of how long it took for one service to send a request and receive a response from another service, including any errors or delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Span Attributes and Events
&lt;/h3&gt;

&lt;p&gt;Span attributes and events are key elements in distributed tracing, providing deeper visibility and context. Attributes describe the metadata associated with a span, while events track significant occurrences within the span's lifecycle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding Span Attributes and Their Importance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Span attributes are key-value pairs that provide important context for a span, such as HTTP methods, user IDs, or transaction identifiers. These attributes are used for filtering and querying traces, making it easier to pinpoint specific actions or problems. For example, an HTTP request span might have attributes like the HTTP method, URL, and user ID, which help identify and filter traces related to a particular user request.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to Use Span Events to Mark Significant Occurrences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Span events are timestamped markers within a span that capture key moments, such as errors, retries, or transitions between stages. These events can indicate specific occurrences like the start of a database query or when a response is sent. For example, in an e-commerce application, events might be added to a span to mark when an order is processed, when an error occurs during payment, or when the order is completed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Best Practices for Adding Context to Spans&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Include Relevant Metadata: When adding attributes to spans, focus on data that provides meaningful context. For instance, include attributes like service names, request paths, and user identifiers. This helps trace data be more useful for debugging and performance monitoring.&lt;/li&gt;
&lt;li&gt;Consistency is Key: Use standardized naming conventions for attributes and events across services. This ensures that traces are consistent, making it easier to analyze them collectively.&lt;/li&gt;
&lt;li&gt;Use Clear Naming: Use names that clearly describe the purpose or action of the attribute or event. For example, use &lt;code&gt;http.method&lt;/code&gt; for the HTTP request method or &lt;code&gt;request_received&lt;/code&gt; for the event marking the receipt of a request.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Examples of useful attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http.method: "POST"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http.url: "/api/orders"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;user.id: "1234"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of useful events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;request_received&lt;/code&gt;: Marked when the request is first received.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order_processed&lt;/code&gt;: Indicates when the order is fully processed.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;response_sent&lt;/code&gt;: Captures when the response is sent to the client.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;Balancing Detail and Performance in Span Creation&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Avoid Overloading with Details: While it's important to capture enough information to diagnose problems, too many attributes or events can increase the overhead of tracing. Focus on the most relevant and impactful data that will aid in troubleshooting without affecting performance.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Optimize for Performance: Keep spans lightweight for routine operations, such as basic CRUD operations. However, include more detailed information to improve observability for complex transactions or error scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;In a Node.js application, a span tracking an HTTP request might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attributes like:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http.method: "POST"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http.url: "/api/orders"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;user.id: "1234"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Events like:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;request_received&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;order_processed&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;response_sent&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
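&lt;p&gt;The data such a span ultimately carries can be sketched as a plain object (a simplified view of what a tracing backend receives, not the exact OTLP schema):&lt;/p&gt;

```javascript
// Simplified view of one HTTP-request span with the attributes and events
// listed above. This is an illustration, not the exact OTLP schema.
function makeOrderSpan(startMs) {
  return {
    name: 'POST /api/orders',
    attributes: { 'http.method': 'POST', 'http.url': '/api/orders', 'user.id': '1234' },
    events: [
      { name: 'request_received', timestampMs: startMs },
      { name: 'order_processed',  timestampMs: startMs + 42 },
      { name: 'response_sent',    timestampMs: startMs + 57 },
    ],
  };
}

const span = makeOrderSpan(Date.now());
console.log(span.attributes['http.method'], span.events.map(e => e.name).join(', '));
```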

&lt;h3&gt;
  
  
  Trace ID vs. Span ID: Key Differences and Use Cases
&lt;/h3&gt;

&lt;p&gt;Understanding the distinction between Trace ID and Span ID is essential for efficient observability in distributed tracing. Both IDs play pivotal roles in tracking requests across multiple services but serve different purposes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace ID is a unique identifier for a complete trace, representing a user request as it travels through various microservices. It helps in connecting all the spans related to a single transaction. For example, when a user makes a request to a web application, that request generates a Trace ID, which is sent with the request to different services involved in processing the transaction. Every service that processes the request adds its Span to the trace, but the Trace ID remains consistent across the services.&lt;/li&gt;
&lt;li&gt;Span ID identifies individual operations within a trace. It is a unique identifier for a single unit of work in the trace. A Span can be considered a segment of the entire trace, such as a database query or an HTTP request handled by a specific microservice. For instance, if a trace involves three microservices, each service generates a span with its unique Span ID linked to the Trace ID to indicate it belongs to the same trace.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Differences:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Trace ID&lt;/th&gt;
&lt;th&gt;Span ID&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Represents the entire flow of a request across services.&lt;/td&gt;
&lt;td&gt;Represents a single operation or unit of work within the trace.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistence&lt;/td&gt;
&lt;td&gt;Remains the same throughout the transaction.&lt;/td&gt;
&lt;td&gt;Unique to each service or operation in the transaction.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case&lt;/td&gt;
&lt;td&gt;Tracking a request’s journey through multiple services, identifying bottlenecks, and troubleshooting across the entire transaction.&lt;/td&gt;
&lt;td&gt;Measuring the latency or performance of individual operations in a service.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visibility&lt;/td&gt;
&lt;td&gt;Provides a high-level overview of the request path.&lt;/td&gt;
&lt;td&gt;Provides detailed insights into the individual steps of the request.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Working Together:
&lt;/h3&gt;

&lt;p&gt;Trace IDs and Span IDs complement each other in distributed tracing systems like OpenTelemetry or Jaeger. When a request is made, the Trace ID is generated and sent across all involved services. Each service, in turn, creates a Span to represent a particular operation. These spans are linked to the same Trace ID, allowing the full lifecycle of a request to be visualized from start to finish. This combination enables precise pinpointing of where performance issues arise within the system. For example, if a user encounters a delay when interacting with an app, traces can reveal that one specific service in the middle of the transaction is causing the delay based on the spans and their timestamps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Pitfalls and Misconceptions:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Confusing Trace ID and Span ID: It’s easy to confuse the two, as both are used to track requests. However, the Trace ID is for the overall journey, while the Span ID represents the individual segments of that journey.&lt;/li&gt;
&lt;li&gt;Overlooking Span Metrics: Focusing only on Trace IDs can lead to missing out on valuable insights from individual spans, such as latency or error rates in a specific microservice.&lt;/li&gt;
&lt;li&gt;Misuse of IDs for Global Context: Some might incorrectly assume that Trace and Span IDs are universally shared across all services. In reality, the Trace ID is shared, but Span IDs are specific to each service, providing a more granular level of tracking.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementing Trace and Span IDs with OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is an open-source framework designed for observability, enabling the collection of distributed traces, metrics, and logs. To instrument an application with OpenTelemetry, follow a structured approach to generate and propagate Trace and Span IDs effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Guide to Instrumenting Your Application with OpenTelemetry
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Set Up &lt;a href="https://signoz.io/comparisons/opentelemetry-api-vs-sdk/" rel="noopener noreferrer"&gt;OpenTelemetry SDK&lt;/a&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Start by installing the OpenTelemetry SDK for the relevant language. For instance, in JavaScript:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; @opentelemetry/api @opentelemetry/sdk-trace-base @opentelemetry/exporter-trace-otlp-http
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;Initialize the Tracer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create and configure the tracer to capture traces. Example in JavaScript:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeTracerProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-trace-base&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SimpleSpanProcessor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-trace-base&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeTracerProvider&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
 &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
 &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create Traces and Spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a trace by starting a root span. Each span represents a specific operation in the trace.
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example-tracer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example-span&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="c1"&gt;// Perform some operations here&lt;/span&gt;
 &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add Attributes to Spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add custom attributes to spans for additional context.
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt; &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;component&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;web-server&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;status&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;success&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Export Traces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Export the trace data to a backend service, such as SigNoz, Jaeger, or Zipkin.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Best Practices for Generating and Propagating Trace and Span IDs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unique Trace and Span IDs:

&lt;ul&gt;
&lt;li&gt;Ensure that each trace and span has a unique identifier. This will allow traces to be grouped together across services.&lt;/li&gt;
&lt;li&gt;For instance, OpenTelemetry generates Trace IDs as 16 random bytes and Span IDs as 8 random bytes; rely on the SDK's generator rather than hand-rolling UUID-based or sequential IDs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Propagate Context Across Boundaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use HTTP headers or messaging protocols to propagate Trace and Span IDs across different services. For HTTP, the &lt;code&gt;traceparent&lt;/code&gt; header is commonly used.
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt; &lt;span class="nx"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;
&lt;span class="nx"&gt;Host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;yourapi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;com&lt;/span&gt;
&lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;YourApp&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="nx"&gt;Accept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;application&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="nx"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Bearer&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;keys&lt;/span&gt;
&lt;span class="nx"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;application&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="nx"&gt;traceparent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nx"&gt;bf92f3577b34da6a3ce929d0e0e4736&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="nx"&gt;f067aa0b1a5b9a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;
&lt;span class="nx"&gt;tracestate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;congo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;t61rcWkgMzE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Correlation of Trace and Span IDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain correlation by ensuring the trace and span context is passed along when making requests to other services. This is vital for tracing requests that span multiple services or microservices.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Consistent Context Management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use context management libraries available in OpenTelemetry to share trace context automatically across different operations and asynchronous tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
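&lt;p&gt;The &lt;code&gt;traceparent&lt;/code&gt; header shown above follows the W3C Trace Context format: &lt;code&gt;version-traceid-parentid-flags&lt;/code&gt;. A minimal parser makes the pieces visible:&lt;/p&gt;

```javascript
// Split a W3C Trace Context 'traceparent' header into its four fields:
// version (2 hex chars), trace-id (32), parent-id (16), trace-flags (2).
function parseTraceparent(header) {
  const parts = header.split('-');
  if (parts.length !== 4) {
    throw new Error('malformed traceparent header');
  }
  return { version: parts[0], traceId: parts[1], parentId: parts[2], flags: parts[3] };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0b1a5b9a-01');
console.log(ctx.traceId); // "4bf92f3577b34da6a3ce929d0e0e4736"
```

&lt;p&gt;In practice, OpenTelemetry's propagators handle this parsing and injection automatically; the sketch only shows what travels on the wire.&lt;/p&gt;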

&lt;h3&gt;
  
  
  Ensuring Consistent Tracing Across Different Services and Languages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Standardized &lt;a href="https://signoz.io/blog/opentelemetry-context-propagation/" rel="noopener noreferrer"&gt;Context Propagation&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry provides a standard format for propagating trace context across different systems. By using this standard (like &lt;code&gt;traceparent&lt;/code&gt; header), it is easy to propagate traces across services written in different languages.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Instrument Multiple Services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each service in a distributed system should be instrumented with OpenTelemetry to ensure consistent tracing. For example, a microservice written in Go can have its own OpenTelemetry setup, but the trace and span context is propagated using standard HTTP headers when interacting with a Java-based service.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Unified Trace Management with SigNoz:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SigNoz can be used to collect, manage, and visualize trace data across multiple services, regardless of the programming language used. Integrating OpenTelemetry with SigNoz allows trace data from different services to be aggregated into a single view for easy analysis.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools and Libraries for Simplifying Trace and Span ID Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry SDKs: OpenTelemetry provides SDKs for various programming languages, including Java, JavaScript, Python, Go, and more. These SDKs make it easier to implement tracing by offering built-in support for context propagation, span creation, and trace exporting.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://signoz.io/blog/opentelemetry-collector-complete-guide/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt;: The OpenTelemetry Collector is essential for managing and exporting trace data from different services to backend systems. It can be used as a standalone service to collect, process, and export telemetry data.&lt;/li&gt;
&lt;li&gt;SigNoz: SigNoz is an open-source observability platform that integrates seamlessly with OpenTelemetry. It provides powerful visualization and monitoring tools for traces, logs, and metrics. By exporting trace data to SigNoz, traces can be visualized and correlated across multiple services, offering insights into system performance and bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Leveraging SigNoz for Advanced Trace and Span Management
&lt;/h2&gt;

&lt;p&gt;While OpenTelemetry provides the instrumentation, you need a robust backend to make sense of all this data. This is where SigNoz comes in.&lt;/p&gt;

&lt;p&gt;SigNoz is an open-source APM tool that works seamlessly with OpenTelemetry. It offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Powerful visualizations: See your traces and spans in an intuitive interface.&lt;/li&gt;
&lt;li&gt;Advanced filtering: Quickly find the traces you need.&lt;/li&gt;
&lt;li&gt;Alerts and anomaly detection: Get notified when things go wrong.&lt;/li&gt;
&lt;li&gt;Customizable dashboards: Tailor your views to your specific needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SigNoz Cloud is the easiest way to run SigNoz. &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; for a free account and get 30 days of unlimited access to all features.&lt;/p&gt;

&lt;p&gt;You can also install and self-host SigNoz yourself since it is open-source. With 24,000+ GitHub stars, &lt;a href="https://github.com/signoz/signoz" rel="noopener noreferrer"&gt;open-source SigNoz&lt;/a&gt; is loved by developers. Find the &lt;a href="https://signoz.io/docs/install/" rel="noopener noreferrer"&gt;instructions&lt;/a&gt; to self-host SigNoz.&lt;/p&gt;

&lt;p&gt;SigNoz offers both cloud and self-hosted options. The cloud option is great for teams that want a hassle-free setup, while the self-hosted version gives you complete control over your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Effective Tracing
&lt;/h2&gt;

&lt;p&gt;To get the most out of your tracing implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be consistent: Use the same naming conventions and attribute keys across services.&lt;/li&gt;
&lt;li&gt;Propagate context: Ensure trace context is passed between services, even across different technologies.&lt;/li&gt;
&lt;li&gt;Sample wisely: In high-volume systems, sample traces to reduce overhead.&lt;/li&gt;
&lt;li&gt;Use baggage: Leverage OpenTelemetry's baggage feature to pass request-scoped data along the trace.&lt;/li&gt;
&lt;li&gt;Monitor your monitoring: Keep an eye on the performance impact of your tracing.&lt;/li&gt;
&lt;/ul&gt;
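&lt;p&gt;The "sample wisely" point can be sketched as a trace-ID-based ratio sampler, similar in spirit to OpenTelemetry's &lt;code&gt;TraceIdRatioBasedSampler&lt;/code&gt; (this standalone version is illustrative, not the SDK implementation):&lt;/p&gt;

```javascript
// Deterministic head sampling: keep a fixed fraction of traces by treating
// part of the (random) trace ID as a uniform number in [0, 1).
// Illustrative sketch, not the OpenTelemetry SDK implementation.
function shouldSample(traceId, ratio) {
  // Use the last 8 hex chars (32 bits) of the trace ID.
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return ratio > bucket;
}

// Because the decision depends only on the trace ID, every service in the
// request path makes the same keep/drop choice for a given trace.
console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.1)); // true (≈0.055 falls below 10%)
```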

&lt;p&gt;A helpful tip for troubleshooting is to start with the Trace ID when debugging. This gives you an overview. Then, drill down into specific Span IDs for detailed information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Trace IDs uniquely identify a request's journey through your distributed system.&lt;/li&gt;
&lt;li&gt;Span IDs represent individual operations within a trace.&lt;/li&gt;
&lt;li&gt;OpenTelemetry provides a standardized approach to implementing tracing.&lt;/li&gt;
&lt;li&gt;Effective use of Trace and Span IDs significantly improves system observability.&lt;/li&gt;
&lt;li&gt;Tools like SigNoz can help you make sense of your tracing data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good tracing requires balance. Provide enough detail to be useful, but avoid overwhelming your system or team.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between a Trace ID and a Span ID?
&lt;/h3&gt;

&lt;p&gt;A Trace ID identifies an entire request flow across multiple services, while a Span ID identifies a specific operation within that flow. Think of a Trace ID as the title of a book and Span IDs as the chapter numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  How are Trace IDs and Span IDs generated in OpenTelemetry?
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry generates Trace IDs as random 128-bit values and Span IDs as random 64-bit values, matching the W3C Trace Context format. The exact generation method varies by SDK implementation, but the goal is always to ensure uniqueness.&lt;/p&gt;
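&lt;p&gt;In principle, generating IDs in this format takes only a secure random source, as the Python sketch below shows (real SDKs use their own &lt;code&gt;IdGenerator&lt;/code&gt; implementations; the helper names here are illustrative):&lt;/p&gt;

```python
import secrets

def new_trace_id() -> str:
    # 128 bits, rendered as 32 lowercase hex characters.
    return secrets.token_hex(16)

def new_span_id() -> str:
    # 64 bits, rendered as 16 lowercase hex characters.
    return secrets.token_hex(8)

# W3C traceparent header layout: version-traceid-spanid-flags.
print(f"traceparent: 00-{new_trace_id()}-{new_span_id()}-01")
```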

&lt;h3&gt;
  
  
  Can a single Trace ID have multiple Span IDs?
&lt;/h3&gt;

&lt;p&gt;Yes, and it usually does. A single Trace ID will have multiple Span IDs, one for each operation or step in the request flow. This allows for detailed tracking of each part of the request.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long should I retain Trace and Span ID data?
&lt;/h3&gt;

&lt;p&gt;Retention periods depend on your specific needs and regulatory requirements. Generally, keeping trace data for 7-30 days is common. For debugging recent issues, shorter retention (like 7 days) might suffice. For longer-term analysis, you can keep data for a month or more.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>monitoring</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Bringing Observability to Claude Code: OpenTelemetry in Action</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Wed, 11 Feb 2026 05:45:14 +0000</pubDate>
      <link>https://dev.to/signoz/bringing-observability-to-claude-code-opentelemetry-in-action-2ei2</link>
      <guid>https://dev.to/signoz/bringing-observability-to-claude-code-opentelemetry-in-action-2ei2</guid>
      <description>&lt;p&gt;AI coding assistants like Claude Code are becoming core parts of modern development workflows. But as with any powerful tool, the question quickly arises: &lt;em&gt;how do we measure and monitor its usage?&lt;/em&gt; Without proper visibility, it’s hard to understand adoption, performance, and the real value Claude brings to engineering teams. For leaders and platform engineers, that lack of observability can mean flying blind when it comes to understanding ROI, productivity gains, or system reliability.&lt;/p&gt;

&lt;p&gt;That’s where observability comes in. By leveraging OpenTelemetry and SigNoz, we built an observability pipeline that makes Claude Code usage measurable and actionable. From request volumes to latency metrics, everything flows into SigNoz dashboards, giving us clarity on how Claude is shaping developer workflows and helping us spot issues before they snowball.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through how we connected Claude Code’s monitoring hooks with OpenTelemetry and exported everything into SigNoz. The result: a streamlined, data-driven way to shine a light on how developers actually interact with Claude Code and to help teams make smarter, evidence-backed decisions about scaling AI-assisted coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzgnszfn9dzxbimi18q3.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzgnszfn9dzxbimi18q3.webp" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Monitor Claude Code?
&lt;/h2&gt;

&lt;p&gt;Claude Code is powerful, but like any tool that slips seamlessly into a developer’s workflow, it can quickly turn into a black box. You know people are using it, but &lt;em&gt;how much, how effectively, and at what cost&lt;/em&gt;? Without telemetry, you’re left guessing whether Claude is driving real impact or just lurking quietly in the background.&lt;/p&gt;

&lt;p&gt;That’s why monitoring matters. With the right &lt;a href="https://signoz.io/guides/observability-pipeline/" rel="noopener noreferrer"&gt;observability pipeline&lt;/a&gt;, Claude Code stops being an invisible assistant and starts showing its true footprint in your engineering ecosystem. By tracking key logs and metrics in SigNoz dashboards, we can answer questions that directly tie usage to value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total token usage &amp;amp; cost&lt;/strong&gt; → How much are we spending, and where are those tokens going?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sessions, conversations &amp;amp; requests per user&lt;/strong&gt; → Who’s using Claude regularly, and what does “active usage” really look like?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quota visibility&lt;/strong&gt; → How close are we to hitting limits (like the 5-hour quota), and do we need to adjust capacity?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance trends&lt;/strong&gt; → From command duration over time to request success rate, are developers getting fast, reliable responses?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavior insights&lt;/strong&gt; → Which terminals are people using (VS Code, Apple Terminal, etc.), how are decisions distributed (accept vs. reject), and what tool types are most popular?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model distribution&lt;/strong&gt; → Which Claude variants (Sonnet, Opus, etc.) are driving the most activity?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, this info transforms Claude Code from “just another AI tool” into something measurable, transparent, and optimizable. Monitoring gives you the clarity to not only justify adoption but also to fine-tune how Claude fits into developer workflows.&lt;/p&gt;

&lt;p&gt;And that’s where the observability stack comes in. OpenTelemetry and SigNoz give us the tools to capture this data, export it cleanly, and turn raw usage into actionable insights. Let’s take a closer look at how they fit into the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry and SigNoz: The Observability Power Duo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is OpenTelemetry?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/blog/what-is-opentelemetry/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; (OTel) is an open-source observability framework that makes it easy to collect telemetry data—traces, metrics, and logs—from across your stack. It’s a CNCF project, widely adopted, and built with flexibility in mind. The key advantage? You instrument once, and your telemetry can flow to any backend you choose. No vendor lock-in and no tangled integrations.&lt;/p&gt;

&lt;p&gt;For Claude Code, this means we can capture usage and performance signals at a very granular level. Every request, every session, every token consumed can be traced and exported via OpenTelemetry. Instead of Claude Code being a black box, you now have standardized hooks to surface: how long requests take, how often they succeed, and which models or terminals are driving activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is SigNoz?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is an all-in-one observability platform that pairs perfectly with OpenTelemetry. Think of it as the dashboard and analysis layer. The place where all your Claude Code telemetry comes to life. With SigNoz, you can visualize logs and metrics in real time, slice usage data by user or model, and set alerts when things go wrong.&lt;/p&gt;

&lt;p&gt;In our case, that means building dashboards that track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Token usage &amp;amp; costs over time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Requests per user and per terminal type&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Command durations and success rates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model distributions (e.g., Sonnet vs Opus)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User decisions (accept vs reject)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining OpenTelemetry’s standardized data collection with SigNoz’s rich visualization and alerting, you get a complete &lt;a href="https://signoz.io/guides/observability-stack/" rel="noopener noreferrer"&gt;observability stack&lt;/a&gt; for Claude Code. The result is not just raw logs and metrics. It’s a full picture of Claude Code in action, right where you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Claude Code
&lt;/h2&gt;

&lt;p&gt;Detailed instructions on setting up OpenTelemetry instrumentation for your Claude Code usage are available in Anthropic's &lt;a href="https://docs.anthropic.com/en/docs/claude-code/monitoring-usage" rel="noopener noreferrer"&gt;monitoring documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1 (VS Code)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Launch VS Code with telemetry enabled&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 
&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp 
&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp 
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc 
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://ingest.us.signoz.cloud:443"&lt;/span&gt; 
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"signoz-ingestion-key=&amp;lt;your-ingestion-key&amp;gt;"&lt;/span&gt; 
&lt;span class="nv"&gt;OTEL_METRIC_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10000 
&lt;span class="nv"&gt;OTEL_LOGS_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5000 
code &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change the &lt;strong&gt;&lt;code&gt;us&lt;/code&gt;&lt;/strong&gt; in the endpoint to match your SigNoz Cloud &lt;a href="https://signoz.io/docs/ingestion/signoz-cloud/overview/#endpoint" rel="noopener noreferrer"&gt;&lt;strong&gt;region&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Replace &lt;strong&gt;&lt;code&gt;&amp;lt;your-ingestion-key&amp;gt;&lt;/code&gt;&lt;/strong&gt; with your SigNoz &lt;a href="https://signoz.io/docs/ingestion/signoz-cloud/keys/" rel="noopener noreferrer"&gt;&lt;strong&gt;ingestion key&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will open VS Code with the required environment variables already configured. From here, any Claude Code activity will automatically generate telemetry and export logs to your SigNoz Cloud instance.&lt;/p&gt;

&lt;p&gt;For convenience, you can also clone our &lt;a href="https://github.com/SigNoz/Claude-Code-OpenTelemetry/blob/main/claude_code_otel_vscode.sh" rel="noopener noreferrer"&gt;bash script&lt;/a&gt;, update it with your SigNoz endpoint and ingestion key, and run it directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2 (Terminal)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Launch Claude Code with telemetry enabled&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 
&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp 
&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp 
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc 
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://ingest.us.signoz.cloud:443"&lt;/span&gt; 
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"signoz-ingestion-key=&amp;lt;your-ingestion-key&amp;gt;"&lt;/span&gt; 
&lt;span class="nv"&gt;OTEL_METRIC_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10000 
&lt;span class="nv"&gt;OTEL_LOGS_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5000 
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change the &lt;strong&gt;&lt;code&gt;us&lt;/code&gt;&lt;/strong&gt; in the endpoint to match your SigNoz Cloud &lt;a href="https://signoz.io/docs/ingestion/signoz-cloud/overview/#endpoint" rel="noopener noreferrer"&gt;&lt;strong&gt;region&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Replace &lt;strong&gt;&lt;code&gt;&amp;lt;your-ingestion-key&amp;gt;&lt;/code&gt;&lt;/strong&gt; with your SigNoz &lt;a href="https://signoz.io/docs/ingestion/signoz-cloud/keys/" rel="noopener noreferrer"&gt;&lt;strong&gt;ingestion key&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will launch Claude Code with telemetry enabled. Any Claude Code activity in the terminal session will automatically generate and export logs and metrics to your SigNoz Cloud instance.&lt;/p&gt;

&lt;p&gt;For convenience, you can also clone our &lt;a href="https://github.com/SigNoz/Claude-Code-OpenTelemetry/blob/main/claude_code_otel_terminal.sh" rel="noopener noreferrer"&gt;bash script&lt;/a&gt;, update it with your SigNoz endpoint and ingestion key, and run it directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Administrator Configuration
&lt;/h3&gt;

&lt;p&gt;Administrators can configure OpenTelemetry settings for all users through the managed settings file. This allows for centralized control of telemetry settings across an organization. See the &lt;a href="https://docs.anthropic.com/en/docs/claude-code/settings#settings-precedence" rel="noopener noreferrer"&gt;&lt;strong&gt;settings precedence&lt;/strong&gt;&lt;/a&gt; for more information about how settings are applied.&lt;/p&gt;

&lt;p&gt;The managed settings file is located at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS: &lt;strong&gt;&lt;code&gt;/Library/Application Support/ClaudeCode/managed-settings.json&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Linux and WSL: &lt;strong&gt;&lt;code&gt;/etc/claude-code/managed-settings.json&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Windows: &lt;strong&gt;&lt;code&gt;C:\ProgramData\ClaudeCode\managed-settings.json&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example managed settings configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CLAUDE_CODE_ENABLE_TELEMETRY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_METRICS_EXPORTER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"otlp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_LOGS_EXPORTER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"otlp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_EXPORTER_OTLP_PROTOCOL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"grpc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_EXPORTER_OTLP_ENDPOINT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://collector.company.com:4317"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_EXPORTER_OTLP_HEADERS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Bearer company-token"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Managed settings can be distributed via MDM (Mobile Device Management) or other device management solutions. Environment variables defined in the managed settings file have high precedence and cannot be overridden by users.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Example Configurations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Console debugging (1-second intervals)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;console
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRIC_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000

&lt;span class="c"&gt;# OTLP/gRPC&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:4317

&lt;span class="c"&gt;# Prometheus&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prometheus

&lt;span class="c"&gt;# Multiple exporters&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;console,otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/json

&lt;span class="c"&gt;# Different endpoints/backends for metrics and logs&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_METRICS_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_METRICS_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://metrics.company.com:4318
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_LOGS_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_LOGS_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://logs.company.com:4317

&lt;span class="c"&gt;# Metrics only (no events/logs)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:4317

&lt;span class="c"&gt;# Events/logs only (no metrics)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:4317
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your Claude Code activity should now automatically emit logs and metrics.&lt;/p&gt;

&lt;p&gt;Finally, you should be able to view logs in SigNoz Cloud under the logs tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8h1t7jtor0lcltg1y1u.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8h1t7jtor0lcltg1y1u.webp" alt="Claude Logs View" width="716" height="186"&gt;&lt;/a&gt;&lt;em&gt;Claude Code Logs View&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you click on any of these logs in SigNoz, you'll see a detailed view of the log, including attributes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36n9lnn9bqg7foyntcdn.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36n9lnn9bqg7foyntcdn.webp" alt="Claude Code Detailed Log View" width="800" height="708"&gt;&lt;/a&gt;&lt;em&gt;Detailed view of Claude Code User Prompt&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You should be able to see Claude Code-related metrics in SigNoz Cloud under the metrics tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrgmo7yrjlhy3esg0aaj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrgmo7yrjlhy3esg0aaj.webp" alt="Claude Code Metrics Overview" width="800" height="359"&gt;&lt;/a&gt;&lt;em&gt;Overview of available Claude Code metrics in SigNoz&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you click on any of these metrics in SigNoz, you'll see a detailed view of the metric, including attributes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqsp0lszdmzsdlsghelt.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqsp0lszdmzsdlsghelt.webp" alt="Claude Code Detailed Metrics View" width="800" height="700"&gt;&lt;/a&gt;&lt;em&gt;Detailed view of Claude Code metrics in SigNoz&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Sense of Your Telemetry Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve wired Claude Code into OpenTelemetry and SigNoz, you’ll start to see a rich stream of metrics flowing in. But raw numbers don’t mean much until you know what they represent. Let’s break down the key metrics Claude Code exports and why they matter for teams looking to understand usage and impact.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.session.count&lt;/code&gt;&lt;/strong&gt; → How many CLI sessions are being started? This tells you how frequently developers are reaching for Claude in their day-to-day workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.lines_of_code.count&lt;/code&gt;&lt;/strong&gt; → Tracks the number of lines of code modified. A simple way to measure how much “hands-on” coding Claude is influencing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.pull_request.count&lt;/code&gt;&lt;/strong&gt; → Keeps count of pull requests created. Helpful for seeing if Claude is actually contributing to shipped code rather than just local tinkering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.commit.count&lt;/code&gt;&lt;/strong&gt; → Monitors the number of Git commits tied to Claude-assisted sessions. Great for measuring real integration into development cycles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.cost.usage&lt;/code&gt;&lt;/strong&gt; → Shows the cost of each session in USD. This is key for keeping budgets in check and spotting whether usage is spiking unexpectedly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.token.usage&lt;/code&gt;&lt;/strong&gt; → Tracks the number of tokens consumed. Useful for understanding scale, model efficiency, and forecasting spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.code_edit_tool.decision&lt;/code&gt;&lt;/strong&gt; → Captures developer decisions when Claude suggests edits (accept vs. reject). Over time, this paints a picture of trust and adoption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;claude_code.active_time.total&lt;/code&gt;&lt;/strong&gt; → The total active time (in seconds) a session runs. Think of this as a measure of “engagement depth”—longer active times often signal serious coding assistance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these metrics visualized in SigNoz, you move from raw telemetry to stories about usage: how often developers lean on Claude, how much code it influences, and whether it’s paying off in commits, pull requests, and team efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Metrics give you the &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;how much,&lt;/em&gt; but logs tell the &lt;em&gt;story behind the numbers&lt;/em&gt;. Claude Code exports a variety of rich logs through OpenTelemetry that let you dig into the details of how developers interact with the assistant in real time. Here’s a breakdown of the key event types and what they mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User Prompt Event (&lt;code&gt;claude_code.user_prompt&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logged whenever a developer submits a prompt. Attributes include timestamp, prompt length, and (optionally) the prompt itself if you’ve enabled &lt;code&gt;OTEL_LOG_USER_PROMPTS=1&lt;/code&gt;. This is your front-row seat into &lt;em&gt;what kinds of requests developers are making&lt;/em&gt; and how frequently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool Result Event (&lt;code&gt;claude_code.tool_result&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Captures the outcome of a tool execution. You’ll see the tool name, whether it succeeded or failed, execution time, errors (if any), and the developer’s decision (accept or reject). With this, you can measure not just tool usage but also &lt;em&gt;trust and reliability&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;API Request Event (&lt;code&gt;claude_code.api_request&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fired on every API call to Claude. Attributes include model name, cost, duration, token counts (input/output/cache), and more. This is where you connect usage directly to cost efficiency and performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;API Error Event (&lt;code&gt;claude_code.api_error&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logged when an API request fails. You’ll see error messages, HTTP status codes, duration, and retry attempts. These events are critical for debugging reliability issues and spotting patterns like repeated failures on specific models or endpoints.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool Decision Event (&lt;code&gt;claude_code.tool_decision&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Records when a tool permission decision is made—whether developers accept or reject a suggested action, and the source of that decision (config, user override, abort, etc.). Over time, this shows how much developers trust Claude’s automated suggestions versus stepping in manually.&lt;/p&gt;

&lt;p&gt;By streaming these events into SigNoz, you don’t just know that “Claude Code was used X times.” You can see the full lifecycle of interactions from a prompt being entered, to tools executing, to API calls completing (or failing), all the way to whether a developer accepted the outcome. It’s observability not just at the system level, but at the &lt;em&gt;human + AI collaboration level&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Data to Dashboards: Bringing Claude Code Logs &amp;amp; Metrics to Life
&lt;/h2&gt;

&lt;p&gt;Once you've got Claude Code's telemetry flowing into SigNoz, you can build dashboards to monitor critical metrics like total token usage, request patterns, and performance bottlenecks. You can check out our Claude Code dashboard template &lt;a href="https://github.com/SigNoz/dashboards/blob/main/claude-code/claude-code-dashboard.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Total Token Usage (Input &amp;amp; Output)
&lt;/h3&gt;

&lt;p&gt;Tokens are the currency of AI coding assistants. By splitting input tokens (developer prompts) and output tokens (Claude’s responses), this panel shows exactly how much work Claude is doing. Over time, you can see whether usage is ramping up, stable, or dropping off—and keep an eye on efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoq6w9fwwk20oirvy7ek.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoq6w9fwwk20oirvy7ek.webp" alt="Claude Code Input/Output Token Usage" width="800" height="191"&gt;&lt;/a&gt;&lt;em&gt;Dashboard showing total input and output token usage for Claude Code&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sessions and Conversations
&lt;/h3&gt;

&lt;p&gt;This panel tracks how many CLI sessions and conversations are happening. Sessions show how often developers are turning to Claude, while conversations capture depth of interaction. Together, they reveal adoption and engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzcvt0b1w50r6tbkoy53.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzcvt0b1w50r6tbkoy53.webp" alt="Claude Code Sessions and Conversations" width="800" height="191"&gt;&lt;/a&gt;&lt;em&gt;Tracking Claude Code sessions and conversations&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Total Cost (USD)
&lt;/h3&gt;

&lt;p&gt;Claude Code usage comes with a cost. This panel translates token consumption into actual dollars spent. It’s a quick way to validate ROI, spot runaway usage early, and ensure your AI assistant remains a cost-effective part of the toolchain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qefmreiizha05b0pp6y.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qefmreiizha05b0pp6y.webp" alt="Claude Code Total Cost" width="506" height="204"&gt;&lt;/a&gt;&lt;em&gt;Total cost tracking in USD for Claude Code token usage&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Command Duration (P95)
&lt;/h3&gt;

&lt;p&gt;How long do Claude-assisted commands actually take? This chart tracks the 95th percentile duration, helping you catch slowdowns, spikes, or performance regressions. Developers want Claude to be fast—this view keeps latency in check.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56me8ogaospj41cpgyax.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56me8ogaospj41cpgyax.webp" alt="Claude Code Command Duration P95" width="800" height="340"&gt;&lt;/a&gt;&lt;em&gt;95th percentile command duration tracking for Claude Code requests&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Usage Over Time
&lt;/h3&gt;

&lt;p&gt;Instead of looking at total tokens in a snapshot, this time series shows usage trends. Are developers spiking usage during sprints? Is there a steady upward adoption curve? This view is perfect for spotting both growth and anomalies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1vcc0xegcvviwjz3f28.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1vcc0xegcvviwjz3f28.webp" alt="Claude Code Token Usage Over Time" width="800" height="292"&gt;&lt;/a&gt;&lt;em&gt;Time series view of Claude Code token usage trends&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Success Rate of Requests
&lt;/h3&gt;

&lt;p&gt;Not every request to Claude is successful. This panel highlights how often requests succeed vs. fail, helping you spot reliability issues—whether from the model, connectivity, or developer inputs. A healthy success rate means smooth workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dc21acwf674ac4mntcy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dc21acwf674ac4mntcy.webp" alt="Claude Code Request Success Rate" width="800" height="422"&gt;&lt;/a&gt;&lt;em&gt;Success rate tracking for Claude Code API requests&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Terminal Type
&lt;/h3&gt;

&lt;p&gt;Claude Code is flexible, but developers use it differently depending on environment. This pie chart shows where developers are working—VS Code, Apple Terminal, or elsewhere. Great for understanding adoption across dev setups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2k6m9co2nyy8cy9vf41.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2k6m9co2nyy8cy9vf41.webp" alt="Claude Code Terminal Type Distribution" width="800" height="462"&gt;&lt;/a&gt;&lt;em&gt;Pie chart showing distribution of Claude Code usage across different terminal types&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Requests per User
&lt;/h3&gt;

&lt;p&gt;Usage isn’t always evenly distributed. This table breaks down requests by user, making it clear who’s leaning on Claude heavily and who’s barely touching it. Perfect for identifying champions, training needs, or power users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkg14qvpxao1zn5z41xj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkg14qvpxao1zn5z41xj.webp" alt="Claude Code Requests Per User" width="800" height="243"&gt;&lt;/a&gt;&lt;em&gt;Table showing request volume breakdown by individual users&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Distribution
&lt;/h3&gt;

&lt;p&gt;Claude ships with multiple models, and not all usage is equal. This panel shows which models developers are actually calling. It’s a handy way to track preferences and see if newer models are gaining traction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt94ap1nf75ca7ecixfi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt94ap1nf75ca7ecixfi.webp" alt="Claude Code Model Distribution" width="800" height="446"&gt;&lt;/a&gt;&lt;em&gt;Distribution of Claude model usage&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Types
&lt;/h3&gt;

&lt;p&gt;Claude can call on different tools, such as &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Edit&lt;/code&gt;, &lt;code&gt;LS&lt;/code&gt;, &lt;code&gt;TodoWrite&lt;/code&gt;, and &lt;code&gt;Bash&lt;/code&gt;. This breakdown shows which tools are most frequently used, shining a light on the kinds of coding tasks developers trust Claude with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspd2mwjhjauuhehoska5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspd2mwjhjauuhehoska5.webp" alt="Claude Code Tool Type Usage" width="800" height="441"&gt;&lt;/a&gt;&lt;em&gt;Breakdown of tool types used by Claude Code&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  User Decisions
&lt;/h3&gt;

&lt;p&gt;AI suggestions only matter if developers use them. This panel tracks accept vs. reject decisions, showing how much developers trust Claude’s output. High acceptance is a sign of quality; high rejection is a signal to dig deeper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v146b6qa7w917wz3rog.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v146b6qa7w917wz3rog.webp" alt="Claude Code User Decision Tracking" width="800" height="445"&gt;&lt;/a&gt;&lt;em&gt;User decision metrics showing accept vs. reject rates for Claude Code suggestions and tool outputs&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quota Usage (5-Hour Rolling Window)
&lt;/h3&gt;

&lt;p&gt;Claude Code subscriptions often come with rolling quotas that reset every 5 hours. This panel tracks how much of that rolling limit has been used based on your specific subscription plan, giving you an early warning system before developers hit hard caps. Instead of being caught off guard by usage rejections, teams can proactively manage consumption and adjust workflows as they approach the threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk455ib4c7k1yoo9bu3v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk455ib4c7k1yoo9bu3v.webp" alt="Claude Code Quota Usage Monitoring" width="452" height="224"&gt;&lt;/a&gt;&lt;em&gt;5-hour rolling quota usage tracking to monitor consumption against subscription limits&lt;/em&gt;&lt;/p&gt;
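&lt;p&gt;The logic behind this panel is worth making concrete. Below is a minimal sketch of rolling-window quota tracking; the 5-hour window matches Claude Code's quota behavior described above, but the token-based limit and the numbers used are illustrative assumptions, not actual plan values:&lt;/p&gt;

```python
from collections import deque

class RollingQuota:
    """Track usage inside a rolling time window (e.g. Claude Code's 5-hour quota).

    Illustrative sketch: the token-denominated limit is an assumption,
    not an actual subscription plan value.
    """

    def __init__(self, limit_tokens, window_seconds=5 * 3600):
        self.limit = limit_tokens
        self.window = window_seconds
        self.events = deque()  # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def record(self, timestamp, tokens):
        """Add a usage event, then drop events that have aged out of the window."""
        self.events.append((timestamp, tokens))
        self.used += tokens
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, expired = self.events.popleft()
            self.used -= expired

    def percent_used(self):
        """Share of the rolling limit consumed, as a percentage."""
        return 100.0 * self.used / self.limit

quota = RollingQuota(limit_tokens=1_000_000)
quota.record(0, 400_000)
quota.record(3 * 3600, 200_000)   # both events inside the window: 60% used
quota.record(6 * 3600, 100_000)   # the first event has aged out: 30% used
print(quota.percent_used())
```

&lt;p&gt;Conceptually, the dashboard panel applies the same sliding-window aggregation over the ingested usage metrics, just expressed as a query instead of code.&lt;/p&gt;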

&lt;p&gt;Taken together, these panels create more than just a pretty dashboard. They form a control center for Claude Code observability. You can see usage patterns unfold in real time, tie costs back to activity, and build trust in Claude’s role as part of the development workflow. Whether you’re keeping budgets in check, tracking adoption, or optimizing performance, dashboards give you the clarity to manage AI-assisted coding at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping It Up
&lt;/h2&gt;

&lt;p&gt;As AI coding assistants like Claude Code become part of daily developer workflows, observability isn’t optional—it’s essential. By combining Claude Code’s built-in monitoring hooks with OpenTelemetry and SigNoz, you can transform raw telemetry into a living, breathing picture of usage, performance, and cost.&lt;/p&gt;

&lt;p&gt;From tracking tokens and costs, to understanding which tools and models developers actually rely on, to surfacing adoption trends and decision patterns, observability gives you the power to manage Claude Code with the same rigor you bring to any other critical piece of infrastructure. Dashboards then tie it all together, turning streams of data into a real-time pulse of how Claude Code powers development.&lt;/p&gt;

&lt;p&gt;The result? Teams gain the confidence to scale Claude Code usage responsibly, optimize for performance and spend, and, most importantly, make evidence-backed decisions about how AI fits into their engineering culture. With visibility comes clarity, and with clarity Claude Code becomes not just an assistant but a measurable driver of developer productivity.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Top 9 Lightstep (ServiceNow) Alternatives in 2025 (And How to Migrate)</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Wed, 12 Nov 2025 08:10:27 +0000</pubDate>
      <link>https://dev.to/signoz/top-9-lightstep-servicenow-alternatives-in-2025-and-how-to-migrate-5135</link>
      <guid>https://dev.to/signoz/top-9-lightstep-servicenow-alternatives-in-2025-and-how-to-migrate-5135</guid>
      <description>&lt;p&gt;ServiceNow recently announced the &lt;a href="https://docs.lightstep.com/changelog/eol-notice#heres-what-you-need-to-know" rel="noopener noreferrer"&gt;end-of-life (EOL) for Lightstep&lt;/a&gt;, now known as Cloud Observability. Support will officially end on &lt;strong&gt;March 1, 2026&lt;/strong&gt;, or at the end of your contract, and ServiceNow has confirmed there will be no direct replacement or migration path.&lt;/p&gt;

&lt;p&gt;This news can be disruptive for engineering teams who rely on Lightstep for critical observability. The announcement highlights a significant risk in relying on proprietary, closed-source tools for essential infrastructure.&lt;/p&gt;

&lt;p&gt;At SigNoz, we believe critical infrastructure shouldn’t hinge on a vendor’s roadmap. Observability is too important to leave to such risks. With open source, you always have an escape hatch.&lt;/p&gt;

&lt;p&gt;If you're now searching for a stable, future-proof alternative to Lightstep, you're in the right place. This guide will walk you through the best Lightstep alternatives, with a focus on open-source and OpenTelemetry-native solutions, and show you how to make a clean migration.&lt;/p&gt;
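&lt;p&gt;As a preview of how clean that migration can be: since Lightstep ingests standard OTLP, switching backends is often just a matter of repointing the exporter. Here is a minimal sketch using the standard OpenTelemetry environment variables; the endpoints, header names, and placeholder values shown are illustrative, so confirm the exact values in your Lightstep setup and your SigNoz ingestion settings:&lt;/p&gt;

```shell
# Before: application or Collector exporting OTLP to Lightstep (illustrative values)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.lightstep.com:443"
export OTEL_EXPORTER_OTLP_HEADERS="lightstep-access-token=<old-token>"

# After: the same OTLP stream pointed at SigNoz (placeholder region and key)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.<region>.signoz.cloud:443"
export OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<your-ingestion-key>"
```

&lt;p&gt;If you route telemetry through an OpenTelemetry Collector, the equivalent change is swapping the endpoint and headers on its OTLP exporter; your instrumentation code stays untouched.&lt;/p&gt;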

&lt;h2&gt;
  
  
  What was Lightstep? A Quick Refresher
&lt;/h2&gt;

&lt;p&gt;Lightstep was a powerful observability platform, highly regarded as an early leader in distributed tracing. Many engineering teams chose it for its:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong OpenTelemetry Support:&lt;/strong&gt; Lightstep was an early and significant contributor to OpenTelemetry (OTel), building its platform to provide first-class OpenTelemetry support. This meant users could instrument their applications with open standards, avoiding agent lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Observability:&lt;/strong&gt; It provided unified dashboards to visualize logs, metrics, and traces in one place, which helped teams correlate different telemetry signals during an investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Tracing Features:&lt;/strong&gt; It offered features like service diagrams, root cause analysis, and "Change Intelligence" to correlate performance deviations with deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good Lightstep alternative should ideally match or exceed these capabilities, providing a unified view of telemetry while being built on open standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 9 Lightstep Alternatives
&lt;/h2&gt;

&lt;p&gt;Here are the top alternatives to Lightstep, evaluated on their support for logs, metrics, and traces, their open-source nature, and their ease of migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SigNoz
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is a full-stack, open-source observability platform that unifies logs, metrics, and traces in a single application. It is built to be natively compatible with OpenTelemetry, making it a natural and seamless replacement for Lightstep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysw09x2yax78oi27hcxo.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysw09x2yax78oi27hcxo.webp" alt="SigNoz observability platform dashboard" width="800" height="500"&gt;&lt;/a&gt;&lt;em&gt;SigNoz dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Truly Unified Observability:&lt;/strong&gt; SigNoz uses a single columnar database (ClickHouse) as the backend for logs, metrics, and traces, so you can pivot from a trace to related logs without switching tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry-Native:&lt;/strong&gt; SigNoz is built from the ground up to support OpenTelemetry. There are no proprietary agents, and it fully supports OTel semantic conventions, ensuring a smooth migration from Lightstep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effective and Transparent Pricing:&lt;/strong&gt; The open-source version is free to self-host. SigNoz Cloud offers a simple, usage-based pricing model without per-user or per-host fees, which is often more predictable and affordable than competitors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Querying and Analytics:&lt;/strong&gt; You can run advanced analytics on your telemetry data, such as finding the p99 latency for all services with a specific tag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Over Your Data:&lt;/strong&gt; With both cloud and self-hosted options, you can choose where your data resides, which is crucial for compliance and privacy.&lt;/li&gt;
&lt;/ul&gt;
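&lt;p&gt;To make the querying claim concrete, here is a hedged sketch of the kind of aggregation such a query performs: computing p99 latency per service from span-like records. The field names and the nearest-rank percentile method are illustrative choices, not SigNoz's actual schema or implementation:&lt;/p&gt;

```python
import math
from collections import defaultdict

def p99_by_service(spans, tag_key, tag_value):
    """Compute p99 latency per service for spans carrying a given tag.

    `spans` is a list of dicts with hypothetical fields "service",
    "duration_ms", and a "tags" dict (illustrative schema only).
    """
    groups = defaultdict(list)
    for span in spans:
        if span["tags"].get(tag_key) == tag_value:
            groups[span["service"]].append(span["duration_ms"])
    percentiles = {}
    for service, durations in groups.items():
        durations.sort()
        # Nearest-rank method: smallest value with >= 99% of samples at or below it.
        idx = max(0, math.ceil(0.99 * len(durations)) - 1)
        percentiles[service] = durations[idx]
    return percentiles

# 100 spans with durations 1..100 ms, all tagged env=prod
spans = [{"service": "checkout", "duration_ms": d, "tags": {"env": "prod"}}
         for d in range(1, 101)]
print(p99_by_service(spans, "env", "prod"))  # {'checkout': 99}
```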

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compared to Datadog, its ecosystem of pre-built integrations is still growing. However, its OTel-native nature means it can ingest data from any system that supports OTLP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SigNoz is a strong candidate for teams looking for a direct, future-proof Lightstep alternative. Its open-source nature, native OpenTelemetry support, and unified data model provide a powerful, cost-effective, and flexible solution for teams who want to own their observability data and avoid vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Datadog
&lt;/h3&gt;

&lt;p&gt;Datadog is a market-leading, all-in-one SaaS observability platform that covers everything from infrastructure monitoring and APM to log management and security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xunnqnfuqszfpuv6wgg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xunnqnfuqszfpuv6wgg.webp" alt="Datadog observability platform dashboard" width="800" height="402"&gt;&lt;/a&gt;&lt;em&gt;Datadog dashboard (credits: Datadog)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Feature Set:&lt;/strong&gt; Datadog offers an incredibly broad range of features consolidated into a single platform, reducing tool sprawl.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensive Integrations:&lt;/strong&gt; It has hundreds of integrations that make setup easy for a wide variety of technologies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polished UI:&lt;/strong&gt; The dashboards are highly customizable and the platform includes AI-powered features like Watchdog for anomaly detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High and Unpredictable Cost:&lt;/strong&gt; Datadog's multi-vector pricing (per-host, per-GB, per-user, custom metrics) is notoriously complex and often leads to bill shock. OpenTelemetry metrics can be charged as expensive "custom metrics."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Lock-in:&lt;/strong&gt; The platform heavily relies on its proprietary agent, which can make it difficult and costly to migrate away from.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a detailed breakdown of Datadog's pricing complexities and hidden costs, check out this comprehensive guide: &lt;a href="https://signoz.io/blog/datadog-pricing/" rel="noopener noreferrer"&gt;Datadog Pricing Main Caveats Explained&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog is a fit for large enterprises with a significant budget who prioritize a single, deeply integrated platform. However, for teams concerned about cost and vendor lock-in (the very issues highlighted by Lightstep's EOL), Datadog's model presents similar risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. New Relic
&lt;/h3&gt;

&lt;p&gt;New Relic is an observability pioneer that has evolved into a full-stack platform. It offers a unified solution with a generous free tier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faizzqsufxeaoyivlkbiz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faizzqsufxeaoyivlkbiz.webp" alt="New Relic observability platform dashboard" width="800" height="606"&gt;&lt;/a&gt;&lt;em&gt;New Relic dashboard (credits: New Relic)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong APM Capabilities:&lt;/strong&gt; New Relic provides deep, code-level insights and powerful transaction tracing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generous Free Tier:&lt;/strong&gt; Its free tier (100 GB of data ingest per month) provides an easy entry point for small teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support:&lt;/strong&gt; New Relic offers excellent support for OpenTelemetry, allowing you to send data via OTLP without using their proprietary agents. However, it's important to understand that the platform is &lt;strong&gt;OTel-compatible, not OTel-native&lt;/strong&gt;. It ingests and translates OpenTelemetry data into its &lt;a href="https://docs.newrelic.com/docs/opentelemetry/best-practices/opentelemetry-best-practices-metrics/" rel="noopener noreferrer"&gt;internal proprietary format&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost at Scale:&lt;/strong&gt; Despite a simplified model, costs can escalate significantly with high data volumes or a large number of users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Complexity:&lt;/strong&gt; The user interface can be cluttered and has a steep learning curve for new users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;New Relic is a solid option for teams that need deep APM insights and want to start with a free tier. However, be mindful of costs as your usage grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Honeycomb
&lt;/h3&gt;

&lt;p&gt;Honeycomb is a SaaS platform designed for analyzing high-cardinality, high-dimensionality data. It focuses on event and trace-based debugging for complex distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwq32bto3wkqvjmfnv7x.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwq32bto3wkqvjmfnv7x.webp" alt="Honeycomb observability platform dashboard" width="800" height="429"&gt;&lt;/a&gt;&lt;em&gt;Honeycomb dashboard (credits: Honeycomb)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Excellent for High-Cardinality Data:&lt;/strong&gt; Honeycomb excels at letting you slice and dice data by any attribute, which is invaluable for debugging novel production issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Centric Workflow:&lt;/strong&gt; Features like "BubbleUp" automatically highlight outlier attributes, speeding up root cause analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry-Native:&lt;/strong&gt; It's a strong proponent of OTel and offers simple, event-based pricing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not a Full-Stack Solution:&lt;/strong&gt; Honeycomb is hyper-focused on tracing and events. Its capabilities for traditional infrastructure monitoring and unstructured log management are less mature. You will likely need other tools to get a complete picture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Honeycomb is a great choice for developer-centric teams who prioritize deep, investigative tracing. If your main goal is to replace Lightstep's tracing capabilities, Honeycomb is a strong contender, but you'll need a separate solution for metrics and logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Grafana OSS Stack (LGTM)
&lt;/h3&gt;

&lt;p&gt;The Grafana stack is a popular open-source, composable observability solution consisting of Loki (for logs), Grafana (for visualization), Tempo (for traces), and Mimir/Prometheus (for metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6feeyrs2iedk29je5ce.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6feeyrs2iedk29je5ce.webp" alt="Grafana observability platform dashboard" width="800" height="501"&gt;&lt;/a&gt;&lt;em&gt;Grafana dashboard (credits: Grafana)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best-in-Class Visualization:&lt;/strong&gt; Grafana provides beautiful, highly customizable dashboards that can pull data from countless sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open and Composable:&lt;/strong&gt; Being fully open-source, it offers maximum flexibility and no vendor lock-in. It's a cornerstone of the CNCF ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong Community:&lt;/strong&gt; It has a vibrant open-source community and is the de-facto standard for visualizing Prometheus metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Complexity:&lt;/strong&gt; The biggest challenge is managing four separate, distributed systems (Loki, Grafana, Tempo, Mimir). This requires significant operational expertise and can be resource-intensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Siloed Data:&lt;/strong&gt; While Grafana can visualize data from different sources, the backends are separate. This makes it harder to perform deep correlations between logs, metrics, and traces compared to a unified system like SigNoz.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Grafana stack is ideal for teams with strong DevOps expertise who want full control over their observability pipeline and prioritize visualization. However, if you're looking for a simpler, unified experience out-of-the-box, it might not be the best fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Jaeger
&lt;/h3&gt;

&lt;p&gt;Jaeger is a graduated CNCF open-source project and a well-known standard for end-to-end distributed tracing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1h50yyg0ospd3lbuwzy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1h50yyg0ospd3lbuwzy.webp" alt="Jaeger distributed tracing dashboard" width="800" height="474"&gt;&lt;/a&gt;&lt;em&gt;Jaeger dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Open-Source Tracing:&lt;/strong&gt; Jaeger is a robust, scalable, and free solution dedicated to distributed tracing. It provides rich visualization tools for trace timelines and service dependency graphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNCF Standard:&lt;/strong&gt; As a CNCF project, it's tightly aligned with the cloud-native ecosystem and OpenTelemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracing Only:&lt;/strong&gt; Jaeger's main limitation is its exclusive focus on tracing. It does not handle logs or metrics. To get full observability, you must integrate and manage it with other tools like Prometheus for metrics and an ELK stack for logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Jaeger is an excellent foundational component for a DIY observability stack. If you only need to replace Lightstep's tracing functionality and are prepared to manage separate systems for logs and metrics, Jaeger is a solid, cost-free choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Elastic Observability
&lt;/h3&gt;

&lt;p&gt;Elastic Observability is a unified solution built on the well-known Elastic Stack (formerly ELK Stack), combining log management, APM, and infrastructure monitoring. It leverages the power of Elasticsearch for search and analytics, making it a strong contender, especially for teams with log-heavy workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjtrpw9wastlyoe36v2y.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjtrpw9wastlyoe36v2y.webp" alt="Elastic Observability platform dashboard" width="800" height="495"&gt;&lt;/a&gt;&lt;em&gt;Elastic Observability dashboard (credits: Elastic)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Log Analytics:&lt;/strong&gt; Its greatest strength is the underlying Elasticsearch engine, which offers exceptionally fast and flexible search and analytics capabilities for log data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Experience:&lt;/strong&gt; It provides a single, unified experience for logs, metrics, and traces within the Kibana interface, allowing for seamless correlation between different signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-Core Foundation:&lt;/strong&gt; The solution is built on an open-source foundation, which provides a low-friction entry point and avoids immediate vendor lock-in. It is often seen as a more cost-effective alternative to Splunk, particularly for self-hosted deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Complexity:&lt;/strong&gt; The primary trade-off is the complexity of managing a large-scale Elasticsearch cluster, which requires significant expertise in tuning and scaling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Costs:&lt;/strong&gt; While Elastic Cloud removes the operational burden of self-hosting, its costs can be confusing and unexpectedly high for large workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maturing APM:&lt;/strong&gt; Its Application Performance Monitoring (APM) solution is generally considered less mature and automated than APM-native competitors, which might require more manual configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elastic Observability is ideal for engineering teams whose primary need is powerful log search and analytics. It suits organizations that either have the in-house expertise to manage a complex distributed system or are willing to pay for the convenience of the managed cloud service. It’s a solid choice for those who want a unified, open-core solution for log-heavy environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Splunk Observability Cloud
&lt;/h3&gt;

&lt;p&gt;Splunk Observability Cloud is a comprehensive SaaS platform that combines APM, infrastructure monitoring, and real user monitoring, built to integrate tightly with Splunk's industry-leading log management platform. It is designed for enterprise-scale operations and is fully OpenTelemetry-native.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lohvemxczfqina713j7.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lohvemxczfqina713j7.webp" alt="Splunk Observability Cloud platform dashboard" width="800" height="447"&gt;&lt;/a&gt;&lt;em&gt;Splunk Observability Cloud dashboard (credits: Splunk)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Log Integration:&lt;/strong&gt; Its standout feature is &lt;code&gt;Log Observer Connect&lt;/code&gt;, which seamlessly links metrics and traces in the Observability Cloud with the deep log analytics of the core Splunk platform, without requiring duplicate data ingestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-Fidelity Tracing:&lt;/strong&gt; The platform offers &lt;code&gt;NoSample™ Full-Fidelity Tracing&lt;/code&gt;, which captures 100% of trace data. This is valuable for organizations that cannot afford to miss any transaction data for debugging or compliance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Cost:&lt;/strong&gt; Splunk is notoriously expensive, and the Observability Cloud is no exception. It is one of the premium-priced solutions on the market.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segregated Architecture:&lt;/strong&gt; Logs typically live in the Splunk Platform, while metrics/traces live in Splunk Observability Cloud. Log Observer Connect correlates these without re-ingest. This separation can add architectural complexity versus a single-store design.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Splunk Observability Cloud is the ideal choice for large organizations already heavily invested in the Splunk ecosystem for log management, security, and SIEM. It offers an APM solution that integrates well with existing Splunk deployments. However, for companies without a prior Splunk investment, the high cost and segregated architecture make it a less compelling option.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Dynatrace
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F898hlb9f9stcy1ixz7kf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F898hlb9f9stcy1ixz7kf.webp" alt="Dynatrace observability platform dashboard" width="800" height="449"&gt;&lt;/a&gt;&lt;em&gt;Dynatrace dashboard (credits: Dynatrace)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Dynatrace is an enterprise-grade, all-in-one observability and security platform distinguished by its heavy emphasis on AI-powered automation. Its core mission is to provide answers, not just data, by automatically identifying root causes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features &amp;amp; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-Powered Root Cause Analysis:&lt;/strong&gt; Its strongest differentiator is the "Davis" AI engine, which automatically identifies dependencies, detects anomalies, and provides precise root-cause analysis with minimal manual configuration, significantly reducing Mean Time To Resolution (MTTR).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Discovery and Deployment:&lt;/strong&gt; Its "OneAgent" technology offers a highly simplified, zero-touch deployment that automatically discovers all application and infrastructure components, providing a real-time topology map.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-Stack Context:&lt;/strong&gt; Dynatrace provides deep context with its proprietary "PurePath" tracing technology, which captures end-to-end transaction details.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Considerations &amp;amp; Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prohibitive Cost:&lt;/strong&gt; Like other enterprise-focused tools, &lt;a href="https://signoz.io/guides/dynatrace-pricing/" rel="noopener noreferrer"&gt;Dynatrace is very expensive&lt;/a&gt; and generally priced for large organizations, making it inaccessible for smaller teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary Nature:&lt;/strong&gt; While it supports OpenTelemetry, its primary value comes from the proprietary OneAgent and Davis AI. This can create a sense of being in a "black box" for some engineering teams and leads to potential vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Complexity:&lt;/strong&gt; Users often describe the platform as a somewhat disjointed collection of tools with a steep learning curve, and documentation can be a pain point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynatrace is best for large enterprises with complex, hybrid-cloud environments that are willing to pay a premium for a platform that heavily automates discovery, dependency mapping, and root-cause analysis. It suits organizations that prefer an AI-driven, hands-off approach to observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Migrate from Lightstep to SigNoz
&lt;/h2&gt;

&lt;p&gt;Migrating from Lightstep to SigNoz is straightforward, thanks to OpenTelemetry. If your applications are instrumented with OpenTelemetry, you won't need to change any application code; the entire process involves re-configuring where your telemetry data is sent.&lt;/p&gt;

&lt;p&gt;Here’s a simple 3-step guide:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Get Started with SigNoz
&lt;/h3&gt;

&lt;p&gt;The quickest way to start is with &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz Cloud&lt;/a&gt;. You can create an account in minutes and get a secure endpoint to send your data to. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Once you sign up, you'll find your ingestion details in the SigNoz UI under &lt;code&gt;Settings&lt;/code&gt; -&amp;gt; &lt;code&gt;Ingestion Details&lt;/code&gt;. You'll need the &lt;code&gt;Region&lt;/code&gt; and &lt;code&gt;Ingestion Key&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Reconfigure Your OpenTelemetry Collector
&lt;/h3&gt;

&lt;p&gt;You are likely using an OpenTelemetry Collector to send data to Lightstep. To switch to SigNoz, you just need to update the exporter configuration in your collector's YAML file (&lt;code&gt;config.yaml&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Lightstep Config):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your old configuration probably looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp/lightstep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingest.lightstep.com:443&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lightstep-access-token"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${LIGHTSTEP_ACCESS_TOKEN}"&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/lightstep&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (SigNoz Config):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Update the &lt;code&gt;exporters&lt;/code&gt; section to point to the SigNoz Cloud endpoint for your region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp/signoz&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace &amp;lt;REGION&amp;gt; with your SigNoz cloud region&lt;/span&gt;
    &lt;span class="c1"&gt;# ex: us, in, eu&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest.&amp;lt;REGION&amp;gt;.signoz.cloud:443"&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signoz-ingestion-key"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${SIGNOZ_INGESTION_KEY}"&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="c1"&gt;# Switch the exporter to SigNoz&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/signoz&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;REGION&amp;gt;&lt;/code&gt; with your actual SigNoz region (e.g., &lt;code&gt;us&lt;/code&gt;, &lt;code&gt;eu&lt;/code&gt;, or &lt;code&gt;in&lt;/code&gt;) and set the &lt;code&gt;SIGNOZ_INGESTION_KEY&lt;/code&gt; environment variable to your ingestion key from Step 1. After applying the new configuration, your telemetry will start flowing to SigNoz.&lt;/p&gt;

&lt;p&gt;For a detailed guide, check out our documentation on &lt;a href="https://signoz.io/docs/migration/migrate-from-opentelemetry/cloud/" rel="noopener noreferrer"&gt;migrating from an existing OpenTelemetry setup&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Verify and Decommission
&lt;/h3&gt;

&lt;p&gt;We recommend running both platforms in parallel for a short period. Once you've verified that all your data is flowing correctly into SigNoz and have recreated any necessary dashboards or alerts, you can safely decommission your Lightstep integration.&lt;/p&gt;

&lt;p&gt;While migrating from Lightstep to any other platform, there are a few important caveats to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards &amp;amp; Alerts:&lt;/strong&gt; Existing dashboards, alerts, and saved queries in Lightstep cannot be directly ported. Teams will need to rebuild these in the new system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collector Differences:&lt;/strong&gt; If you were using Lightstep’s older custom exporter (before OpenTelemetry became the standard), you may need to adjust or replace those configurations when switching to a pure OTLP pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also note that historical data cannot be migrated from Lightstep to SigNoz. You will be starting fresh in the new system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Ways to Get Started with SigNoz
&lt;/h2&gt;

&lt;p&gt;Apart from SigNoz Cloud (shown above), you can choose from several other deployment options.&lt;/p&gt;

&lt;p&gt;Teams with data privacy requirements that prevent sending data outside their own infrastructure can sign up for either the &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC (Bring Your Own Cloud) offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The end-of-life for Lightstep is a significant event, but it's also an opportunity to move to a more resilient, open, and cost-effective observability stack. By choosing an OpenTelemetry-native platform like SigNoz, you not only find a safe landing place but also protect yourself from future vendor-driven disruptions. The migration is simpler than you might think, and the long-term benefits of owning your observability data are well worth the effort.&lt;/p&gt;

&lt;p&gt;We hope this answered your questions about Lightstep alternatives. If you have more, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from the observability nerds at SigNoz and get open-source, OpenTelemetry, and devtool-building stories straight to your inbox.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>opensource</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Chronosphere vs Datadog: Which Observability Platform is Right for You in 2025?</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Wed, 12 Nov 2025 08:10:04 +0000</pubDate>
      <link>https://dev.to/signoz/chronosphere-vs-datadog-which-observability-platform-is-right-for-you-in-2025-546n</link>
      <guid>https://dev.to/signoz/chronosphere-vs-datadog-which-observability-platform-is-right-for-you-in-2025-546n</guid>
      <description>&lt;p&gt;As organizations adopt cloud-native architectures like microservices and Kubernetes, they face an explosion of telemetry data. This data is essential for understanding system health, but it also brings significant challenges: soaring costs, tool fragmentation, and persistent alert fatigue for on-call engineers.&lt;/p&gt;

&lt;p&gt;Choosing the right observability platform is critical to managing this complexity. In the observability space, Datadog and Chronosphere are two popular solutions. &lt;strong&gt;Datadog&lt;/strong&gt; is the established, all-in-one market leader, known for its broad feature set and ease of use. &lt;strong&gt;Chronosphere&lt;/strong&gt; is a newer, cloud-native challenger focused on taming data at scale and improving the on-call experience.&lt;/p&gt;

&lt;p&gt;This article provides a detailed comparison of Chronosphere and Datadog to help you decide which platform best fits your needs. We'll compare them across key criteria: architecture, cost control, open-source alignment, and on-call experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  At a Glance: Chronosphere vs. Datadog
&lt;/h2&gt;

&lt;p&gt;Here’s a high-level summary of how the two platforms stack up on key features.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature Area&lt;/th&gt;
&lt;th&gt;Chronosphere&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cost control &amp;amp; reliability for cloud-native scale&lt;/td&gt;
&lt;td&gt;All-in-one observability for a broad range of use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Predictable, based on valuable data after filtering&lt;/td&gt;
&lt;td&gt;Consumption-based across many SKUs, can be complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Granular control plane to filter/aggregate at ingest&lt;/td&gt;
&lt;td&gt;Robust but different controls (e.g., Metrics without Limits™, custom-metrics governance, cardinality tooling); much of the shaping happens before or at indexing rather than in an at-ingest control plane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open-Source Compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native Prometheus &amp;amp; OpenTelemetry support&lt;/td&gt;
&lt;td&gt;Supports OTel, but ecosystem favors proprietary agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform Breadth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Focused on Metrics, Logs, Traces (MELT)&lt;/td&gt;
&lt;td&gt;Extensive: MELT, RUM, Synthetics, Security, and more (1,000+ integrations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting Philosophy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opinionated, SLO-driven to reduce noise&lt;/td&gt;
&lt;td&gt;Flexible, but can easily lead to alert fatigue if unmanaged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ease of Initial Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More upfront configuration required&lt;/td&gt;
&lt;td&gt;Very fast time-to-value with agents &amp;amp; pre-built integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Cardinality Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Architected for high-cardinality data&lt;/td&gt;
&lt;td&gt;Can become very expensive at high cardinality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Core Philosophies: Different Approaches to Observability
&lt;/h2&gt;

&lt;p&gt;The biggest difference between Chronosphere and Datadog lies in their core philosophies for handling data volume and cost. Datadog's approach relies on &lt;strong&gt;pre-ingest control&lt;/strong&gt; (filtering before data is sent), while Chronosphere is built around &lt;strong&gt;at-ingest control&lt;/strong&gt; (shaping data as it arrives).&lt;/p&gt;

&lt;h3&gt;
  
  
  Datadog: The "Collect-Then-Pay" Model (Pre-Ingest Control)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cpn0yt2503leqf44po4.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cpn0yt2503leqf44po4.webp" alt="Datadog tracing dashboard" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;Datadog's tracing dashboard (credits: Datadog)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Datadog’s philosophy is to be a single platform for all your observability needs. Its strength is its &lt;strong&gt;plug-and-play nature&lt;/strong&gt;; you install an agent and immediately get value from pre-built dashboards and integrations.&lt;/p&gt;

&lt;p&gt;The model is simple: if data reaches Datadog's platform, you are generally &lt;a href="https://signoz.io/blog/datadog-pricing/" rel="noopener noreferrer"&gt;billed for it&lt;/a&gt;. This places a significant burden on cost control, largely requiring you to manage and reduce data &lt;strong&gt;before it is sent to Datadog&lt;/strong&gt; (pre-ingest control). This often involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent-Side Filtering:&lt;/strong&gt; Manually configuring Datadog Agents or application code to limit the collection of metrics and logs at the source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability Pipelines:&lt;/strong&gt; Utilizing tools like Datadog's Observability Pipelines to filter, sample, or transform data before forwarding it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DIY Pipelines:&lt;/strong&gt; Some teams build and maintain their own intermediary pipelines (e.g., using OpenTelemetry Collectors) for pre-processing.&lt;/li&gt;
&lt;/ul&gt;
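&lt;p&gt;For instance, a DIY pre-processing pipeline of this kind is often just a small OpenTelemetry Collector configuration. The sketch below is a minimal, illustrative fragment (the receiver and exporter names are assumptions) that drops low-severity logs and samples traces before anything reaches a billable backend:&lt;/p&gt;

```yaml
processors:
  batch: {}
  # Filter processor (collector-contrib): drop DEBUG/TRACE log records in the pipeline
  filter/drop-debug-logs:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'
  # Keep only ~20% of traces to reduce ingest volume
  probabilistic_sampler:
    sampling_percentage: 20

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [filter/drop-debug-logs, batch]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```

&lt;p&gt;Shaping data here, rather than in application code, keeps the reduction logic in one place and vendor-neutral.&lt;/p&gt;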

&lt;p&gt;While Datadog offers in-product controls like &lt;a href="https://www.datadoghq.com/blog/logging-without-limits/" rel="noopener noreferrer"&gt;&lt;strong&gt;Logging without Limits™&lt;/strong&gt;&lt;/a&gt; (which decouples log ingest from indexing) and &lt;a href="https://www.datadoghq.com/blog/metrics-without-limits/" rel="noopener noreferrer"&gt;&lt;strong&gt;Metrics without Limits™&lt;/strong&gt;&lt;/a&gt; (for metric tag governance), the fundamental strategy for cost-conscious users heavily relies on shaping data before it fully enters Datadog's billable storage. The trade-off for Datadog's initial simplicity can be a more reactive and complex approach to cost management, often necessitating actions outside or prior to the main platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chronosphere: The "Collect-and-Shape" Model (At-Ingest Control)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll65oj1kny4i8wxs5eth.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll65oj1kny4i8wxs5eth.webp" alt="Chronosphere's Trace Control Plane" width="800" height="436"&gt;&lt;/a&gt;&lt;em&gt;Chronosphere's Trace Control Plane (credits: Chronosphere)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Chronosphere was founded by the creators of &lt;a href="https://m3db.io/" rel="noopener noreferrer"&gt;M3, the open-source metrics engine built at Uber to handle massive scale&lt;/a&gt;. It is built around a core problem: &lt;strong&gt;not all telemetry data is equally valuable&lt;/strong&gt;, yet traditional platforms force you to pay for all of it.&lt;/p&gt;

&lt;p&gt;Its philosophy is to move the entire data-shaping process &lt;strong&gt;&lt;a href="https://chronosphere.io/platform/control-plane/" rel="noopener noreferrer"&gt;inside its platform at the point of ingestion&lt;/a&gt;&lt;/strong&gt;. You can send all your raw, high-fidelity data, and Chronosphere's &lt;strong&gt;control plane&lt;/strong&gt; analyzes it in real-time, allowing you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt; out low-value, noisy data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt; high-cardinality metrics into more useful, lower-cardinality forms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retain&lt;/strong&gt; only the data you need for dashboards and alerts, all governed by rules you define within the platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chronosphere even provides analytics on which data is actually being used, helping you make informed decisions. This approach is designed specifically for cloud-native environments that generate huge volumes of high-cardinality data. The trade-off is that it requires more upfront thought and configuration to define what data is valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detailed Comparison Between Datadog and Chronosphere
&lt;/h2&gt;

&lt;p&gt;Let's dive deeper into how these different philosophies play out in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture and Scalability
&lt;/h3&gt;

&lt;p&gt;Both platforms are SaaS solutions designed for scale, but they are built on different foundations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; uses a proprietary backend that has proven to scale for thousands of customers. Its agent-based architecture is effective for collecting data from a vast number of hosts. However, as telemetry volume and cardinality grow, the onus is on the user to limit data collection at the source to control costs and potential performance issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt; is built on the open-source &lt;strong&gt;&lt;a href="https://m3db.io/" rel="noopener noreferrer"&gt;M3DB&lt;/a&gt;&lt;/strong&gt;, a time-series database designed from the ground up to handle billions of time series and high-cardinality metrics. Its architecture is explicitly optimized for the ephemeral, high-churn nature of Kubernetes environments. This gives engineering teams confidence that the platform won't falter as their microservices footprint grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Management and Data Control
&lt;/h3&gt;

&lt;p&gt;This is the most significant differentiator between the two platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; follows a traditional consumption-based pricing model. You pay per host, per GB of ingested logs, for custom metrics, and for various other product SKUs. This model is complex and can lead to surprise bills. Cost control is a manual and reactive process, often requiring engineers to remove instrumentation or build custom pre-processing pipelines to reduce the data sent to Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronosphere's&lt;/strong&gt; core value proposition is its &lt;strong&gt;telemetry control plane&lt;/strong&gt;. It analyzes all incoming data in real-time and provides tools to dynamically filter, aggregate, or drop low-value data at the point of ingestion.&lt;/p&gt;

&lt;p&gt;For example, you can choose to retain a metric at 10-second resolution for alerting but only store it at 1-minute resolution for long-term trending. This approach can lead to &lt;strong&gt;&lt;a href="https://chronosphere.io/solutions/control-costs/" rel="noopener noreferrer"&gt;significant cost savings&lt;/a&gt;&lt;/strong&gt; by separating data collection from data storage, allowing you to pay only for the data you actually use.&lt;/p&gt;
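&lt;p&gt;Chronosphere configures this kind of shaping through rules in its control plane. The exact schema is product-specific, so the following is only a hypothetical sketch (all field names are illustrative, not actual Chronosphere syntax) of what a two-tier storage policy expresses:&lt;/p&gt;

```yaml
# Hypothetical rollup rule -- illustrative only, NOT actual Chronosphere syntax
rollup_rules:
  - name: http-latency-tiered-retention
    filter: metric_name:http_request_duration_seconds
    storage_policies:
      - resolution: 10s   # full resolution, short retention, used for alerting
        retention: 48h
      - resolution: 1m    # downsampled, long retention, used for trending
        retention: 395d
```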

&lt;h3&gt;
  
  
  Open Source Compatibility &amp;amp; Instrumentation
&lt;/h3&gt;

&lt;p&gt;Your ability to avoid vendor lock-in often comes down to instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt; is built to be &lt;strong&gt;fully compatible with open standards&lt;/strong&gt;. It offers first-class support for Prometheus and OpenTelemetry. You can send data using the Prometheus remote-write protocol, query it with PromQL, and ingest &lt;a href="https://chronosphere.io/platform/open-source-data-collection/" rel="noopener noreferrer"&gt;OpenTelemetry traces without proprietary agents&lt;/a&gt;. This preserves your investment in open-source tooling and skills and ensures your instrumentation is portable if you ever decide to switch platforms.&lt;/p&gt;
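&lt;p&gt;In practice, pointing an existing Prometheus at such a backend is a standard &lt;code&gt;remote_write&lt;/code&gt; stanza. The endpoint URL and credential path below are placeholders, so check your tenant's documentation for the real values:&lt;/p&gt;

```yaml
# prometheus.yml -- endpoint and token path are placeholders
remote_write:
  - url: "https://MY_TENANT.chronosphere.io/data/metrics/api/v1/prom/remote/write"
    authorization:
      credentials_file: /etc/prometheus/chronosphere-token
```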

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; has improved its open-source support and can now ingest OpenTelemetry data. However, its ecosystem is still heavily centered around the &lt;strong&gt;Datadog Agent&lt;/strong&gt; and proprietary APM libraries. To get the most out of the platform, you are often encouraged to use their tools, which can lead to vendor lock-in. While you can use PromQL-like syntax, you cannot use native PromQL for queries and alerts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Platform Breadth &amp;amp; Ease of Use
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; is the clear winner on &lt;strong&gt;breadth of features&lt;/strong&gt;. It's a true "one-stop shop" that covers everything from infrastructure monitoring and logs to RUM, security, and more. With &lt;strong&gt;1,000+ pre-built integrations&lt;/strong&gt; and default dashboards, its &lt;strong&gt;time-to-value is extremely fast&lt;/strong&gt;. The UI is polished and generally considered intuitive, though it can feel complex due to the sheer number of features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt; is more &lt;strong&gt;focused on core observability (MELT)&lt;/strong&gt; for cloud-native engineers. Its logging and APM capabilities are newer and less mature than Datadog's. Getting started requires more initial effort to configure data sources and build dashboards. However, its user experience is purpose-built for troubleshooting workflows, providing a more curated and less overwhelming interface for on-call engineers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting and On-Call Experience
&lt;/h3&gt;

&lt;p&gt;Reducing alert fatigue is a key goal for modern observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; provides a highly flexible alerting system. You can create threshold-based alerts, anomaly detection monitors, and composite alerts. However, this flexibility makes it easy for teams to create too many low-signal alerts, contributing to alert fatigue. While Datadog offers an SLO module, it's up to the teams to implement best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt; takes an &lt;strong&gt;opinionated approach to improving the on-call experience&lt;/strong&gt;. It heavily promotes &lt;strong&gt;Timeslice SLOs&lt;/strong&gt;, a time-windowed SLO approach (which Datadog also supports), to &lt;a href="https://chronosphere.io/learn/timeslice-slos-track-system-health/" rel="noopener noreferrer"&gt;reduce false alarms&lt;/a&gt;. Timeslice SLOs measure reliability in time windows (e.g., "99.9% of 5-minute intervals were successful") rather than on raw event counts. This makes alerts less sensitive to brief, insignificant spikes, allowing on-call engineers to focus on sustained issues that truly impact users.&lt;/p&gt;
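&lt;p&gt;The timeslice idea can be sketched with standard Prometheus recording rules (the metric names here are hypothetical): record a 1/0 value for each 5-minute slice, then average those values over the SLO period:&lt;/p&gt;

```yaml
groups:
  - name: timeslice-slo
    interval: 5m
    rules:
      # 1 if this 5-minute slice met the 99.9% success target, else 0
      - record: slo:slice_good
        expr: |
          (
            sum(rate(http_requests_total{code!~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m]))
          ) >= bool 0.999
      # Fraction of good slices over the last 30 days
      - record: slo:timeslice_compliance_30d
        expr: avg_over_time(slo:slice_good[30d])
```

&lt;p&gt;A brief spike only spoils one slice; an alert on &lt;code&gt;slo:timeslice_compliance_30d&lt;/code&gt; fires only when enough slices fail, which is what keeps paging noise down.&lt;/p&gt;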

&lt;h2&gt;
  
  
  SigNoz: The OpenTelemetry-Native Alternative to Datadog
&lt;/h2&gt;

&lt;p&gt;While Datadog and Chronosphere offer powerful proprietary solutions, there is a third path: an open-source, OpenTelemetry-native platform like &lt;strong&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt;&lt;/strong&gt;. SigNoz was built from the ground up to be the best choice for teams committed to open standards and cost control.&lt;/p&gt;

&lt;h3&gt;
  
  
  True OpenTelemetry-Native Support
&lt;/h3&gt;

&lt;p&gt;SigNoz is built natively on OpenTelemetry, while Datadog prioritizes its proprietary agent. This OTel-native approach provides key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OTel-first Documentation:&lt;/strong&gt; Clear instructions make it simple to integrate any &lt;a href="https://signoz.io/docs/instrumentation/" rel="noopener noreferrer"&gt;data source&lt;/a&gt; instrumented with OpenTelemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Exception Tracking:&lt;/strong&gt; SigNoz automatically captures and displays exceptions from OTel trace data in a &lt;a href="https://signoz.io/exceptions-monitoring/" rel="noopener noreferrer"&gt;dedicated tab&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Custom Metric Penalty:&lt;/strong&gt; Custom metrics in Datadog are charged separately and can get very expensive at scale. SigNoz treats all metrics equally, with simple and affordable &lt;a href="https://signoz.io/pricing/" rel="noopener noreferrer"&gt;pricing&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Up to 9x Better Value for Money
&lt;/h3&gt;

&lt;p&gt;SigNoz can save you up to &lt;a href="https://signoz.io/blog/pricing-comparison-signoz-vs-datadog-vs-newrelic-vs-grafana/" rel="noopener noreferrer"&gt;80% on your Datadog bill&lt;/a&gt; by addressing common pricing pain points.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple Usage-Based Pricing:&lt;/strong&gt; Unlike Datadog's complex SKU-based pricing, SigNoz offers a straightforward plan based on the &lt;a href="https://signoz.io/pricing/" rel="noopener noreferrer"&gt;amount of data you send&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Host-Based Pricing:&lt;/strong&gt; Datadog's per-host pricing model often forces teams to pack more services onto a single host just to control costs. SigNoz has no per-host charges, allowing you to architect your systems for performance, not to appease a vendor's pricing model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular Cost Controls:&lt;/strong&gt; You can control infrastructure monitoring costs by choosing exactly which metrics to send from the OpenTelemetry Collector. Features like Ingest Guard allow you to set data ingestion limits for different &lt;a href="https://signoz.io/blog/introducing-ingest-guard-feature/" rel="noopener noreferrer"&gt;teams or environments&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flexible Deployment Options
&lt;/h3&gt;

&lt;p&gt;SigNoz offers deployment models to fit any need, from a free open-source community edition to a fully managed cloud service and a self-hosted enterprise version for organizations with strict data privacy requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started with SigNoz
&lt;/h3&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing between Chronosphere and Datadog is a decision about priorities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; offers unparalleled breadth and ease of use, making it an excellent choice for teams who want a comprehensive, batteries-included solution and are willing to manage the associated costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt; provides a powerful, targeted solution for cloud-native companies struggling with data volume and alert noise, offering significant cost savings and a better on-call experience in exchange for a more focused feature set.&lt;/p&gt;

&lt;p&gt;The right platform depends on your organization's scale, technical maturity, and budget. As you evaluate your options, consider whether an open-source, OpenTelemetry-native platform like SigNoz might offer the perfect balance of power, flexibility, and control.&lt;/p&gt;

&lt;p&gt;Hope we answered all your questions regarding Chronosphere vs Datadog. If you have more questions, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from observability nerds at SigNoz and get open-source, OpenTelemetry, and devtool-building stories straight to your inbox.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Sumo Logic vs. Datadog: The Definitive Comparison for 2025</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Wed, 12 Nov 2025 08:09:19 +0000</pubDate>
      <link>https://dev.to/signoz/sumo-logic-vs-datadog-the-definitive-comparison-for-2025-1jo2</link>
      <guid>https://dev.to/signoz/sumo-logic-vs-datadog-the-definitive-comparison-for-2025-1jo2</guid>
      <description>&lt;p&gt;Datadog and Sumo Logic, both offer powerful tools to monitor applications. While they often appear in the same conversations, they were built with different core philosophies that shape their features, costs, and the day-to-day experience for engineers.&lt;/p&gt;

&lt;p&gt;Datadog is widely recognized as a market leader in infrastructure and application performance monitoring (APM). Sumo Logic, conversely, established its roots in log management and security analytics, positioning itself as a converged platform for both observability and security operations.&lt;/p&gt;

&lt;p&gt;This article provides a definitive, deep-dive comparison of Sumo Logic vs. Datadog. We'll go beyond marketing features to explore the technical details that matter most during implementation and incident response, from data path control to agent overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Focus and Philosophy
&lt;/h2&gt;

&lt;p&gt;Understanding the origins of each platform is key to grasping their current strengths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog's focus is on unified, real-time performance monitoring.&lt;/strong&gt; It began by providing deep visibility into infrastructure metrics and has since expanded into a comprehensive, all-in-one observability platform. It's designed for DevOps and SRE teams who need to quickly diagnose performance issues across a complex stack, from frontend user experience down to the underlying network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i5va1qdyqcxfw6wfzmf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i5va1qdyqcxfw6wfzmf.webp" alt="Datadog's Request Flow Map Dashboard" width="800" height="395"&gt;&lt;/a&gt;&lt;em&gt;Datadog's Request Flow Map Dashboard (credits: Datadog)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic’s focus is on converged security and log analytics.&lt;/strong&gt; It's designed to ingest and analyze massive volumes of log data. This foundation makes it exceptionally strong for deep troubleshooting, compliance, and security investigations. Its key differentiator is the native integration of a Security Information and Event Management (SIEM) solution, creating a single source of truth for development, security, and operations teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoh21u4hfr96aupbpy4u.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoh21u4hfr96aupbpy4u.webp" alt="Sumo Logic's Logs Dashboard" width="800" height="476"&gt;&lt;/a&gt;&lt;em&gt;Sumo Logic's Logs Dashboard (credits: Sumo Logic)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;p&gt;While both platforms cover the three pillars of observability (logs, metrics, and traces), their feature sets and depth vary significantly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature / Capability&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;Sumo Logic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure Monitoring&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log Management &amp;amp; Analytics&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;APM (Tracing &amp;amp; Profiling)&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real User &amp;amp; Synthetic Monitoring&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✓ (Limited)*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud SIEM&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud SOAR (Security Orchestration)&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Security Posture Management (CSPM)&lt;/td&gt;
&lt;td&gt;✓✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in Incident Management&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generous Free Tier&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;✓✓ = feature is fully available · ✓ = partial or limited feature · ✗ = feature is not available&lt;/p&gt;

&lt;p&gt;&lt;a href="https://help.sumologic.com/docs/apm/real-user-monitoring/" rel="noopener noreferrer"&gt;Sumo provides native RUM&lt;/a&gt;; synthetic tests are typically surfaced via provider integrations rather than a deep first-party suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Under the Hood: Sumo Logic vs Datadog
&lt;/h2&gt;

&lt;p&gt;Beyond feature lists, experienced engineers need to know how these platforms behave under load, how data flows before it hits your bill, and what the operational experience is like during a real incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Pipeline Control and Cost Management
&lt;/h3&gt;

&lt;p&gt;Controlling telemetry before you pay for it is critical for managing costs at scale. Let's explore how both platforms handle different aspects of data pipeline control and cost management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace Sampling&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; defaults to head-based sampling in its tracers, meaning the decision to keep or drop a trace is made at the beginning of a request. You can achieve more intelligent tail-based sampling (making the decision at the end, once the full trace context is available) by running the OpenTelemetry Collector in front of Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic&lt;/strong&gt; ships an OpenTelemetry-based collector that allows for rule-driven filtering and shaping of traces before they leave your environment, giving you direct control over data volume and cost.&lt;/p&gt;
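&lt;p&gt;The difference between the two decision points can be sketched in a few lines of Python. This is a simplified model; real tail sampling (e.g., the OpenTelemetry Collector's tail sampling processor) is configured with policies rather than written by hand:&lt;/p&gt;

```python
import random

def head_sample(trace_id: int, rate: float = 0.1) -> bool:
    """Head-based: decided at request start, before any spans exist,
    so error traces are dropped at the same rate as healthy ones."""
    return random.Random(trace_id).random() < rate

def tail_sample(trace: dict, rate: float = 0.1) -> bool:
    """Tail-based: decided once the full trace is assembled, so policies
    like 'always keep error traces' become possible."""
    if any(span["error"] for span in trace["spans"]):
        return True
    return random.Random(trace["trace_id"]).random() < rate

error_trace = {"trace_id": 7, "spans": [{"error": False}, {"error": True}]}
print(tail_sample(error_trace))  # True: error traces are never dropped
```

&lt;p&gt;The trade-off is buffering: a tail-based sampler must hold spans in memory until each trace completes, which is why it runs in a collector rather than in the application process.&lt;/p&gt;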

&lt;p&gt;&lt;strong&gt;Log Ingestion and Filtering&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; processes logs through a series of sequential pipelines before indexing. In these pipelines, you define rules to parse, enrich, or filter your logs. For example, you can create a rule to drop logs with a certain status code to control costs. This pre-processing is powerful but requires you to define the structure of your data upfront.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic&lt;/strong&gt;, on the other hand, uses &lt;a href="https://help.sumologic.com/docs/manage/field-extractions/" rel="noopener noreferrer"&gt;Field Extraction Rules (FERs)&lt;/a&gt;. This allows you to apply parsing logic either as logs are ingested or, more flexibly, at the time you run a query. This "schema-on-read" approach is ideal for unstructured data because you don't need to know how you want to search a log at the time it's collected. However, it means that investigations often rely more heavily on crafting complex queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold Storage and Rehydration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both platforms allow you to archive logs to your own S3 bucket to save costs, but their retrieval mechanics differ.&lt;/p&gt;

&lt;p&gt;Datadog’s &lt;a href="https://docs.datadoghq.com/logs/log_configuration/rehydrating/" rel="noopener noreferrer"&gt;&lt;strong&gt;Log Rehydration&lt;/strong&gt;&lt;/a&gt; reads archived objects from your bucket for the selected &lt;strong&gt;time window&lt;/strong&gt;, then applies your query. Because the query is evaluated &lt;strong&gt;after&lt;/strong&gt; the archive files for that time range are downloaded, &lt;strong&gt;scan size and cloud data-transfer costs&lt;/strong&gt; depend primarily on the time window you choose, not just the query selectivity. Narrowing the time window is the best way to reduce scan size and retrieval cost.&lt;/p&gt;

&lt;p&gt;Also, rehydration only supports specific &lt;strong&gt;S3 storage classes&lt;/strong&gt;: &lt;strong&gt;Standard&lt;/strong&gt;, &lt;strong&gt;Standard-IA&lt;/strong&gt;, &lt;strong&gt;One Zone-IA&lt;/strong&gt;, &lt;strong&gt;Glacier Instant Retrieval&lt;/strong&gt;, and &lt;strong&gt;Intelligent-Tiering&lt;/strong&gt; (only if the asynchronous archive tiers are disabled), as documented in &lt;a href="https://docs.datadoghq.com/logs/log_configuration/archives" rel="noopener noreferrer"&gt;Datadog's archive configuration guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The implication is&lt;/strong&gt; that if your S3 lifecycle policies automatically move logs to these colder, cheaper storage tiers to save money, you won't be able to rehydrate them in Datadog without first manually restoring them to a supported class. This adds extra steps and time during an incident or audit when you need urgent access to old logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic&lt;/strong&gt; allows on-demand ingestion from your S3 archive with a 5-minute granularity, pulling data back into the platform when needed.&lt;/p&gt;

&lt;p&gt;The term "on-demand ingestion" means you can selectively re-ingest data from a specific time range when you need it. The &lt;strong&gt;"5-minute granularity"&lt;/strong&gt; refers to the precision with which you can specify this time range. For example, you can tell Sumo Logic to pull all logs from &lt;code&gt;10:05 PM&lt;/code&gt; to &lt;code&gt;10:10 PM&lt;/code&gt; on a specific date, allowing you to narrow your focus and control costs by only re-ingesting the exact data you need for an investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Performance and Overhead
&lt;/h3&gt;

&lt;p&gt;The resource footprint of the collector agent is a key planning consideration, especially on busy hosts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog Agent&lt;/strong&gt;: The APM path is CPU-bound and scales with spans per second. When CPU is constrained, the Agent buffers unprocessed payloads in memory, which can increase memory usage and risk drops. For sizing, Datadog publishes guidance by throughput—for example, ~70 MB at ~58k spans/s and ~130 MB at ~130k spans/s (Agent 7.39 benchmarks according to &lt;a href="https://docs.datadoghq.com/tracing/troubleshooting/agent_apm_resource_usage/" rel="noopener noreferrer"&gt;Datadog's agent resource guide&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic Collector&lt;/strong&gt;: This is a Java process with a default heap of 128 MB; planning for 256–512 MB is common depending on sources and volume. It’s designed to handle up to ~15,000 events/sec per collector before you scale out.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Which collector is this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installed Collector&lt;/strong&gt; is Sumo’s &lt;strong&gt;Java-based&lt;/strong&gt; collector (the one with a default &lt;strong&gt;128 MB heap&lt;/strong&gt;, with guidance to plan &lt;strong&gt;256–512 MB&lt;/strong&gt; depending on sources and volume). &lt;a href="https://help.sumologic.com/docs/send-data/choose-collector-source/#installed-collectors-installed-agent" rel="noopener noreferrer"&gt;&lt;strong&gt;Sumo Logic Distribution for OpenTelemetry&lt;/strong&gt; is a separate OTel-based collector&lt;/a&gt; with different packaging and management; choose it if you want OTel semantics and remote management at scale.&lt;/p&gt;
&lt;/blockquote&gt;
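&lt;p&gt;For rough capacity planning, the two published Datadog benchmark points can be linearly interpolated. Treat the result as a starting estimate only, since real usage varies with span size, CPU headroom, and configuration:&lt;/p&gt;

```python
# Linear interpolation between the Agent 7.39 benchmark points cited above:
# ~70 MB at ~58k spans/s and ~130 MB at ~130k spans/s. A planning sketch,
# not a guarantee; constrained CPU can inflate memory well beyond this line.
POINTS = [(58_000, 70.0), (130_000, 130.0)]  # (spans per second, MB)

def estimate_agent_mb(spans_per_sec: float) -> float:
    (x0, y0), (x1, y1) = POINTS
    return y0 + (y1 - y0) / (x1 - x0) * (spans_per_sec - x0)

print(round(estimate_agent_mb(94_000)))  # roughly 100 MB at mid-range load
```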

&lt;h3&gt;
  
  
  The "Life During an Incident" Experience
&lt;/h3&gt;

&lt;p&gt;How you query and investigate during an outage is a crucial differentiator.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Datadog&lt;/strong&gt;, an investigation is often a structured, UI-driven workflow. You might start with a dashboard showing a spike in errors, click on a failing service to view its traces, and then pivot to the logs associated with those specific traces. Because data is parsed and tagged upfront in pipelines, filtering is fast and intuitive. This guided experience is excellent for quickly narrowing down known issues.&lt;/p&gt;

&lt;p&gt;An investigation in &lt;strong&gt;Sumo Logic&lt;/strong&gt; is typically query-driven and more exploratory. You might start by writing a broad query to search for error messages across all logs from the last 15 minutes. From there, you would iteratively refine the query, adding keywords, parsing fields on the fly, and grouping results to hunt for anomalies. This approach is incredibly powerful for investigating novel or unexpected issues where the data structure isn't known in advance, which is common in security incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security, Compliance, and Data Residency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Security Stack Depth
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; offers a broad security stack, including Cloud SIEM, Application Security Monitoring (ASM), CSPM, vulnerability scanning, and a &lt;a href="https://www.datadoghq.com/solutions/soar/" rel="noopener noreferrer"&gt;&lt;strong&gt;first-party SOAR&lt;/strong&gt;&lt;/a&gt; capability integrated with Cloud SIEM and Workflow Automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic&lt;/strong&gt; provides a deeply integrated Cloud SIEM with rich SecOps features and a native &lt;a href="https://www.sumologic.com/help/release-notes-csoar/2025/04/21/content/" rel="noopener noreferrer"&gt;&lt;strong&gt;Cloud SOAR&lt;/strong&gt;&lt;/a&gt; for automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Residency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; operates multiple sites (US/EU/APAC including Japan/AP1). Always verify product availability per site during evaluation using &lt;a href="https://docs.datadoghq.com/getting_started/site/" rel="noopener noreferrer"&gt;Datadog's published site availability guidance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic&lt;/strong&gt; pins your account to a chosen AWS deployment region, and data stays within that region. Note that the &lt;a href="https://help.sumologic.com/release-notes-developer/2025/04/09/api/" rel="noopener noreferrer"&gt;&lt;strong&gt;India (Mumbai) region was deprecated on April 30, 2025&lt;/strong&gt;&lt;/a&gt;, with access fully terminating April 30, 2026—confirm current region availability during procurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  User Experience and Learning Curve
&lt;/h2&gt;

&lt;p&gt;The day-to-day experience of using each platform is quite different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; is widely praised for its polished, intuitive, and user-friendly UI. It offers many out-of-the-box dashboards and a guided workflow that makes it easy for new users to get started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic's UI is powerful but complex, with a steeper learning curve&lt;/strong&gt;. Its interface is built around a query-centric model. On training, both vendors provide free self-paced learning; Sumo also offers free public instructor-led virtual classes, while both charge exam fees for certifications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Models and Total Cost of Ownership
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; uses a modular, per-product model with several billing dimensions. At a minimum, you’ll size hosts for Infrastructure and/or APM, then layer on usage-based items like logs, RUM sessions, and custom metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Datadog SKUs &amp;amp; Units — &lt;em&gt;list pricing as of Oct 2025&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Pro&lt;/strong&gt;: $15 per host/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APM&lt;/strong&gt;: $31 per APM host/month, which also includes a monthly bundle of &lt;strong&gt;1M indexed spans&lt;/strong&gt; and &lt;strong&gt;150 GB ingested spans&lt;/strong&gt; per APM host.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: Two levers—ingest and indexing. Ingest is &lt;strong&gt;$0.10/GB&lt;/strong&gt;; indexing is priced per 1M events and scales with retention (e.g., &lt;strong&gt;7 days $1.27&lt;/strong&gt;, &lt;strong&gt;15 days $1.70&lt;/strong&gt;, &lt;strong&gt;30 days $2.50&lt;/strong&gt; per 1M events). Flex Logs adds a cheaper storage tier with separate query compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom metrics&lt;/strong&gt;: Billed per 100 custom metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RUM/Product Analytics&lt;/strong&gt;: Billed per 1,000 sessions. For example, &lt;strong&gt;RUM – Measure&lt;/strong&gt; is &lt;strong&gt;$0.15 per 1K sessions&lt;/strong&gt; according to &lt;a href="https://www.datadoghq.com/pricing/list/" rel="noopener noreferrer"&gt;Datadog's list pricing&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
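&lt;p&gt;A back-of-the-envelope estimator built from these list prices shows how the dimensions combine (illustrative only; real bills add commitments, overages, and further SKUs such as RUM and custom metrics):&lt;/p&gt;

```python
# Rough Datadog monthly estimate from the Oct 2025 list prices above.
# Only the four biggest levers are modeled; RUM, custom metrics, and
# Synthetics would be added on top.
RETENTION_RATE_PER_M = {7: 1.27, 15: 1.70, 30: 2.50}  # $ per 1M indexed events

def estimate_datadog_monthly(infra_hosts, apm_hosts, log_ingest_gb,
                             log_indexed_millions, retention_days=15):
    infra = infra_hosts * 15.00        # Infrastructure Pro, per host
    apm = apm_hosts * 31.00            # APM, per host (includes span bundle)
    ingest = log_ingest_gb * 0.10      # log ingest, per GB
    index = log_indexed_millions * RETENTION_RATE_PER_M[retention_days]
    return infra + apm + ingest + index

# 20 infra hosts, 10 APM hosts, 2 TB of logs ingested, 500M events indexed:
print(round(estimate_datadog_monthly(20, 10, 2000, 500), 2))  # 1660.0
```

&lt;p&gt;Note how log indexing dominates this example; that is typical, and it is why retention tiering and selective indexing are the first cost levers teams reach for.&lt;/p&gt;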

&lt;p&gt;For a deeper dive into Datadog pricing, check out our article on &lt;a href="https://signoz.io/blog/datadog-pricing/" rel="noopener noreferrer"&gt;Datadog Pricing Main Caveats Explained&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sumo Logic&lt;/strong&gt; primarily uses &lt;strong&gt;Flex Licensing&lt;/strong&gt;, which decouples log ingest from analytics: ingest is $0 and users are unlimited, while you pay for storage and scan volume, tracked via credits. This favors a “log everything, pay when you analyze” approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Sumo Logic Flex Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0 Ingest for Logs&lt;/strong&gt;: For non-SIEM logs, credits are consumed by stored volume and scans. Scans happen whenever queries, dashboards, or monitors traverse data. Sumo provides “scans per GB ingested” profiles (e.g., 500–750, 750–1500, 1500–2000) to help you budget based on analytics intensity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: These are measured in Data Points Per Minute (DPM) for billing and reporting, separate from log scans.&lt;/li&gt;
&lt;/ul&gt;
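&lt;p&gt;The budgeting logic can be sketched as follows. The profile multipliers are midpoints of the ranges above (an assumption for illustration), and credit-to-dollar conversion is contract-specific, so the sketch stops at relative scan volume rather than dollars:&lt;/p&gt;

```python
# Estimating relative scan volume under Flex licensing: cost follows what
# you scan, not what you ingest. Multipliers are midpoints of the
# "scans per GB ingested" profiles cited above (illustrative assumption).
SCANS_PER_GB = {"light": 625, "moderate": 1125, "heavy": 1750}

def monthly_scan_volume(ingest_gb: float, profile: str) -> float:
    """Scan activity implied by a month of ingest at a given
    analytics-intensity profile (queries, dashboards, monitors)."""
    return ingest_gb * SCANS_PER_GB[profile]

# The same 100 GB of ingest costs ~2.8x more to analyze under a
# dashboard-heavy workload than under a light one:
print(monthly_scan_volume(100, "heavy") / monthly_scan_volume(100, "light"))  # 2.8
```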

&lt;h3&gt;
  
  
  What Actually Drives Your Bill
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;Datadog&lt;/strong&gt;, your bill is primarily driven by the number of infrastructure and APM hosts. It's important to monitor auto-scaling and ephemeral nodes, as APM host counting can be based on a high-watermark model.&lt;/p&gt;

&lt;p&gt;Beyond hosts, APM costs are affected by the volume of indexed and ingested spans that exceed the included bundle. You can control this by tuning sampling at the tracer or with an OpenTelemetry Collector.&lt;/p&gt;

&lt;p&gt;Logs are often the biggest variable. Costs can be managed by trimming data at ingest with agent filters, selectively indexing only high-value streams with appropriate retention, and using Flex Logs for less frequently accessed data.&lt;/p&gt;

&lt;p&gt;Finally, costs for products like RUM and Synthetics are event and session-based, so it's wise to forecast traffic peaks. For custom metrics, costs can be controlled by reducing cardinality and using aggregations.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Sumo Logic&lt;/strong&gt;, the main &lt;em&gt;driver of your bill is scan intensity&lt;/em&gt;. The more you query, especially with wide time ranges or numerous dashboards, the more scan credits you will consume. This can be managed by right-sizing time ranges and using targeted filters.&lt;/p&gt;

&lt;p&gt;Storage and retention are also key factors. You choose the retention period for each data source, and older data kept in "hot" storage costs more than data in cheaper tiers or S3.&lt;/p&gt;

&lt;p&gt;Activating security features like Cloud SIEM or Cloud SOAR will be a separate entitlement with its own credit rules.&lt;/p&gt;

&lt;p&gt;Lastly, for metrics, high-frequency ingestion increases your Data Points Per Minute (DPM). Downsampling where possible is recommended to control these costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Which Platform Is Right for You?
&lt;/h2&gt;

&lt;p&gt;The choice ultimately depends on your primary pain points and team structure.&lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Datadog&lt;/strong&gt; if your main priority is best-in-class APM and infrastructure monitoring with a rich set of out-of-the-box dashboards and a user-friendly UI. It allows teams to become productive quickly, but be prepared to actively manage costs.&lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Sumo Logic&lt;/strong&gt; if your work is log-centric, with a strong focus on security operations and compliance. Its native SIEM and SOAR capabilities, flexible query-driven investigations, and strong compliance posture (especially PCI DSS) make it ideal for SecOps and regulated environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs, Metrics, Traces in One Place: Meet SigNoz
&lt;/h3&gt;

&lt;p&gt;If you are weighing Sumo Logic and Datadog, add &lt;strong&gt;SigNoz Cloud&lt;/strong&gt; to your shortlist. You keep your OpenTelemetry setup, get one place to investigate issues, and avoid agent lock-in. For a side-by-side view, see the &lt;a href="https://signoz.io/product-comparison/signoz-vs-datadog/" rel="noopener noreferrer"&gt;&lt;strong&gt;SigNoz vs Datadog comparison&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why teams evaluating Sumo and Datadog choose SigNoz Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;One UI for incident work&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correlate a slow trace with related logs and service metrics in a click. No context switching, faster root cause.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenTelemetry first&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep the same OTel Collector you already run. Point it to SigNoz Cloud and ship OTLP without re-instrumenting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Starts hosted, stays flexible&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Begin on Cloud for speed. If policy changes, move to BYOC or self-host without changing your instrumentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clear, predictable pricing:&lt;/strong&gt; Starts at &lt;strong&gt;$49/month&lt;/strong&gt;; then pay for what you use (&lt;strong&gt;$0.30/GB&lt;/strong&gt; for logs and traces, &lt;strong&gt;$0.10 per million&lt;/strong&gt; metric samples). Unlimited teammates. &lt;a href="https://signoz.io/pricing/" rel="noopener noreferrer"&gt;See pricing&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
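&lt;p&gt;For comparison with the SKU math earlier in this article, here is a simplified estimate from these numbers. It treats the $49 base as purely additive, which is an assumption for illustration; see the pricing page for exact tier mechanics:&lt;/p&gt;

```python
# Simplified SigNoz Cloud monthly estimate from the prices above:
# $49 base, $0.30/GB for logs and traces, $0.10 per million metric samples.
# Assumes the base fee is purely additive (an illustration; check the
# pricing page for how included usage is handled).

def estimate_signoz_monthly(logs_traces_gb: float, metric_samples_m: float) -> float:
    return 49 + 0.30 * logs_traces_gb + 0.10 * metric_samples_m

# 500 GB of logs and traces plus 200M metric samples:
print(round(estimate_signoz_monthly(500, 200), 2))  # 219.0
```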

&lt;h3&gt;
  
  
  Get Started with SigNoz
&lt;/h3&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switching from Datadog?&lt;/strong&gt; Follow the &lt;strong&gt;&lt;a href="https://signoz.io/docs/migration/migrate-from-datadog-to-signoz/" rel="noopener noreferrer"&gt;Datadog → SigNoz migration guide&lt;/a&gt;&lt;/strong&gt; to map agents, pipelines, and dashboards.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>tooling</category>
      <category>performance</category>
    </item>
    <item>
      <title>Datadog vs Zabbix in 2025 - Features, Pricing, On-prem vs SaaS, and More</title>
      <dc:creator>Ankit Anand ✨</dc:creator>
      <pubDate>Wed, 12 Nov 2025 08:08:58 +0000</pubDate>
      <link>https://dev.to/signoz/datadog-vs-zabbix-in-2025-features-pricing-on-prem-vs-saas-and-more-473a</link>
      <guid>https://dev.to/signoz/datadog-vs-zabbix-in-2025-features-pricing-on-prem-vs-saas-and-more-473a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When to Choose Each&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose Datadog if:&lt;/strong&gt; You need a managed (SaaS) platform for full-stack observability (metrics, traces, logs). Your team values speed, minimal setup, and a polished UI, and you have the budget for a per-use subscription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose Zabbix if:&lt;/strong&gt; Your primary need is on-premise infrastructure and network monitoring. You prioritize zero licensing cost and total data control, and your team has the expertise to install, manage, and scale a self-hosted solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider SigNoz if:&lt;/strong&gt; You want the all-in-one (metrics, traces, logs) experience of Datadog but with the open-source, self-hosted flexibility. You're looking for a cost-effective, OpenTelemetry-native alternative without Datadog's high cost or Zabbix's lack of native APM.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Teams looking for a monitoring solution face a key choice: a managed, all-in-one SaaS platform or a powerful, self-hosted open-source tool. Datadog and Zabbix are prime examples of these two distinct models. Datadog is a cloud-native observability platform, while Zabbix is a comprehensive, open-source solution you run yourself, known for its deep capabilities in network and infrastructure monitoring.&lt;/p&gt;

&lt;p&gt;In this article, we will help you decide which one fits your specific stack, team, and budget by breaking down their architecture, core features, pricing, deployment models, and ideal use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Datadog vs Zabbix: At a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Zabbix&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS (Cloud-Hosted)&lt;/td&gt;
&lt;td&gt;Open-Source (Self-Hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full Observability (Metrics, Traces, Logs)&lt;/td&gt;
&lt;td&gt;Infrastructure &amp;amp; Network Monitoring (Metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subscription (per-host, per-GB)&lt;/td&gt;
&lt;td&gt;Free Software (TCO: Hardware + Personnel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;APM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, built-in&lt;/td&gt;
&lt;td&gt;No (Requires external tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, built-in&lt;/td&gt;
&lt;td&gt;No (Basic log file monitoring only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast (create account, install agent)&lt;/td&gt;
&lt;td&gt;Manual (install server, DB, agent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud-native, microservices, DevOps teams&lt;/td&gt;
&lt;td&gt;On-prem, network hardware, budget-conscious&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let’s go through these aspects in detail in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview and History
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datadog
&lt;/h3&gt;

&lt;p&gt;Datadog was founded in 2010 as a cloud-based infrastructure monitoring service, designed to bridge the gap between developer and operations teams. Since then, it has expanded through acquisitions and development into a broad, full-stack observability platform. As a public company, it has become a leader in the monitoring industry, especially for organizations adopting cloud-native technologies. Its SaaS-first model emphasizes ease of use and integration with modern tech stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zabbix
&lt;/h3&gt;

&lt;p&gt;Zabbix began in the early 2000s, with its first stable release in 2004. It has matured over two decades into an enterprise-grade, open-source monitoring solution. Zabbix LLC, the company behind the tool, was established in 2005 in Latvia. As free open-source software (AGPL license), it has a large global community and is designed to track IT infrastructure, including networks, servers, and cloud services. The company offers tiered professional support for organizations that require guaranteed assistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Features and Capabilities
&lt;/h2&gt;

&lt;p&gt;While both are comprehensive monitoring tools, their scope and focus differ significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring Capabilities and Data Collection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; offers end-to-end monitoring. It collects metrics from servers, cloud instances, and containers, but its platform also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application Performance Monitoring (APM):&lt;/strong&gt; Provides distributed tracing to analyze application code performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Management:&lt;/strong&gt; Aggregates and analyzes logs from all your services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-Stack Correlation:&lt;/strong&gt; Connects metrics, traces, and logs, allowing you to move from a high-level infrastructure metric to a specific trace or log line causing an issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data collection uses a lightweight agent and over 1,000 official integrations for technologies like AWS, Kubernetes, and Slack. It also features &lt;strong&gt;Watchdog&lt;/strong&gt;, a machine-learning engine that automatically detects anomalies without manual threshold setting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zabbix&lt;/strong&gt; focuses on infrastructure and network monitoring. It tracks metrics like CPU load, memory usage, disk space, and network throughput. Its data collection is highly flexible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zabbix Agent:&lt;/strong&gt; An agent installed on hosts to collect metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentless Monitoring:&lt;/strong&gt; Zabbix excels at agentless collection via &lt;strong&gt;SNMP&lt;/strong&gt; (for network devices like routers and switches), &lt;strong&gt;IPMI&lt;/strong&gt; (for hardware sensors), and &lt;strong&gt;JMX&lt;/strong&gt; (for Java applications).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Scripts:&lt;/strong&gt; Users can define custom metrics (called "items") and triggers for alerting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zabbix provides over 300 official templates for common hardware and software. However, it does not natively provide &lt;strong&gt;distributed tracing/APM&lt;/strong&gt; or a dedicated &lt;strong&gt;log analytics datastore&lt;/strong&gt; (it supports &lt;strong&gt;log file monitoring&lt;/strong&gt; via agent items). These functions require integrating external tools.&lt;/p&gt;
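&lt;p&gt;As a quick illustration of the agent and agentless paths, the commands below poll a single value ad hoc. The IP addresses are placeholders, and the SNMP check assumes the net-snmp tools are installed:&lt;/p&gt;

```shell
# Ask a Zabbix agent for one item directly (zabbix_get ships with Zabbix).
# 192.0.2.10 is a placeholder host running zabbix-agent.
zabbix_get -s 192.0.2.10 -k system.cpu.load[all,avg1]

# Agentless SNMP sanity check against a network device (placeholder address,
# default "public" community) before wiring it into Zabbix:
snmpwalk -v2c -c public 192.0.2.1 ifDescr
```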

&lt;h3&gt;
  
  
  Visualization and Dashboards
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; provides a clean, modern, and intuitive web interface. When you enable an integration, you often get a pre-built, high-quality dashboard automatically. Users can create custom dashboards with a drag-and-drop editor, filtering data in real-time using tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zabbix&lt;/strong&gt; offers a highly customizable, widget-based dashboard system. While powerful, it has a steeper learning curve. The UI has improved significantly in recent versions but is generally considered less polished. Administrators can design complex views, including &lt;strong&gt;network topology maps&lt;/strong&gt; and slide shows for Network Operations Center (NOC) displays, but this requires more manual effort.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting and Notifications
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Datadog's&lt;/strong&gt; alerting is both threshold-based and ML-driven. The &lt;strong&gt;Watchdog&lt;/strong&gt; feature can surface "unknown unknowns" that a static threshold would miss. Notifications are easily configured for a wide variety of channels, including Slack, PagerDuty, and email, with support for alert grouping to reduce noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zabbix's&lt;/strong&gt; alerting is based on its powerful "trigger" system. Users define granular, threshold-based rules for any metric. Its key strength is its built-in &lt;strong&gt;escalation and remediation&lt;/strong&gt; capability. You can configure complex workflows, such as notifying a junior admin, then a senior admin, and if the issue persists, automatically running a script to restart a service.&lt;/p&gt;
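&lt;p&gt;For a sense of what a trigger looks like, here is a sketch in the expression syntax used since Zabbix 6.0; the host name and thresholds are illustrative:&lt;/p&gt;

```conf
# Problem expression: fire when the 5-minute average CPU load on
# host "Web server" stays above 2
avg(/Web server/system.cpu.load,5m)>2

# Optional recovery expression, to keep the alert from flapping
# around the threshold
avg(/Web server/system.cpu.load,5m)&lt;1.5
```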

&lt;h2&gt;
  
  
  Deployment Model: Cloud vs. On-Premise
&lt;/h2&gt;

&lt;p&gt;This is one of the most significant differences between the two platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datadog: Pure SaaS
&lt;/h3&gt;

&lt;p&gt;Datadog is delivered as a cloud-based SaaS service. There is no on-premise version.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setup:&lt;/strong&gt; You create an account, install agents, and data starts flowing. There is no server installation or maintenance required from you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Datadog's backend scales seamlessly as your usage grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement:&lt;/strong&gt; This model requires a constant internet connection to send your monitoring data to Datadog's cloud. This may be a non-starter for organizations with air-gapped networks or strict data sovereignty rules.&lt;/li&gt;
&lt;/ul&gt;
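&lt;p&gt;Setup really is close to a one-liner. The command below follows the pattern in Datadog's Agent install docs; the API key is a placeholder from your own account, and &lt;code&gt;DD_SITE&lt;/code&gt; depends on the region your account lives in:&lt;/p&gt;

```shell
# Install the Datadog Agent (v7) on a Linux host.
# DD_API_KEY is a placeholder; replace with your account's key.
DD_API_KEY="your-api-key" DD_SITE="datadoghq.com" bash -c \
  "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
```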

&lt;h3&gt;
  
  
  Zabbix: Self-Hosted
&lt;/h3&gt;

&lt;p&gt;Zabbix is software you deploy and run yourself. This gives you complete control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setup:&lt;/strong&gt; You must install and configure the Zabbix server, a database (like MySQL or PostgreSQL), and the web front-end. This can be on an on-premise server, in a private cloud, or on a public cloud VM (like an AWS EC2 instance). Zabbix provides virtual appliances and Docker containers to simplify this process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control:&lt;/strong&gt; You have 100% control over your data, making it ideal for air-gapped or high-security environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; You are responsible for all maintenance, including OS patching, database management, scaling, and Zabbix software upgrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Zabbix can scale to monitor thousands of devices, but it requires careful architecture. This is often done using &lt;strong&gt;Zabbix Proxies&lt;/strong&gt;, which collect data from remote locations and send it to the central server, reducing load. Zabbix 6.0 and newer versions also include a native high-availability (HA) option.&lt;/li&gt;
&lt;/ul&gt;
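&lt;p&gt;For a quick trial (not a production layout), the official Docker images can stand up the server, database, and web front-end in a few commands. The image tags and password below are placeholders; check the Zabbix Docker documentation for current tags and sizing guidance:&lt;/p&gt;

```shell
docker network create zabbix-net

# Database (Zabbix supports MySQL or PostgreSQL; PostgreSQL shown here)
docker run -d --name zabbix-db --network zabbix-net \
  -e POSTGRES_USER=zabbix -e POSTGRES_PASSWORD=changeme postgres:16

# Zabbix server, pointed at the database container
docker run -d --name zabbix-server --network zabbix-net \
  -e DB_SERVER_HOST=zabbix-db \
  -e POSTGRES_USER=zabbix -e POSTGRES_PASSWORD=changeme \
  -p 10051:10051 zabbix/zabbix-server-pgsql:latest

# Web front-end, reachable on http://localhost:8080
docker run -d --name zabbix-web --network zabbix-net \
  -e DB_SERVER_HOST=zabbix-db -e ZBX_SERVER_HOST=zabbix-server \
  -e POSTGRES_USER=zabbix -e POSTGRES_PASSWORD=changeme \
  -p 8080:8080 zabbix/zabbix-web-nginx-pgsql:latest
```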

&lt;blockquote&gt;
&lt;p&gt;Note: Zabbix also offers &lt;a href="https://www.zabbix.com/cloud" rel="noopener noreferrer"&gt;&lt;strong&gt;Zabbix Cloud&lt;/strong&gt;&lt;/a&gt;, a fully managed SaaS option, for teams that prefer not to operate the Zabbix server themselves. However, some item types (e.g., &lt;strong&gt;SNMP traps&lt;/strong&gt;) aren’t supported; &lt;a href="https://www.zabbix.com/cloud#faq" rel="noopener noreferrer"&gt;upgrades happen during a weekly maintenance window&lt;/a&gt;; and current docs say you can’t import an existing instance into Cloud (as of October 2025).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Pricing Models and Cost Considerations
&lt;/h2&gt;

&lt;p&gt;The cost structure for these tools is fundamentally different.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zabbix Pricing
&lt;/h3&gt;

&lt;p&gt;The Zabbix software is &lt;strong&gt;completely free and open-source&lt;/strong&gt;. There are no license fees, regardless of the number of hosts, users, or metrics.&lt;/p&gt;

&lt;p&gt;The cost of Zabbix is its &lt;strong&gt;Total Cost of Ownership (TCO)&lt;/strong&gt;, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure:&lt;/strong&gt; The cost of the servers (on-prem hardware or cloud VMs) and storage for your Zabbix server and database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personnel:&lt;/strong&gt; The time and salary of the engineers required to install, configure, maintain, and upgrade the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support (Optional):&lt;/strong&gt; You can purchase official enterprise support subscriptions from Zabbix LLC in tiered packages for guaranteed assistance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Datadog Pricing
&lt;/h3&gt;

&lt;p&gt;Datadog is a commercial SaaS product with a &lt;strong&gt;pay-as-you-go subscription model&lt;/strong&gt;. Costs are broken down by product and usage, and it can become complex.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure:&lt;/strong&gt; Priced per host, per month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APM:&lt;/strong&gt; Priced per host or per GB of traced data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Management:&lt;/strong&gt; Priced per GB of data ingested and retained.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This model is flexible, but costs can increase sharply at scale. Every additional host, container, or GB of logs adds to the monthly bill, requiring active cost management. Datadog offers a 14-day free trial of all features.&lt;/p&gt;
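&lt;p&gt;A back-of-the-envelope sketch shows how the line items add up. The unit prices below are assumptions for illustration only, not Datadog's list prices; check the official pricing page for current figures:&lt;/p&gt;

```shell
# Hypothetical monthly bill for a mid-sized fleet. All unit prices
# are ASSUMED placeholders, not quotes from Datadog.
hosts=50       # infrastructure hosts at an assumed $15/host/month
apm_hosts=20   # APM hosts at an assumed $31/host/month
log_gb=300     # ingested log volume at an assumed $0.10/GB
awk -v h="$hosts" -v a="$apm_hosts" -v g="$log_gb" \
  'BEGIN { printf "infra=%d apm=%d logs=%.0f total=%.0f\n",
           h*15, a*31, g*0.10, h*15+a*31+g*0.10 }'
```

&lt;p&gt;Even at these assumed rates, a modest fleet lands in the four figures per month, which is why per-host and per-GB growth needs active governance.&lt;/p&gt;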

&lt;h2&gt;
  
  
  Ease of Use and Learning Curve
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datadog
&lt;/h3&gt;

&lt;p&gt;Datadog generally wins on ease of initial setup. Because it is SaaS, you can be up and running in minutes. The UI is modern and intuitive, and pre-built dashboards provide immediate value. While the basic functionality is easy to grasp, mastering the full, complex product suite (especially cost management and advanced querying) has its own learning curve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zabbix
&lt;/h3&gt;

&lt;p&gt;Zabbix has a steeper learning curve. The initial setup is more involved, as it requires installing the server and database. Users must learn Zabbix-specific concepts like &lt;code&gt;Hosts&lt;/code&gt;, &lt;code&gt;Items&lt;/code&gt;, &lt;code&gt;Triggers&lt;/code&gt;, and &lt;code&gt;Templates&lt;/code&gt;. While recent versions have improved the UI, it can feel overwhelming to new users. Zabbix shines in the hands of a skilled team willing to invest time in learning and customizing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations and Ecosystem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datadog
&lt;/h3&gt;

&lt;p&gt;A key strength for Datadog is its vast ecosystem of &lt;strong&gt;1,000+ plug-and-play integrations&lt;/strong&gt; (as of October 2025). If you use a common cloud service, database, or developer tool, Datadog almost certainly has a one-click integration for it, complete with a default dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zabbix
&lt;/h3&gt;

&lt;p&gt;Zabbix comes with &lt;strong&gt;300+ official templates&lt;/strong&gt; for common systems. Its true power, however, is in its &lt;strong&gt;extensibility&lt;/strong&gt;. The Zabbix community is very active in sharing custom templates and scripts. You can use &lt;strong&gt;UserParameters&lt;/strong&gt; or external scripts to monitor virtually anything that can be accessed via a command or network protocol.&lt;/p&gt;

&lt;p&gt;This is a "plug-and-play" (Datadog) versus "script-and-play" (Zabbix) model. Datadog is faster for common technologies; Zabbix is more flexible for custom or legacy hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and Compliance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datadog
&lt;/h3&gt;

&lt;p&gt;As a managed service, Datadog lists certifications including &lt;strong&gt;SOC 2 Type II&lt;/strong&gt; and &lt;strong&gt;ISO 27001&lt;/strong&gt;, and offers &lt;strong&gt;HIPAA-eligible&lt;/strong&gt; options. All communication is encrypted by default. Datadog also sells additional security products, like &lt;strong&gt;Security Monitoring (SIEM)&lt;/strong&gt; and &lt;strong&gt;Cloud Security Posture Management (CSPM)&lt;/strong&gt;, allowing teams to unify operations and security on one platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zabbix
&lt;/h3&gt;

&lt;p&gt;With Zabbix, security and compliance are your responsibility. Because it is self-hosted, you control the environment. The software provides the necessary features, such as &lt;strong&gt;RBAC&lt;/strong&gt;, &lt;strong&gt;LDAP/Active Directory authentication&lt;/strong&gt;, and &lt;strong&gt;TLS encryption&lt;/strong&gt; for all components. However, you must implement and configure them correctly to meet standards like HIPAA or GDPR. Zabbix does not provide built-in SIEM or security analytics features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: When to Choose Which?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When to Choose Datadog
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You run a &lt;strong&gt;cloud-native&lt;/strong&gt; or microservices-based stack (Kubernetes, AWS, GCP, serverless).&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;full-stack observability&lt;/strong&gt; (metrics, traces, and logs) correlated in one platform.&lt;/li&gt;
&lt;li&gt;You prefer a &lt;strong&gt;managed SaaS solution&lt;/strong&gt; to minimize operational overhead and get started quickly.&lt;/li&gt;
&lt;li&gt;You have a &lt;strong&gt;budget&lt;/strong&gt; for a usage-based subscription and value speed-to-market over total control.&lt;/li&gt;
&lt;li&gt;Your infrastructure is highly &lt;strong&gt;dynamic&lt;/strong&gt; and auto-scales frequently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to Choose Zabbix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You are &lt;strong&gt;budget-conscious&lt;/strong&gt; and want to avoid recurring software license fees.&lt;/li&gt;
&lt;li&gt;Your primary need is monitoring &lt;strong&gt;on-premise infrastructure&lt;/strong&gt;, network devices (using SNMP), and physical servers.&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;strict data sovereignty&lt;/strong&gt; or security requirements that mandate a self-hosted or air-gapped solution.&lt;/li&gt;
&lt;li&gt;You have a &lt;strong&gt;skilled IT/Ops team&lt;/strong&gt; with the time to manage, tune, and customize a powerful monitoring tool.&lt;/li&gt;
&lt;li&gt;You value deep customization and the ability to run &lt;strong&gt;automated remediation scripts&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  SigNoz: Datadog's All-in-One Features, Zabbix's Open-Source Flexibility
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27vxwuyzxfh5gipijc0g.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27vxwuyzxfh5gipijc0g.webp" alt="SigNoz Infrastructure Monitoring Module" width="800" height="502"&gt;&lt;/a&gt;&lt;em&gt;SigNoz Infrastructure Monitoring Module&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Zabbix is free and open-source but lacks native APM and log analytics. Datadog offers a full-stack platform but is proprietary and can be very expensive.&lt;/p&gt;

&lt;p&gt;If you are looking for a solution that combines the best of both worlds, you should consider SigNoz. SigNoz is an all-in-one observability platform that unifies metrics, traces, and logs in a single application.&lt;/p&gt;

&lt;p&gt;It is built on OpenTelemetry, making it vendor-neutral and future-proof. SigNoz provides a full-stack observability experience similar to Datadog but with the flexibility and control of an open-source, self-hostable solution. For teams that want to avoid managing the backend, SigNoz also offers a cloud solution that is often &lt;a href="https://signoz.io/product-comparison/signoz-vs-datadog/" rel="noopener noreferrer"&gt;more cost-effective than Datadog&lt;/a&gt;.&lt;/p&gt;
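&lt;p&gt;Because SigNoz speaks standard OTLP, pointing an already-instrumented app at it is mostly configuration. The endpoint, token, and service name below are placeholders; see the SigNoz docs for the exact values for your region and deployment:&lt;/p&gt;

```shell
# Standard OpenTelemetry SDK environment variables - no vendor SDK needed.
# The endpoint and token are placeholders for a SigNoz Cloud account.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.us.signoz.cloud:443"
export OTEL_EXPORTER_OTLP_HEADERS="signoz-access-token=YOUR_TOKEN"
export OTEL_RESOURCE_ATTRIBUTES="service.name=checkout-service"
```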

&lt;h3&gt;
  
  
  Get Started with SigNoz
&lt;/h3&gt;

&lt;p&gt;You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is &lt;a href="https://signoz.io/teams/" rel="noopener noreferrer"&gt;SigNoz cloud&lt;/a&gt;. We offer a 30-day free trial account with access to all features.&lt;/p&gt;

&lt;p&gt;Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either &lt;a href="https://signoz.io/contact-us/" rel="noopener noreferrer"&gt;enterprise self-hosted or BYOC offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our &lt;a href="https://signoz.io/docs/install/self-host/" rel="noopener noreferrer"&gt;community edition&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Hope we answered all your questions regarding Datadog vs Zabbix. If you have more questions, feel free to use the SigNoz AI chatbot, or join our &lt;a href="https://signoz.io/slack/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also subscribe to our &lt;a href="https://newsletter.signoz.io/" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt; for insights from observability nerds at SigNoz: open-source, OpenTelemetry, and devtool-building stories, straight to your inbox.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
