For the past 8 years, Lifemote Networks has been revealing home Wi-Fi insights to our customers, from major Tier 1 operators to Tier 3 altnets across Europe. We analyze billions of events to help ISPs provide an impeccable Wi-Fi experience and boost their customers' satisfaction. With these analyses, ISPs can detect problems and spot issues before customers notice them.
Why Observability Matters to Us
Observability is essential for every company, but as a company built on delivering insights, we must have deep visibility into our own products. Collecting precise metrics and meaningful logs is essential for troubleshooting on our platform. They provide crucial context for root cause analysis. However, without distributed tracing, debugging cross-service issues required hours of manual log correlation across multiple dashboards. We had to find a way to transform monitoring into unified observability.
We're strong believers in CNCF projects and have successfully implemented several. Given how critical observability has become to our infrastructure, we explored the CNCF landscape and found our answer.
The Old Monitoring Stack
Metrics Collection: Our stack consisted of Prometheus, Grafana OSS, the AWS-distributed Fluent Bit, FireLens/Firehose, OpenSearch, and CloudWatch. To collect metrics, we followed this pattern: our services exposed metrics through a /metrics endpoint, a Prometheus scraper service collected these metrics and wrote them to a central Prometheus via remote_write, and the central Prometheus was added as a data source in Grafana OSS.
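For reference, that pattern boils down to a scrape plus remote_write configuration along these lines; the job name, targets, and remote_write URL below are illustrative placeholders, not our real values.

```yaml
# Illustrative Prometheus config for the scraper service (placeholder values)
scrape_configs:
  - job_name: app-services
    metrics_path: /metrics          # services expose Prometheus metrics here
    static_configs:
      - targets: ["app.internal:8080"]

remote_write:
  - url: https://central-prometheus.internal/api/v1/write   # central Prometheus
```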
Log Collection: Collecting logs was more complex than collecting metrics; some of our services generated raw application logs. Our two products followed different logging patterns: one used Fluent Bit sidecars (the Fluent Bit image distributed by AWS), while the other logged directly to CloudWatch. The Fluent Bit sidecars collected logs, parsed them to JSON, and forwarded them to OpenSearch via Kinesis Firehose through FireLens.
Tracing Gap: Our tracing setup was relatively new back then and not fully operational across our production environments; we were running Jaeger only in non-prod environments to evaluate the workflow. Without distributed tracing in production, debugging cross-service issues meant manually stitching together logs from multiple systems. Our biggest gap: no trace-log correlation.
Our Case
The Problem
Like traditional monitoring solutions, ours weren't linked either. All signals were collected and stored without any correlation, making it nearly impossible to trace issues from end to end and forcing manual, time-consuming investigations across multiple visualization platforms. We knew we had to improve the stack quickly.
Why OpenTelemetry?
After an extensive search, we concluded that OpenTelemetry and the OpenTelemetry Collector were the answer. Our trace stack was already using OpenTelemetry Protocol (OTLP) libraries, so no changes were needed on the trace side.
The Log Challenge
The main challenge was the logs; they are notoriously difficult. Each logging provider/library uses its own format, so you need to specify how to parse it, and if you change one part, you need to reconfigure the parser. If you don't use the correct timestamp format when sending logs from the producer to the aggregator, your logs can appear to arrive too early or out of order.
The OpenTelemetry team was also aware of this challenge. As stated in their official documentation:
"For logs we did not take the same path. We realized that there is a much bigger and more diverse legacy in logging space. There are many existing logging libraries in different languages, each having their own API. Many programming languages have established standards for using particular logging libraries. For example in Java world there are several highly popular and widely used logging libraries, such as Log4j or Logback." This is why OpenTelemetry provides bridge libraries to extend existing logging libraries like Zap and Loguru, rather than forcing adoption of a new native logging API. While metrics and traces have fully fledged native libraries, logs rely on these bridge adapters to maintain compatibility with established logging ecosystems.
The Central Component of Observability: OpenTelemetry
OpenTelemetry (OTel) is an open-source observability framework that provides a unified approach to collecting and exporting telemetry data - metrics, logs, and traces. It has numerous advantages like:
Vendor Neutrality: OTel is vendor-agnostic with almost no lock-in and works with a broad spectrum of observability products. As long as your signals are produced in the OpenTelemetry Protocol (OTLP) format, you can choose your observability backend with far greater flexibility. Want to switch from Grafana OSS to Grafana Cloud, or from any other visualization platform to Grafana? No problem - just update your collector configuration; no application code changes are required.
Unified Telemetry: While metrics, logs, and traces each serve important purposes, they traditionally remain fragmented across different tools. OpenTelemetry changes this by enabling services to emit all three signal types with shared context and identifiers.
When investigating an issue, you can seamlessly trace from a high-level metric anomaly down to the exact log line and the distributed trace that caused it - all connected through trace_id and span_id. In our old stack, the same investigation required hours of manual correlation across dashboards.
This unified, vendor-neutral approach addresses our core challenge: Moving from siloed monitoring tools to correlated observability.
Collector Types: OTel Collector offers Core, Contrib, K8s, and many more distributions. We use Contrib because it includes the Grafana Cloud exporter and processors essential for our multi-cloud environment.
If you would like to review collector distributions in detail: Collector Distributions
Community Support
Our positive experience with CNCF projects like Prometheus, Fluent Bit, and Jaeger gave us confidence in adopting OpenTelemetry. The active community meant we could rely on timely support, regular updates, and battle-tested solutions.
Kubernetes is undoubtedly the most active and best-supported project within the foundation. With its large community, regular updates, and comprehensive documentation, it has become the de facto standard for container orchestration. Beyond Kubernetes, OpenTelemetry stands out as another CNCF project with strong community support and active development, and it has become the industry standard for observability, much like Kubernetes has for orchestration.
Our Implementation
Architecture Overview
We have multiple products that generate logs, metrics, and traces. Each product is written in a different language and architected with a different approach, but the infrastructure is broadly similar: regardless of the product, we use ECS as the compute platform, and all services run on ECS.
Why Sidecar Pattern?
OTel Collector can run as a standalone service or as a sidecar. We chose the sidecar pattern for two reasons:
- Service Mesh Compatibility: Our system uses AWS App Mesh. Rather than modifying mesh rules to allow tasks to reach an external collector, we bypassed the complexity entirely with sidecars.
- Failure Isolation: Sidecars provide failure isolation. If a standalone collector fails, the entire cluster loses observability. With sidecars, only the affected task's signals are lost.
Instrumentation: For simplicity, applications send their signals to the collectors over HTTP. Our services are written in Go and Python, and since each language has its own logging ecosystem, logs required the most language-specific work.
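To make the flow concrete, here is a minimal sketch of what a per-task sidecar collector configuration can look like in this kind of setup. The port, names, and central endpoint are illustrative, and only trace and log pipelines are shown, since metrics follow the Prometheus path described later.

```yaml
# Sidecar OTel Collector (one per ECS task) - illustrative sketch
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318     # the app container sends OTLP over HTTP to localhost

exporters:
  otlphttp/central:
    endpoint: https://otel-central.example.com   # central collector behind an ALB (placeholder)

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/central]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/central]
```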
Golang: In the Go project, we were using Zap and added the otelzap bridge to send Zap logs to OpenTelemetry. We ensured each log includes trace context for correlation.
OpenTelemetry Zap Log Bridge
Python: In Python we already had a custom logging library that extends Loguru, so we added a Sink class for OTLP log support. To achieve trace-log correlation, we convert each Loguru record to an OpenTelemetry LogRecord and inject trace_id, span_id, and trace_flags.
```python
import time

from opentelemetry import trace
from opentelemetry.sdk._logs import LogRecord  # SDK log record (module path may vary by SDK version)


class OTLPSink:
    """Loguru sink that converts each record into an OpenTelemetry LogRecord
    and injects the current trace context for trace-log correlation."""

    def __init__(self, otel_logger):
        self.otel_logger = otel_logger

    def __call__(self, message):
        record = message.record

        # Carry over source location and any bound "extra" fields as attributes.
        attributes = {
            "file": record["file"].path,
            "function": record["function"],
            "line": record["line"],
        }
        if record.get("extra"):
            for k, v in record["extra"].items():
                attributes[k] = str(v)

        # Grab the active span (if any) so the log can be linked to its trace.
        span = trace.get_current_span()
        span_context = span.get_span_context()

        log_record = LogRecord(
            timestamp=int(time.time() * 1_000_000_000),  # nanoseconds
            trace_id=span_context.trace_id if span_context.is_valid else 0,
            span_id=span_context.span_id if span_context.is_valid else 0,
            trace_flags=span_context.trace_flags if span_context.is_valid else 0,
            severity_text=record["level"].name,
            # get_severity_number: internal helper mapping Loguru level names
            # to OpenTelemetry SeverityNumber values (implementation omitted here).
            severity_number=get_severity_number(record["level"].name),
            body=str(record["message"]),
            attributes=attributes,
            resource=getattr(self.otel_logger, "_resource", None),
        )
        self.otel_logger.emit(log_record)
```
All these logging-library changes are backward compatible, meaning we can always return to our previous stack with a single environment variable change and, of course, a redeployment of tasks.
Signal Processing Pipeline: We want to control what happens to our signals and make additional modifications when they arrive at our central OTel Collector before reaching their final destination - in our case, Grafana Cloud. This gives us extensive control over dropping, transforming, and sampling data.
Sidecar Collectors: Drop high-volume, low-value signals from endpoints like /test, /ping, and /metrics (see the sketch below).
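Here is what that can look like with the Contrib filter processor; the exact attribute key depends on how the services are instrumented, so http.target is an assumption here.

```yaml
# Sidecar collector: drop spans for noisy, low-value endpoints before they leave the task
processors:
  filter/drop_noise:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/ping"'
        - 'attributes["http.target"] == "/test"'
        - 'attributes["http.target"] == "/metrics"'
```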
Central Collector: We use tail sampling for traces, which evaluates all spans of a trace before making a sampling decision (see the OTel sampling docs). Spans within a trace are buffered for 50 seconds; if the trace contains no spans from critical services, has no attributes marked for retention, shows no errors, and all spans complete within our latency threshold, the collector drops the entire trace.
Operational Reliability: The central collector task runs within an ECS service. If the collector fails, the ECS service will replace the task. We implemented alarms for internal metrics to catch issues early. Alarms for failed/refused signals or queue sizes send notifications to Slack channels.
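These alarms are built on the collector's own internal telemetry. A minimal sketch of turning it up is below; the metric names in the comment are the collector's standard internal metrics, but how they are exposed (a fixed address vs. configured readers) depends on the collector version.

```yaml
service:
  telemetry:
    metrics:
      level: detailed
      # The collector then reports internal metrics such as
      # otelcol_exporter_queue_size, otelcol_exporter_send_failed_spans,
      # and otelcol_receiver_refused_spans, which alerting can be built on.
```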
Configuration Management: The OTel Collector supports hot reloading of its configuration, but config changes shouldn't be applied by hand; you need a proper CI/CD pipeline to validate and deploy new configurations, otherwise you may lose observability data.
Wait, where are the Metrics?
The Cost Challenge: As mentioned earlier, we have multiple identical environments, each scraping application metrics and sending them to the central Prometheus via remote_write. Our products generate ~130K active series at any given time, far exceeding Grafana Cloud's included limit of 10K active series. Moving all metrics to Grafana Cloud would be costly: for every 1K active series beyond the included 10K, Grafana Cloud charges $6.50. Pricing of Grafana Cloud: https://grafana.com/pricing/
Our Solution: We decided to keep using our existing Prometheus and add it as the default metrics data source in Grafana Cloud. Grafana Cloud only bills for signals ingested into its own backend, so this approach saves us $780/month (120K extra series x $6.50 per 1K).
The Impact: Distributed Tracing in Action
Traces become valuable when correlated with logs, revealing the full request lifecycle across multi-service architectures.
The primary goal of this migration was to implement distributed tracing and correlate it with logs.
Before distributed tracing, finding an issue required extensive debugging across multiple backends. For example:
- Check metrics in Grafana and identify the bottleneck for an app/AWS service
- Check logs in OpenSearch for that timeframe
- Try to correlate them to find the root cause
- Repeat for each service
With distributed tracing:
- Search for errors in Grafana (automatic error filtering)
- Click on the failing trace
- See the entire request path across all services and jump directly to related logs
With distributed tracing, every service call is visible: who called whom, how long each step took, and even slow database queries are immediately identifiable.
This trace shows an analysis request for analytical workloads. The visualization immediately reveals:
- The analysis service (Service B) is responsible for the duration
- Downstream services (Service A, database, redis) complete quickly
- The bottleneck is computational, not I/O or database-related
In Grafana's trace view, each node and connection is clickable and allows drill-down into individual spans, logs, and performance metrics for deeper investigation.
Before distributed tracing, understanding this request flow would have required manual correlation and been more time-consuming.
Managing High Trace Volumes
As services communicate with each other, they tend to produce more and more spans.
As mentioned, our visualization backend is Grafana Cloud - the last stop for the signals. A huge amount of trace data can easily overwhelm the central collector, and during our initial deployment we hit a memory exhaustion (OOM) issue.
Memory filled up before the queue reached its limit, causing the collector to crash. Later, Grafana Cloud's rate limiting reduced exporter throughput, causing backpressure and eventually filling the trace queue.
To ensure production stability, we implemented three key strategies:
Memory Limiter Processor: The memory_limiter processor is implemented as the first line of defense. This ensures the collector proactively drops data or triggers garbage collection before the ECS task hits its hard memory limit.
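A sketch of the processor is below; the limits are illustrative and must be tuned to the ECS task's memory allocation.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # keep well below the ECS task's hard memory limit (illustrative)
    spike_limit_mib: 300
  # memory_limiter should be the first entry in every pipeline's `processors:` list
  # so it can refuse data before anything downstream allocates memory for it.
```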
Coherent Queue Management: Every link, attribute, and resource attribute increases the size of a span. The queue_size and memory limits must be carefully synchronized with span size in mind: if the queue is configured to hold more data than your memory can handle, the collector will crash under pressure.
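For illustration, the relevant exporter settings look roughly like this; the endpoint and numbers are placeholders, and queue_size counts queued requests/batches rather than individual spans.

```yaml
exporters:
  otlphttp/grafana_cloud:
    endpoint: https://<your-grafana-cloud-otlp-endpoint>   # placeholder
    sending_queue:
      enabled: true
      num_consumers: 4     # parallel senders (illustrative)
      queue_size: 2000     # size this against typical batch size and the memory_limiter budget
```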
Refined Tail Sampling: To keep the volume meaningful and cost-effective, we tightened the policies applied within our 50-second tail sampling window. We now prioritize:
- Traces containing ERROR statuses.
- Traces with high latency (spans exceeding 15 seconds).
These configurations are not final; every system needs its own fine-tuning.
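For reference, a trimmed-down sketch of the Contrib tail_sampling processor expressing the two policies above; the decision window and thresholds mirror the text, everything else is illustrative.

```yaml
processors:
  tail_sampling:
    decision_wait: 50s            # the 50-second evaluation window
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 15000     # spans exceeding 15 seconds
```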
Beyond Observability: AWS Metrics and Private Data
AWS Service Metrics via CloudWatch
Application signals are collected via OpenTelemetry, and Prometheus scrapes application metrics. Beyond application insights, visibility into AWS services is also crucial: ECS task health, EC2 memory and CPU metrics (our ECS clusters run on EC2 instances), ALB performance, and so on. We configured Grafana Cloud to pull these metrics directly from CloudWatch using cross-account IAM roles, avoiding the need to replicate AWS metrics into our Prometheus infrastructure.
Private Database Access via Grafana Private Data Connect
Our device metadata, such as device types and analysis success/failure rates, lives in a private RDS instance. To visualize this data in Grafana Cloud without exposing the database publicly, we use Grafana Private Data Connect (PDC). This allows secure, private connectivity between Grafana Cloud and our RDS instance. PDC runs as a containerized agent on ECS Fargate, establishing a secure tunnel between Grafana Cloud and our VPC.
Challenges
Identical Multi-Account VPC Architecture
Our infrastructure is deployed from a single codebase, so the AWS resources in every account are identical - which also means VPC CIDR blocks overlap across accounts. This design is operationally efficient, but it creates drawbacks for connectivity between accounts:
- VPC Peering: Not possible with overlapping CIDRs, and peering offers no transitive connectivity anyway
- Transit Gateway (TGW): Transitive connection is possible, but the CIDR overlap problem remains the same
- PrivateLink: Technically possible, but adds unnecessary cost and complexity for our use case
Solution: Public Endpoints with Strong Security
Instead of insisting on private connectivity, we chose public endpoints with strong security controls:
- Central OTel Collector: Exposed via Application Load Balancer with HTTPS (TLS 1.3)
- Authentication: Bearer token authentication enforces access control (a configuration sketch follows below). Tokens are managed via AWS Secrets Manager and distributed across accounts through Terraform Cloud.
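A sketch of how this can be wired on the central collector's OTLP receiver using the Contrib bearertokenauth extension; the environment variable name and port are assumptions, and TLS terminates at the ALB in front of the collector.

```yaml
extensions:
  bearertokenauth:
    token: ${env:OTEL_COLLECTOR_TOKEN}   # injected from AWS Secrets Manager (assumed variable name)

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        auth:
          authenticator: bearertokenauth   # reject requests without the expected bearer token

service:
  extensions: [bearertokenauth]
```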
Conclusion
What We Achieved
With OpenTelemetry, fragmented monitoring transformed into unified observability:
- Correlated signals: Traces, logs, and metrics linked by trace_id and span_id
- Vendor flexibility: OTLP-based architecture allows backend changes without code modification
- Production-ready tracing: Full distributed tracing across all services
- Operational visibility: All observability data can be accessed from a single platform
AWS-Native Tools vs OpenTelemetry
AWS offers native observability tools like X-Ray, CloudWatch, and CloudWatch Logs Insights. We evaluated them, but chose OpenTelemetry for:
- Vendor neutrality: Not locked into the AWS ecosystem or any other vendor
- Unified standard: Single instrumentation for all signals
- Community momentum: Active CNCF project with broad industry adoption
What's Next: Continuous Profiling
The observability journey continues; the next signal will be continuous profiling - the "fourth pillar" of observability. The next steps include:
- Pyroscope integration with OpenTelemetry
- CPU and memory profiling for Go and Python services
- Linking profiles to traces for performance root cause analysis
Profiling will complete our observability stack, giving us code-level visibility into performance bottlenecks correlated with distributed traces.
Thanks
We hope you found this useful. Happy observing!





