Distributed architectures spanning Amazon Web Services and Microsoft Azure frequently suffer from observability fragmentation. When a request originates in an AWS-hosted API Gateway, traverses a set of microservices, and triggers a specialized worker in Azure, the visibility chain often breaks at the provider boundary. Engineering teams find themselves navigating isolated dashboards in CloudWatch and Azure Monitor, manually correlating timestamps and request IDs while the Mean Time To Resolution (MTTR) climbs. This lack of a unified telemetry pipeline conceals latent bottlenecks and makes root-causing cross-cloud failures nearly impossible. The architectural remedy is a vendor-agnostic OpenTelemetry (OTel) Collector mesh. By standardizing the collection, processing, and export of traces, metrics, and logs using W3C Trace Context, organizations gain a single, high-fidelity view of the entire request lifecycle across heterogeneous cloud environments (Sridharan, 2018).
Prerequisites
Implementing this observability framework requires Terraform 1.8.0 or later to manage the cross-cloud provider configurations, along with the AWS provider (version 5.40+) and the AzureRM provider (version 3.90+). The instrumentation logic requires Python 3.12 with the opentelemetry-api, opentelemetry-sdk, and opentelemetry-exporter-otlp libraries. A working understanding of distributed tracing concepts, specifically spans, traces, and context propagation, is essential. Finally, your network topology must allow outbound traffic to the OpenTelemetry Collector endpoints over gRPC or HTTP/protobuf.
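As a minimal sketch, those version constraints translate into a Terraform settings block like the following (the versions.tf filename is a convention, not a requirement):

# observability/versions.tf
terraform {
  required_version = ">= 1.8.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.40"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.90"
    }
  }
}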
Step-by-Step
Provisioning the OpenTelemetry Collector Infrastructure
The initial phase involves deploying the OpenTelemetry Collector as a gateway service in both AWS and Azure. The collector acts as a centralized processing unit that receives telemetry data from local microservices, applies transformations, and exports it to a backend of choice, such as Jaeger, Honeycomb, or Managed Grafana. We use Terraform to deploy the collector on AWS Elastic Container Service (ECS) and Azure Container Instances (ACI). Positioning the collector close to the services minimizes the impact of telemetry export on application latency. The architectural justification for this mesh is the decoupling of instrumentation from the backend destination: if the organization later switches observability vendors, only the collector configuration needs to change, rather than refactoring code across the entire microservice ecosystem.
# observability/otel_collector.tf
resource "aws_ecs_task_definition" "otel_collector" {
family = "otel-collector-task"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "512"
memory = "1024"
container_definitions = jsonencode([
{
name = "aws-otel-collector"
image = "public.ecr.aws/aws-observability/aws-otel-collector:latest"
command = ["--config=/etc/otel-collector-config.yaml"]
portMappings = [
{ containerPort = 4317, hostPort = 4317, protocol = "tcp" }, # gRPC
{ containerPort = 4318, hostPort = 4318, protocol = "tcp" } # HTTP
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/otel-collector"
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = "otel"
}
}
}
])
}
resource "azurerm_container_group" "otel_collector_azure" {
name = "aci-otel-collector"
location = "East US"
resource_group_name = "rg-observability"
ip_address_type = "Private"
os_type = "Linux"
container {
name = "otel-collector"
image = "otel/opentelemetry-collector-contrib:latest"
cpu = "0.5"
memory = "1.5"
ports {
port = 4317
protocol = "TCP"
}
}
}
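The task definition above points the collector at /etc/otel-collector-config.yaml, which the Terraform does not provision. A minimal sketch of that file follows; the backend endpoint is a placeholder, and how you deliver the file (baked into a custom image, an SSM parameter, or a mounted volume) depends on your setup. It wires the OTLP receivers on ports 4317 and 4318 into a batched export pipeline:

# observability/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 512
    timeout: 5s

exporters:
  otlp:
    endpoint: observability-backend.example.com:4317 # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]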
This deployment establishes the necessary ingress points for our telemetry data. How do we ensure that a trace initiated in an AWS Lambda function persists its unique identity when it crosses the wire to trigger an Azure Function?
Implementing W3C Trace Context Propagation
We maintain the continuity of a trace through explicit context propagation using the W3C Trace Context standard. In a multicloud hexagonal architecture, the instrumentation layer must inject trace headers into outgoing HTTP requests and extract them from incoming ones. The Python OpenTelemetry SDK automates this process: when an AWS-hosted service calls an Azure-hosted endpoint, the SDK injects a traceparent header into the request, and the Azure service extracts that header so that all spans generated within the Azure environment are associated with the original Trace ID. This standardized propagation is the technological glue that prevents trace fragmentation and allows observability tools to reconstruct the full sequence of events across cloud boundaries.
# observability/tracing_adapter.py
import requests

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Assumes a TracerProvider with an OTLP exporter has already been
# configured elsewhere (e.g., at service startup).


class CrossCloudTracer:
    def __init__(self, service_name: str):
        self.tracer = trace.get_tracer(service_name)
        self.propagator = TraceContextTextMapPropagator()

    def perform_cross_cloud_call(self, target_url: str, payload: dict):
        """
        Injects W3C Trace Context headers into the outgoing request
        to ensure trace continuity between AWS and Azure.
        """
        with self.tracer.start_as_current_span("outgoing-request-to-azure") as span:
            headers = {"Content-Type": "application/json"}
            # Inject the current span context (traceparent/tracestate) into the headers
            self.propagator.inject(headers)
            span.set_attribute("http.url", target_url)
            span.set_attribute("cloud.platform", "multicloud_bridge")
            try:
                response = requests.post(target_url, json=payload, headers=headers, timeout=10)
                response.raise_for_status()
                span.set_status(trace.Status(trace.StatusCode.OK))
                return response.json()
            except Exception as e:
                span.record_exception(e)
                span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                raise


# Example extraction in the receiving service
def handle_incoming_request(request_headers: dict):
    # Extract the parent context from the incoming traceparent header
    context = TraceContextTextMapPropagator().extract(carrier=request_headers)
    # Start a new span as a child of the extracted context
    tracer = trace.get_tracer("azure-receiver-service")
    with tracer.start_as_current_span("process-incoming-payload", context=context):
        print("Processing request with trace continuity...")
Effective context propagation ensures we have the data, but what mechanism prevents a flood of telemetry spans from overwhelming our network and inflating cloud egress costs?
Optimizing Through Head-Based and Tail-Based Sampling
Managing the volume of telemetry data in a high-throughput multicloud environment is critical to maintaining both performance and cost-efficiency. We address this by implementing sophisticated sampling strategies within the OpenTelemetry Collector. Head-based sampling occurs at the start of a trace, where a decision is made to sample a fixed percentage of requests (e.g., 5% of all traffic). However, to capture the most valuable data, we implement tail-based sampling in the collector mesh. This strategy allows the collector to inspect the entire trace after all spans have arrived before deciding whether to keep it. We configure the collector to prioritize traces that include error statuses or high latency values. This ensures that while we reduce overall data volume, we retain 100% of the traces that indicate system degradation, providing maximum diagnostic utility without the overhead of full tracing.
# observability/sampling_logic.py
# While sampling is often configured in the OTel Collector YAML,
# it can also be influenced by application-level logic.
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased


def get_sampler(rate: float = 0.1):
    """
    Returns a sampler that respects the parent's sampling decision
    but defaults to a specific ratio for new traces.
    This prevents broken traces in a multicloud flow.
    """
    # ParentBased ensures that if the AWS service sampled the trace,
    # the Azure service will also sample it.
    # Usage: TracerProvider(sampler=get_sampler(0.05))
    return ParentBased(root=TraceIdRatioBased(rate))
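The head-based sampler above keeps sampling decisions consistent across clouds, but the tail-based logic described earlier lives in the collector configuration. A sketch of a tail_sampling processor block follows, assuming the contrib distribution (verify that your collector build bundles this processor; the policy names and thresholds are illustrative):

# observability/tail-sampling-config.yaml (excerpt)
processors:
  tail_sampling:
    decision_wait: 10s # buffer spans this long before deciding
    num_traces: 50000  # max traces held in memory while waiting
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]

A trace is kept if any policy matches, so errors and slow requests survive at 100% while healthy traffic is thinned to 5%. Because tail-based decisions need every span of a trace on the same collector instance, horizontally scaled meshes usually add a routing tier (for example, the contrib loadbalancing exporter keyed on trace ID) in front of the sampling collectors.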
Common Troubleshooting
A frequent obstacle in cross-cloud tracing is the "Context Gap," where an intermediary load balancer or proxy strips the traceparent header. If your traces appear as separate, disconnected segments in your backend, verify that your AWS Application Load Balancer or Azure Application Gateway is configured to preserve custom headers. In Azure Application Gateway, ensure that the rewrite rules do not accidentally remove non-standard headers.
Another common issue involves clock drift between providers. If spans in Azure appear to start before the parent spans in AWS, the visualization will be distorted. While NTP (Network Time Protocol) synchronization is standard, significant drift can occur. In such cases, use the OpenTelemetry Collector's attributes processor to inject a cloud.region or provider.timestamp attribute, allowing for post-processing alignment in your observability backend.
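As a sketch, stamping each span's originating region with the attributes processor looks like the following (the processor name suffix is arbitrary; the processor only injects static values, so per-span timestamp rewriting would need the transform processor instead):

# Collector-side excerpt: tag every span with its origin for
# post-processing alignment in the backend.
processors:
  attributes/stamp-origin:
    actions:
      - key: cloud.region
        value: us-east-1
        action: insert # only adds the key if it is not already present

Remember to add attributes/stamp-origin to the processors list of the traces pipeline on each regional collector.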
Finally, verify IAM and Managed Identity permissions. If the OTel Collector fails to export data, check for Access Denied errors in its internal logs. The AWS ECS task role requires permissions to write to X-Ray or CloudWatch if using those exporters, and the Azure Managed Identity must have the Monitoring Metrics Publisher role for Azure-specific backends.
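A hedged Terraform sketch of both grants follows. The ECS task role name is illustrative, and the Azure side assumes the container group was created with a system-assigned identity (the earlier definition would need an identity block added):

# observability/otel_permissions.tf
# AWS: allow the collector's task role to write trace segments to X-Ray.
resource "aws_iam_role_policy_attachment" "otel_xray_write" {
  role       = "otel-collector-task-role" # illustrative role name
  policy_arn = "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
}

# Azure: allow the collector's managed identity to publish metrics.
data "azurerm_resource_group" "observability" {
  name = "rg-observability"
}

resource "azurerm_role_assignment" "otel_metrics_publisher" {
  scope                = data.azurerm_resource_group.observability.id
  role_definition_name = "Monitoring Metrics Publisher"
  principal_id         = azurerm_container_group.otel_collector_azure.identity[0].principal_id
}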
Conclusion
Orchestrating observability across AWS and Azure is no longer a luxury but a requirement for modern cloud-native reliability. By deploying a federated OpenTelemetry Collector mesh and enforcing W3C Trace Context propagation, you eliminate the visibility gaps inherent in multicloud deployments. Implementing tail-based sampling further ensures that your observability remains cost-effective while focusing on the high-value traces necessary for debugging. As a next logical step, consider integrating OpenTelemetry Metrics and Logs into this same pipeline to achieve "Three Pillars" correlation, allowing you to pivot from a high-latency trace directly to the underlying container logs and resource metrics with a single click.
References
Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability engineering: Achieving production excellence. O'Reilly Media.
Sridharan, C. (2018). Distributed systems observability: A guide to building robust systems. O'Reilly Media.
World Wide Web Consortium. (2021). Trace context (W3C Recommendation). https://www.w3.org/TR/trace-context/
Shkuro, Y. (2019). Mastering distributed tracing: Analyzing performance in microservices and complex systems. Packt Publishing.
