Your checkout flow starts in an AWS Lambda function, calls a payment service running on EKS, then triggers notifications through Azure Functions. Three different compute platforms, two cloud providers, one distributed trace that you can't see.
Cloud providers want you to use their native monitoring tools. AWS pushes X-Ray and CloudWatch. Azure promotes Application Insights and Azure Monitor. These tools work well within their ecosystems but lock you into vendor-specific implementations. When you need to trace a request across AWS Lambda and Azure Functions, you're stuck correlating logs manually between two separate systems.
This vendor lock-in creates operational complexity. Your team learns AWS monitoring for some services and Azure monitoring for others. Dashboards don't unify. Alerts need separate configuration. When you migrate workloads between clouds, you rewrite all your observability code. OpenTelemetry eliminates this fragmentation by providing one standard that works across all cloud providers.
Cloud Platform Challenges
Serverless functions create observability gaps. AWS Lambda and Azure Functions spin up, execute, and terminate within seconds. Traditional monitoring assumes persistent processes where you can connect and query state. Ephemeral functions disappear before you finish reading their logs.
Cold starts amplify debugging difficulty. A Lambda function taking 3 seconds to respond might spend 2.5 seconds initializing and 0.5 seconds executing business logic. Without proper instrumentation, you see only the total duration. You can't distinguish initialization overhead from actual performance problems.
Container orchestration adds another layer of complexity. An EKS pod restarts after an out-of-memory error. The error gets logged, but the pod's gone. You have a timestamp and a stack trace pointing to a container that no longer exists. Without centralized trace storage, reconstructing what happened becomes guesswork.
Multi-cloud deployments multiply these issues. Your API Gateway in AWS routes to a service mesh in Azure. Each provider has different metric names, different trace formats, different retention policies. Correlating a slow response across this boundary requires manual work that shouldn't exist.
OpenTelemetry for Cloud
OpenTelemetry provides vendor-neutral APIs that work identically across cloud providers. The same instrumentation code runs on AWS Lambda, Azure Functions, EKS, and AKS. You instrument once and export to any backend, including cloud-native options or independent platforms like Uptrace.
The protocol handles automatic context propagation. When a Lambda function calls an EKS service, OpenTelemetry injects trace headers into the HTTP request. The EKS service extracts these headers and continues the distributed trace. This works regardless of where either service runs—same region, different regions, different cloud providers entirely.
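What the auto-instrumentation does here is standard W3C Trace Context propagation. The sketch below shows the same mechanism done by hand with the OpenTelemetry Python API, in case you ever need to propagate context yourself; the URL and function names are illustrative placeholders.

# Minimal sketch of what auto-instrumentation does behind the scenes.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Caller (e.g., Lambda): inject the current trace context into HTTP headers
def call_payment_service(order):
    with tracer.start_as_current_span("call_payments"):
        headers = {}
        inject(headers)  # adds the W3C traceparent header
        return requests.post("https://payments.example.com/charge",
                             json=order, headers=headers)

# Callee (e.g., EKS service): extract the context and continue the trace
def handle_request(request_headers, body):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("process_payment", context=ctx):
        pass  # business logic runs inside the propagated trace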
Lambda and Azure Functions both support OpenTelemetry through layers and extensions. These don't require code changes for basic tracing. HTTP requests, database queries, and downstream service calls get instrumented automatically. You add custom spans only for business-specific operations that need visibility.
Managed Kubernetes services like EKS and AKS benefit from OpenTelemetry's operator pattern. Deploy the OpenTelemetry Collector as a DaemonSet, and it automatically discovers pods, collects metrics, and forwards traces. Services running in pods just need to export to localhost. The collector handles the rest.
AWS Lambda Monitoring
Lambda functions execute in response to events. An API Gateway request, an S3 file upload, a DynamoDB stream update—each triggers a function invocation. OpenTelemetry captures these invocations as traces with the event source as context.
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.aws_lambda import AwsLambdaInstrumentor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint="https://otlp.uptrace.dev:4317",
    headers={"uptrace-dsn": os.environ["UPTRACE_DSN"]}
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Instrument Lambda handler
AwsLambdaInstrumentor().instrument()

def lambda_handler(event, context):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("process_order") as span:
        order_id = event.get("order_id")
        span.set_attribute("order.id", order_id)
        span.set_attribute("faas.trigger", event.get("source"))

        # Business logic
        validate_order(order_id)

        return {"statusCode": 200, "body": "processed"}
The AwsLambdaInstrumentor wraps your handler function automatically. It captures the event source, function name, request ID, and cold start status without manual instrumentation. These attributes help distinguish cold starts from warm invocations when analyzing latency.
Lambda layers simplify deployment. Package the OpenTelemetry SDK and dependencies as a Lambda layer. Reference this layer in your function configuration. Now multiple functions share the same instrumentation code without duplicating it in each deployment package.
# serverless.yml
functions:
  processOrder:
    handler: handler.lambda_handler
    layers:
      - arn:aws:lambda:us-east-1:123456789:layer:otel-python:3
    environment:
      UPTRACE_DSN: ${env:UPTRACE_DSN}
Cold start metrics become visible. OpenTelemetry marks the first invocation after deployment or scaling. You can filter traces by cold start status to see initialization overhead separately from execution time. This helps optimize Lambda configuration like memory allocation and provisioned concurrency.
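If you prefer an explicit flag on every span rather than relying on the instrumentor's attributes, a module-level variable works because Lambda reuses the process between warm invocations. A minimal sketch; the faas.coldstart attribute follows the FaaS semantic conventions, and the handler body is illustrative.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Module-level code runs once per container, so this is True only on a cold start
_cold_start = True

def lambda_handler(event, context):
    global _cold_start
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("faas.coldstart", _cold_start)
        _cold_start = False
        # ... business logic ...
        return {"statusCode": 200}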
EKS Microservices Tracing
Amazon EKS runs Kubernetes workloads in AWS. Pods start and stop based on load. OpenTelemetry's Kubernetes operator deploys collectors alongside your services to capture traces and metrics.
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      batch:
        timeout: 1s
      k8sattributes:
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.pod.name

    exporters:
      otlp:
        endpoint: otlp.uptrace.dev:4317
        headers:
          uptrace-dsn: ${UPTRACE_DSN}

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [otlp]
The k8sattributes processor enriches spans with Kubernetes metadata. Pod name, namespace, and deployment information attach to every span. When a pod crashes, you still have its traces with full context about which deployment and replica set it belonged to.
Service mesh integration provides deeper visibility. AWS App Mesh or Istio can propagate trace context automatically without code changes. The sidecar proxies inject and extract trace headers for every service-to-service call.
# Payment service in EKS
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/charge', methods=['POST'])
def charge_card():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("stripe_charge") as span:
        amount = request.json.get("amount")
        span.set_attribute("payment.amount", amount)
        # Stripe API call
        return {"status": "charged"}
The service exports to the OpenTelemetry Collector running in the same cluster. The collector adds Kubernetes attributes and forwards to Uptrace. Even when this pod terminates, the trace persists with full metadata about which pod processed the request.
Azure Functions Instrumentation
Azure Functions work similarly to Lambda but integrate with Azure's ecosystem. OpenTelemetry supports both in-process and isolated worker functions. The instrumentation pattern stays consistent with Lambda.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// Configure tracer
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({
  url: 'https://otlp.uptrace.dev:4317',
  headers: {
    'uptrace-dsn': process.env.UPTRACE_DSN
  }
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument HTTP
registerInstrumentations({
  instrumentations: [new HttpInstrumentation()]
});

module.exports = async function (context, req) {
  const tracer = require('@opentelemetry/api').trace.getTracer('notification-service');
  const span = tracer.startSpan('send_notification');
  span.setAttribute('user.id', req.query.userId);
  span.setAttribute('faas.trigger', 'http');

  // Notification logic
  await sendEmail(req.query.userId);

  span.end();
  context.res = { status: 200, body: 'sent' };
};
Azure Application Insights can coexist with OpenTelemetry. If you need Azure-specific features like application maps or smart detection, keep Application Insights enabled. OpenTelemetry exports to both Application Insights and external backends simultaneously.
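A sketch of that dual-export setup in Python, assuming the azure-monitor-opentelemetry-exporter package and the standard Application Insights connection string variable; the same idea applies in Node.js if your functions are JavaScript.

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Assumes the azure-monitor-opentelemetry-exporter package is installed
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

provider = TracerProvider()

# Exporter 1: Uptrace (or any OTLP backend)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="https://otlp.uptrace.dev:4317",
    headers={"uptrace-dsn": os.environ["UPTRACE_DSN"]},
)))

# Exporter 2: Application Insights via the Azure Monitor exporter
provider.add_span_processor(BatchSpanProcessor(AzureMonitorTraceExporter(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)))

trace.set_tracer_provider(provider)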
Durable Functions require special handling. These long-running orchestrations span multiple function invocations. OpenTelemetry tracks each activity function separately but needs correlation IDs to link them into one logical operation.
const df = require('durable-functions');

df.app.orchestration('orderWorkflow', function* (context) {
  const tracer = require('@opentelemetry/api').trace.getTracer('orchestrator');
  const span = tracer.startSpan('order_orchestration');

  const orderId = context.df.getInput();
  span.setAttribute('order.id', orderId);

  // Each activity creates its own span
  yield context.df.callActivity('validateOrder', orderId);
  yield context.df.callActivity('chargeCard', orderId);
  yield context.df.callActivity('shipOrder', orderId);

  span.end();
});
Each activity function exports traces with the same correlation ID. Uptrace links these spans into one distributed trace showing the complete workflow duration and each step's timing.
AKS Container Monitoring
Azure Kubernetes Service provides managed Kubernetes in Azure. The OpenTelemetry setup mirrors EKS but uses Azure-specific integrations for authentication and networking.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      batch:
      k8sattributes:
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.name
            - k8s.pod.uid

    exporters:
      otlp:
        endpoint: otlp.uptrace.dev:4317
        headers:
          uptrace-dsn: ${env:UPTRACE_DSN}

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [otlp]
Azure Monitor Container Insights can run alongside OpenTelemetry. Container Insights collects node-level metrics and logs. OpenTelemetry handles application traces and custom metrics. This dual approach gives both infrastructure visibility and application observability.
Virtual nodes running Azure Container Instances integrate seamlessly. These serverless pods scale beyond the cluster's node capacity. OpenTelemetry treats them identically to regular pods, maintaining trace context even as workloads move between VMs and serverless containers.
Cross-Cloud Trace Propagation
The real power emerges when services span cloud providers. An order starts in AWS API Gateway, validates in a Lambda function, processes payment in an EKS service, then sends notifications through Azure Functions.
# AWS Lambda - Order validation
import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Instrument the requests library so outgoing HTTP calls carry trace headers
RequestsInstrumentor().instrument()

def lambda_handler(event, context):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("validate_and_forward") as span:
        order = event["body"]
        span.set_attribute("order.total", order["total"])

        # Call EKS payment service; trace context is injected automatically
        response = requests.post(
            "https://payments.company.com/charge",
            json=order,
        )

        # Call Azure Functions notification
        if response.status_code == 200:
            requests.post(
                "https://notifications.azurewebsites.net/api/send",
                json={"order_id": order["id"]}
            )

        return {"statusCode": 200}
With the requests instrumentation enabled, every outgoing HTTP call carries the W3C traceparent header automatically. The EKS payment service extracts the context from the incoming request and continues the trace.
# EKS Payment Service
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/charge', methods=['POST'])
def charge():
    # Trace context automatically extracted from request headers
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("process_payment") as span:
        order = request.json
        span.set_attribute("payment.provider", "stripe")

        # Payment processing
        charge_card(order)

        return {"status": "success"}
The Azure Function receives the same trace context and adds its spans to the distributed trace.
// Azure Functions - Notifications
module.exports = async function (context, req) {
  // Trace context extracted automatically
  const tracer = require('@opentelemetry/api').trace.getTracer('notifications');
  const span = tracer.startSpan('send_email');

  span.setAttribute('notification.type', 'order_confirmation');
  await sendEmail(req.body.order_id);

  span.end();
  context.res = { status: 200 };
};
In Uptrace, you see one continuous trace. It starts in AWS Lambda, moves to EKS, crosses to Azure Functions, and shows every span with full context. The cloud provider boundaries are invisible at the observability layer.
Serverless Performance Patterns
Cold starts dominate serverless latency. OpenTelemetry captures initialization time separately from execution time. You can compare cold start overhead across different memory configurations to optimize cost versus performance.
Provisioned concurrency eliminates cold starts but costs more. OpenTelemetry metrics show what percentage of invocations benefit from warm instances. If 95% of requests hit warm functions, provisioned concurrency might not justify its cost.
Memory allocation affects Lambda and Azure Functions performance non-linearly. Higher memory gives more CPU power. OpenTelemetry duration metrics reveal the sweet spot where increasing memory stops improving execution time.
Timeout configuration needs data-driven decisions. Functions timing out waste money and produce errors. OpenTelemetry's P95 and P99 latency metrics show what timeout value covers most requests while avoiding false failures.
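A sketch of recording invocation duration as an OpenTelemetry histogram, from which the backend computes P95/P99; the meter and metric names are placeholders, and a configured MeterProvider is assumed so the measurements are actually exported.

import time
from opentelemetry import metrics

meter = metrics.get_meter("order-service")

# Histogram of handler duration; percentile views come from the backend
duration_ms = meter.create_histogram(
    "function.invocation.duration",
    unit="ms",
    description="Handler execution time",
)

def lambda_handler(event, context):
    start = time.monotonic()
    try:
        # ... business logic ...
        return {"statusCode": 200}
    finally:
        duration_ms.record(
            (time.monotonic() - start) * 1000,
            {"faas.name": context.function_name},
        )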
Unified Observability Strategy
Start with service boundaries. Instrument each microservice with OpenTelemetry regardless of where it runs. Lambda functions, containers, and long-running processes all use the same SDK configured for their runtime environment.
Deploy collectors strategically. Lambda functions export directly to Uptrace via OTLP. Container platforms run collectors as DaemonSets or sidecars that batch and forward traces. This minimizes egress costs while maintaining low latency.
Use consistent semantic conventions. HTTP server spans should have http.method, http.status_code, and http.target attributes everywhere. Database spans need db.system, db.statement, and db.name. This consistency makes cross-service analysis trivial.
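A short sketch of applying those conventions to manually created spans; auto-instrumentation sets equivalent attributes for supported HTTP and database libraries, so hand-written spans like these are only needed for code paths it doesn't cover.

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

# HTTP server span with the standard attributes
with tracer.start_as_current_span("GET /orders", kind=SpanKind.SERVER) as span:
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.target", "/orders")
    span.set_attribute("http.status_code", 200)

# Database client span with the standard attributes
with tracer.start_as_current_span("SELECT shop.orders", kind=SpanKind.CLIENT) as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.name", "shop")
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE customer_id = ?")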
Configure sampling intelligently. High-traffic Lambda functions can sample at 10% to control costs. Low-traffic Azure Functions should capture 100% of traces. Head-based sampling in the SDK ensures complete traces even when sampling rates differ.
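A sketch of that head-based configuration in the Python SDK: a parent-based sampler keeps decisions consistent across services, while the ratio controls how many new traces a high-traffic function starts.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new root traces; always follow the parent's decision
# downstream so sampled traces stay complete end to end.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))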
Set up alerting on SLOs. Define service-level objectives like "99% of payment requests complete under 500ms" and configure alerts when metrics breach these thresholds. OpenTelemetry provides the metrics; Uptrace triggers the alerts.
Migration from Native Tools
Existing AWS X-Ray or Application Insights deployments can migrate gradually. Run OpenTelemetry alongside native tools during the transition. Both can export to their respective backends while you verify OpenTelemetry captures everything correctly.
AWS X-Ray SDK and OpenTelemetry can coexist in the same Lambda function. The X-Ray SDK creates segments while OpenTelemetry creates spans. After verifying equivalence, remove the X-Ray SDK and its dependency.
Application Insights can ingest OpenTelemetry data through the Azure Monitor exporter. Run it as a second exporter alongside OTLP during the transition. This preserves existing dashboards while you build new ones in Uptrace.
Container workloads migrate easiest. Deploy the OpenTelemetry Collector, update service endpoints to export to localhost:4317, and remove vendor-specific exporters. The application code doesn't change; only the collector configuration differs.
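A sketch of that endpoint swap, using the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable so the same image can point at a node-local collector in any cluster.

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Default to the local collector; override per environment without code changes
endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint))
)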
Implementation with Uptrace
Uptrace provides a unified interface for microservices across all cloud providers. Configure Lambda functions, EKS services, Azure Functions, and AKS workloads with the same project DSN.
# Same configuration works everywhere
UPTRACE_DSN = "https://<token>@uptrace.dev/<project_id>"

otlp_exporter = OTLPSpanExporter(
    endpoint="https://otlp.uptrace.dev:4317",
    headers={"uptrace-dsn": UPTRACE_DSN}
)
Service maps show dependencies across clouds. A Lambda function calling an AKS service appears as one edge in the graph. Request rates, latencies, and error percentages are visible for each connection.
Traces span cloud boundaries seamlessly. Click a slow request in the dashboard and see its complete path through AWS, your data center, and Azure. Each span includes cloud-specific metadata like function ARN, pod name, or container instance ID.
Cost optimization becomes possible. Compare Lambda memory configurations by filtering traces by function version. Analyze EKS pod resource requests against actual usage patterns. Azure Functions consumption plan costs correlate with invocation metrics.
Ready to unify your cloud observability? Start with Uptrace.