ajithmanmu

Posted on Jun 26

Async Tracing on AWS Lambda: Carrying Context Across SQS with OpenTelemetry

#aws #opentelemetry #serverless #observability

Tracing synchronous services is mostly a solved problem. Call a downstream service, the SDK injects headers, the trace connects. One chain, full context.

Async is different. Take a Stripe webhook pipeline: payment.failed comes in, your producer Lambda validates it and puts it on SQS, your consumer processes it — updates the subscription, notifies the user. Standard setup. Now a customer says their payment wasn't handled correctly. You open the trace in CloudWatch. It ends at SQS. Whatever happened on the consumer side — success, failure, partial write — is invisible. You know the event was received. You have no idea what happened next.

SQS doesn't carry trace context. Whatever triggered that message is disconnected from whatever processed it. This is what OpenTelemetry's propagation API is built for. What follows is a practical implementation on AWS Lambda using AWS Distro for OpenTelemetry (ADOT) and CloudWatch Application Signals.

The Problem

In synchronous HTTP-based architectures, tracing just works. When your API calls a downstream service, the SDK automatically injects trace context into the request headers. The downstream service picks it up, and the full call chain appears as one trace.

SQS breaks this. When you call sqs.send_message(), the message goes into a queue. The consumer Lambda runs later — independently, with its own trace ID. There's nothing connecting the two.

Here's what your Application Map looks like without instrumentation:

The consumer Lambda is processing events, writing to DynamoDB — but it's completely invisible to your observability stack. The service map ends at SQS. If a payment event fails processing, you have no idea what happened on the other side of the queue.

If you're new to OpenTelemetry concepts — spans, traces, context propagation — the LFS148 course from the Linux Foundation is worth going through first. It's the foundation this project was built on, and the concepts translate directly to what follows.

The Architecture

For this demo, I built a simplified webhook processing pipeline using AWS CDK:

API Gateway receives the webhook (e.g., a Stripe payment.failed event)
Producer Lambda validates and pushes it to SQS, injecting trace context into message attributes
SQS Queue buffers events; the Lambda event source mapping reads them with a 5-second batching window
Consumer Lambda pulls from SQS, extracts trace context per message, processes and writes to DynamoDB
ADOT (AWS Distro for OpenTelemetry) layer on both Lambdas handles auto-instrumentation and exports to Application Signals

Stack: Python 3.12, AWS CDK, ADOT Lambda layer (AWSOpenTelemetryDistroPython), CloudWatch Application Signals, X-Ray

The Fix

The fix requires taking matters into your own hands: the producer must explicitly inject the active trace context into the SQS message attributes, and the consumer must extract it before opening a new span.

Producer: inject trace context

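import json
import os

import boto3
from opentelemetry import propagate

sqs = boto3.client('sqs')
QUEUE_URL = os.environ['QUEUE_URL']  # assumed to be set on the function


def send_event(payload):
    # Write the active trace context into a plain dict carrier.
    carrier = {}
    propagate.inject(carrier)

    # SQS rejects empty attribute values, so attach the header only if present.
    attributes = {}
    if 'X-Amzn-Trace-Id' in carrier:
        attributes['X-Amzn-Trace-Id'] = {
            'DataType': 'String',
            'StringValue': carrier['X-Amzn-Trace-Id'],
        }

    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
        MessageAttributes=attributes,
    )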

propagate.inject(carrier) writes the current trace context into the carrier dict. With OTEL_PROPAGATORS=xray, the key is X-Amzn-Trace-Id — X-Ray's native format. We put it in the SQS message attributes so it travels with the message.
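
To make the carrier value concrete, here is a minimal sketch that parses the X-Ray header format using only the standard library. The parse_xray_header helper and the sample value are illustrative assumptions, not part of the original project; in the real pipeline the propagator does this parsing for you.

```python
import re

# Hypothetical debugging helper; the OpenTelemetry X-Ray propagator does the
# real parsing. Expected format:
#   Root=1-<8 hex>-<24 hex>;Parent=<16 hex>;Sampled=<0|1>
_ROOT_RE = re.compile(r"^1-[0-9a-f]{8}-[0-9a-f]{24}$")
_PARENT_RE = re.compile(r"^[0-9a-f]{16}$")


def parse_xray_header(value):
    """Return the header's fields as a dict, or None if it is malformed."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    root = fields.get("Root", "")
    parent = fields.get("Parent", "")
    if not _ROOT_RE.match(root) or not _PARENT_RE.match(parent):
        return None
    return {
        "trace_id": root,
        "parent_span_id": parent,
        "sampled": fields.get("Sampled") == "1",
    }


sample = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
print(parse_xray_header(sample))
```

A helper like this can be handy in a test script to check that the producer really attached a well-formed header before you go looking at traces.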

Consumer: extract and create child span

import json

from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)


def process_message(message):
    # Rebuild the carrier from this message's own attributes.
    carrier = {}
    attrs = message.get('messageAttributes', {})
    if 'X-Amzn-Trace-Id' in attrs:
        carrier['X-Amzn-Trace-Id'] = attrs['X-Amzn-Trace-Id']['stringValue']

    ctx = propagate.extract(carrier)

    with tracer.start_as_current_span('process-webhook', context=ctx) as span:
        try:
            # Assumes the body is the JSON webhook event with 'id' and 'type' fields.
            body = json.loads(message['body'])
            span.set_attribute('webhook.event_id', body['id'])
            span.set_attribute('webhook.type', body['type'])
            # ... process and write to DynamoDB
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise

propagate.extract(carrier) reconstructs the trace context from the message attribute. Passing it as context=ctx continues the upstream trace — this span uses the producer span as its parent when the context is valid.
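
One detail that is easy to trip over: the Lambda SQS event uses lower-camel-case keys (messageAttributes, stringValue), while a boto3 receive_message response uses MessageAttributes and StringValue. The sketch below is a hypothetical helper, not code from the original project, that builds the carrier from either shape so the same extraction logic can be reused in a local test script.

```python
TRACE_HEADER = "X-Amzn-Trace-Id"


def carrier_from_record(record):
    """Build a propagation carrier from one SQS message.

    Hypothetical helper: accepts both the Lambda event shape
    (messageAttributes / stringValue) and the boto3 receive_message shape
    (MessageAttributes / StringValue). Returns {} when no header is present.
    """
    attrs = record.get("messageAttributes") or record.get("MessageAttributes") or {}
    attr = attrs.get(TRACE_HEADER) or {}
    value = attr.get("stringValue") or attr.get("StringValue")
    return {TRACE_HEADER: value} if value else {}


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"

lambda_record = {
    "messageId": "m-1",
    "messageAttributes": {TRACE_HEADER: {"dataType": "String", "stringValue": header}},
}
boto3_message = {
    "MessageId": "m-2",
    "MessageAttributes": {TRACE_HEADER: {"DataType": "String", "StringValue": header}},
}

print(carrier_from_record(lambda_record))
print(carrier_from_record(boto3_message))
print(carrier_from_record({"messageId": "m-3"}))
```

The returned dict can be passed straight to propagate.extract(); an empty dict is a valid carrier and simply yields no upstream context.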

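The span.record_exception() + span.set_status(ERROR) calls are easy to miss but important whenever your code catches an exception inside the span. By default, start_as_current_span records an exception that propagates out of the with block and marks the span as an error. An exception that is caught and swallowed inside the block, however, closes the span with status UNSET, and your error rate metrics stay at zero. Setting the status explicitly covers both cases. Otherwise the dashboards look fine while messages are silently failing.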

The Batching Paradox

By default, SQS delivers up to 10 messages per Lambda invocation (the batch size is configurable). Each message can carry a different X-Amzn-Trace-Id from a different upstream request.

The wrong approach: extract trace context once at the handler level and apply it to all messages. Every consumer span would get linked to the same (wrong) producer trace.

The right approach: extract per message, inside the loop.

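def handler(event, context):
    print(f"[batch] size={len(event['Records'])}")
    batch_item_failures = []
    for message in event['Records']:
        try:
            process_message(message)  # each call extracts its own context
        except Exception as e:
            # Report only this message as failed so the rest of the batch is not retried.
            print(f"[error] message {message['messageId']} failed: {e}")
            batch_item_failures.append({'itemIdentifier': message['messageId']})
    # Takes effect only if ReportBatchItemFailures is enabled on the event source mapping.
    return {'batchItemFailures': batch_item_failures}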

In production, batch composition is unpredictable. Lambda can run several consumers concurrently, each processing a different batch. A burst of 8 events might land as two invocations, with batches of 5 and 3, and each invocation still links its spans to the right producer traces because the extraction happens per message.
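
The difference between the two approaches is easy to see without any AWS dependencies. The sketch below is a simulation under stated assumptions: the records are hand-built in the Lambda SQS event shape, and trace_root, link_per_batch, link_per_message, and make_record are hypothetical names used only for illustration.

```python
TRACE_HEADER = "X-Amzn-Trace-Id"


def trace_root(record):
    """Return the Root field of a record's trace header, or None if absent."""
    attr = record.get("messageAttributes", {}).get(TRACE_HEADER)
    if not attr:
        return None
    for part in attr["stringValue"].split(";"):
        key, _, value = part.partition("=")
        if key == "Root":
            return value
    return None


def link_per_batch(records):
    """Wrong: read the context once and reuse it for every message."""
    root = trace_root(records[0]) if records else None
    return {r["messageId"]: root for r in records}


def link_per_message(records):
    """Right: read the context from each message individually."""
    return {r["messageId"]: trace_root(r) for r in records}


def make_record(message_id, root):
    """Build a fake record in the Lambda SQS event shape (test data only)."""
    header = f"Root={root};Parent=53995c3f42cd8ad8;Sampled=1"
    return {
        "messageId": message_id,
        "messageAttributes": {TRACE_HEADER: {"stringValue": header}},
    }


batch = [
    make_record("m-1", "1-5759e988-bd862e3fe1be46a994272793"),
    make_record("m-2", "1-5759e989-aaaaaaaaaaaaaaaaaaaaaaaa"),
]
print(link_per_batch(batch))    # both messages map to m-1's trace
print(link_per_message(batch))  # each message maps to its own trace
```

With handler-level extraction, m-2 is attributed to m-1's trace and nothing fails loudly, which is why the mistake tends to go unnoticed.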

Graceful Degradation

Production is messy. You're going to get messages from legacy producers, manual CLI retries, or uninstrumented upstream teams — all without X-Amzn-Trace-Id. If your consumer assumes the attribute is always there, it'll blow up.

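carrier = {}
attrs = message.get('messageAttributes', {})
has_trace_context = 'X-Amzn-Trace-Id' in attrs

if has_trace_context:
    carrier['X-Amzn-Trace-Id'] = attrs['X-Amzn-Trace-Id']['stringValue']
else:
    # Legacy producers and manual retries land here; keep processing.
    print(f"[trace] no upstream context for message {message['messageId']}")

# An empty carrier yields an empty context, so the span below starts a new trace.
ctx = propagate.extract(carrier)

with tracer.start_as_current_span('process-webhook', context=ctx) as span:
    span.set_attribute('trace.has_upstream_context', has_trace_context)
    # ... process the message as before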

When context is missing, propagate.extract({}) returns an empty context and the span becomes a fresh root span. Processing continues normally. The trace.has_upstream_context=false attribute lets you query Application Signals to see how much of your traffic is still coming from uninstrumented sources.
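
If you also want a quick signal in the logs, a per-batch summary is cheap to compute. This is a minimal sketch, assuming records in the Lambda SQS event shape; context_coverage is a hypothetical helper and not part of the original project.

```python
TRACE_HEADER = "X-Amzn-Trace-Id"


def context_coverage(records):
    """Count how many records in a batch carry an upstream trace header."""
    with_context = sum(
        1 for r in records if TRACE_HEADER in r.get("messageAttributes", {})
    )
    total = len(records)
    return {
        "total": total,
        "with_context": with_context,
        "without_context": total - with_context,
    }


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
batch = [
    {"messageId": "m-1", "messageAttributes": {TRACE_HEADER: {"stringValue": header}}},
    {"messageId": "m-2", "messageAttributes": {}},  # e.g. a manual CLI retry
]
print(f"[trace] coverage={context_coverage(batch)}")
```

One log line per batch is enough to spot an uninstrumented producer without adding a span attribute query to your routine.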

One more edge case worth knowing: if the upstream trace was not sampled, the producer still injects a context, but it's a no-op — the consumer span becomes a root with no upstream link. In practice, this looks identical to graceful degradation. Sampling decisions happen at the producer; the consumer has no way to tell the difference.

The Result

After instrumentation, a single trace shows the full chain from the original HTTP request all the way to the DynamoDB write:

The "This trace is part of a linked set of traces" badge confirms the async boundary was crossed. The process-webhook → DynamoDB path in the same trace as the producer proves the context propagation worked.

Finding a Specific Event

When something goes wrong, you need to find what happened to a specific event fast. Since we set webhook.event_id as a span attribute, CloudWatch Transaction Search can find it directly.

Go to CloudWatch → Application Signals → Transaction Search, switch to Logs Insights QL, and run:

filter @message like "evt_your_event_id"

One result. It shows the span with all context: which service processed it, duration, whether trace context was present, and the traceId to navigate directly to the full waterfall.

CDK Configuration

A few things that aren't obvious from the AWS docs:

The ADOT layer requires a separate IAM policy. Tracing.ACTIVE in CDK only grants X-Ray permissions. Application Signals needs its own managed policy:

const appSignalsPolicy = iam.ManagedPolicy.fromAwsManagedPolicyName(
  'CloudWatchLambdaApplicationSignalsExecutionRolePolicy',
);
producer.role?.addManagedPolicy(appSignalsPolicy);
consumer.role?.addManagedPolicy(appSignalsPolicy);

Environment variables that matter:

environment: {
  AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-instrument',
  OTEL_SERVICE_NAME: 'webhook-producer',
  OTEL_AWS_APPLICATION_SIGNALS_ENABLED: 'true',
  OTEL_PROPAGATORS: 'xray',
  OTEL_METRICS_EXPORTER: 'none',
}

OTEL_PROPAGATORS=xray switches the context format from W3C traceparent to X-Ray's native X-Amzn-Trace-Id format — consistent with what the Lambda runtime and API Gateway already use.

Don't set OTEL_TRACES_EXPORTER=xray. This looks right but causes a RuntimeError: Requested component 'xray' not found at cold start. The ADOT Lambda layer doesn't expose an xray exporter entry point the way you'd expect — Application Signals handles the export path internally. Setting this env var tries to load a component that doesn't exist in this layer.
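
Since these settings fail quietly or only at cold start, a small startup or CI check can catch them early. The sketch below is one possible approach, not something the ADOT layer provides: check_otel_env is a hypothetical helper, and its rules simply mirror the settings described in this section.

```python
import os


def check_otel_env(env=None):
    """Return a list of human-readable problems with the OTel env settings.

    Hypothetical check; pass a dict to test it, or nothing to read os.environ.
    """
    env = os.environ if env is None else env
    problems = []
    if env.get("AWS_LAMBDA_EXEC_WRAPPER") != "/opt/otel-instrument":
        problems.append("AWS_LAMBDA_EXEC_WRAPPER is not /opt/otel-instrument")
    if not env.get("OTEL_SERVICE_NAME"):
        problems.append("OTEL_SERVICE_NAME is not set")
    propagators = [p.strip() for p in env.get("OTEL_PROPAGATORS", "").split(",")]
    if "xray" not in propagators:
        problems.append("OTEL_PROPAGATORS does not include xray")
    if env.get("OTEL_TRACES_EXPORTER") == "xray":
        problems.append("OTEL_TRACES_EXPORTER=xray fails at cold start with this layer")
    return problems


good = {
    "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
    "OTEL_SERVICE_NAME": "webhook-producer",
    "OTEL_PROPAGATORS": "xray",
}
bad = dict(good, OTEL_TRACES_EXPORTER="xray")

print(check_otel_env(good))
print(check_otel_env(bad))
```

Logging the returned list at the top of the handler module, or asserting it is empty in a deployment test, turns a silent gap in your traces into something you see on the first invocation.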

Key Takeaways

Async boundaries don't preserve trace context in any reliable way. Most AWS async services (SQS, SNS, Kinesis, EventBridge) don't propagate it end-to-end. Assume you need to handle it yourself.
Extract context per message, not per batch. If you extract once at the handler level, every message in the batch links to the same (wrong) producer trace. Individual event traceability breaks down silently.
Messages without trace context will arrive. Legacy producers, manual retries, uninstrumented teams — design for it from the start. Also: if the upstream trace wasn't sampled, the consumer span becomes a root regardless. Sampling decisions happen upstream.
Mark failing spans as errors explicitly. span.record_exception() + span.set_status(ERROR) are what make Application Signals register a processing failure as an error. Without them, a failure that is caught and handled inside the span leaves error rates looking fine while messages are failing.
Silent OTEL misconfiguration is worse than a crash. If OTEL fails to initialize, your Lambda runs normally but exports nothing. The customer sees success — and your observability stack shows a gap you can't explain.

Resources

GitHub repo — full CDK stack, Lambda code, and test scripts
LFS148: Getting Started with OpenTelemetry — the Linux Foundation course this project was built on; good starting point if you're new to OTel concepts
AWS ADOT Lambda layer docs
CloudWatch Application Signals
OpenTelemetry Python propagators
OTel Span Links — the spec-compliant alternative to parent-child for async trace stitching
