Last week we had a deploy on Suga that was taking longer than usual. It was hard to reproduce, so the only obvious move was to trace it end to end. We immediately hit a roadblock: the trace only covered the first 140 milliseconds.
For context: the bulk of our deployment code runs in a background worker and is triggered via a queue, and the worker's trace had no connection back to the request that kicked it off.
Without a complete trace, we were forced to look at everything related to our deployments and burned time combing through the worker looking for oddities. At that point we were stuck building a trace manually by correlating timestamps across logs, which is not a lot of fun.
It turns out that a lot of tracing docs spend very little time on edge cases like asynchronous workloads. Instead, they focus on the foundations: synchronous HTTP requests between services, which are a little more straightforward.
When one service calls another, the headers carry the trace, and you get a trace that flows like a waterfall. That waterfall is exactly what we wanted for our queued workloads.
Queues don't have headers
HTTP tracing works because requests have headers, a sanctioned place to attach arbitrary metadata that both sides know how to read. Queues generally don't, or at least not in the same obvious way.
They do have payloads, the data the job operates on, and the first instinct when you hit this problem might be to cram trace context in there. That might work briefly, but it's not a good solution: mixing transport metadata into business data makes the job harder to debug and maintain.
The trick is to find whatever metadata channel the queue already gives you and use that. Thankfully, almost every mature queue has one (probably because of this exact scenario): SQS has message attributes, Temporal has context propagators built into its SDK, and Hatchet (which we use to run our workflows) has a metadata field called additionalMetadata.
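For SQS, for instance, the same trace headers can ride along as message attributes. Here is a minimal sketch of that shaping step; the helper name `toSqsMessageAttributes` is our own invention, and the `{ DataType, StringValue }` shape matches what SQS expects per attribute:

```typescript
// Hypothetical adapter: wrap string trace headers in the per-attribute
// shape SQS message attributes use ({ DataType, StringValue }).
type SqsAttribute = { DataType: "String"; StringValue: string };

function toSqsMessageAttributes(
  trace: Record<string, string>,
): Record<string, SqsAttribute> {
  const attributes: Record<string, SqsAttribute> = {};
  for (const [name, value] of Object.entries(trace)) {
    attributes[name] = { DataType: "String", StringValue: value };
  }
  return attributes;
}

// The result slots into the MessageAttributes field of a send call,
// keeping trace context out of the message body entirely.
const attrs = toSqsMessageAttributes({ "sentry-trace": "abc-def-1" });
```

The point is the same as with Hatchet: the payload stays untouched, and the trace context travels in the channel the queue already reserves for metadata.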
What Hatchet gave us
additionalMetadata is a Record<string, string>, which turns out to be exactly the right shape for our trace data. It is structured enough to be useful and unstructured enough to carry whatever you want without Hatchet needing to understand it, and, more importantly, it requires no database or schema changes.
This is a big deal: because we were retrofitting traceability into existing workflows, our requirement was to minimize the impact on all of our callers and workers.
Attaching the headers alongside the existing metadata looks like this:
await workflow.runNoWait(input, {
  additionalMetadata: {
    organizationId,
    ...injectTraceContext(),
  },
});
Anatomy of a trace header
The pattern we used leans on trace propagation: the way tracing SDKs serialize the current trace into a small set of string headers on one side of a boundary (like an enqueue) and rehydrate it on the other (dequeue).
We use Sentry to handle the propagation, but the pattern works with any tracing SDK that supports an inject/continue pair.
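To show what an inject/continue pair means independent of any vendor, here is a toy propagator; the `TracePropagator` interface, the `toy-trace` header, and the hardcoded trace id are all illustrative inventions, not Sentry's API:

```typescript
// Minimal shape of an inject/continue pair (illustrative, not any real SDK).
interface TracePropagator {
  // Serialize the active trace into string headers (the "inject" half).
  inject(): Record<string, string>;
  // Rehydrate headers and run the callback inside the continued trace.
  continue<T>(headers: Record<string, string>, callback: () => T): T;
}

// Toy in-memory tracer standing in for a real SDK.
function makeToyPropagator(): TracePropagator & { current: string | null } {
  const tracer = {
    current: "trace-123" as string | null,
    inject(): Record<string, string> {
      return tracer.current ? { "toy-trace": tracer.current } : {};
    },
    continue<T>(headers: Record<string, string>, callback: () => T): T {
      const previous = tracer.current;
      tracer.current = headers["toy-trace"] ?? null; // adopt upstream trace
      try {
        return callback();
      } finally {
        tracer.current = previous; // restore whatever was active before
      }
    },
  };
  return tracer;
}

// Producer injects at enqueue; consumer continues at dequeue.
const producer = makeToyPropagator();
const headers = producer.inject();

const consumer = makeToyPropagator();
consumer.current = null; // worker starts with no active trace
const seen = consumer.continue(headers, () => consumer.current);
// Inside the callback, the consumer sees the producer's trace id.
```

Swap the toy tracer for a real SDK and the structure is the same: serialize on one side of the queue, rehydrate on the other.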
The sentry-trace header is Sentry's variant of the W3C Trace Context traceparent header, and baggage follows the W3C Baggage spec, which is what lets compliant SDKs interoperate.
A sentry-trace header is three dash-separated fields:
sentry-trace: 771a43a4192642f0b136d5159a501700-7d3efb1b173fecfa-1
              └──────── trace id ────────────┘ └── span id ───┘ └ sampled
- The trace id is the identifier every span in this trace shares, which is how the worker-side spans end up stitched to the web request in the UI.
- The span id is the parent the next span should hang off, so the worker's first span slots in as a child of whatever span called runNoWait.
- The sampled flag tells the receiver whether this trace is being recorded, so both sides agree on whether to keep the data.
OTel calls the equivalent header traceparent and structures it almost identically.
The baggage field is a comma-separated list of key=value pairs carrying trace-level metadata like environment and release.
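To make the anatomy above concrete, here is a hedged sketch of parsing both headers; the function names and return shapes are our own, not part of any SDK:

```typescript
// Illustrative parsers for the two headers described above.
interface ParsedTrace {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// sentry-trace is three dash-separated fields: trace id, span id, sampled flag.
function parseSentryTrace(header: string): ParsedTrace | null {
  const [traceId, parentSpanId, sampled] = header.split("-");
  if (!traceId || !parentSpanId) return null;
  return { traceId, parentSpanId, sampled: sampled === "1" };
}

// baggage is a comma-separated list of key=value pairs.
function parseBaggage(header: string): Record<string, string> {
  const entries: Record<string, string> = {};
  for (const pair of header.split(",")) {
    const [key, value] = pair.trim().split("=");
    if (key && value !== undefined) entries[key] = value;
  }
  return entries;
}

const parsed = parseSentryTrace(
  "771a43a4192642f0b136d5159a501700-7d3efb1b173fecfa-1",
);
// parsed?.traceId → "771a43a4192642f0b136d5159a501700", parsed?.sampled → true
```

In practice the SDK does this parsing for you; the sketch is only to show how little is actually crossing the queue boundary.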
The handoff
In Sentry the two sides of propagation are getTraceData() at enqueue and Sentry.continueTrace(...) at dequeue. We implemented two thin helpers around that pair, injectTraceContext and extractTraceContext: one takes the SDK's output and hands it to the queue, the other takes the queue's input and hands it back to the SDK.
injectTraceContext grabs the current trace's propagation headers, returning an empty object when there is no active trace data.
export function injectTraceContext(): Partial<TraceContext> {
  const data = Sentry.getTraceData();
  if (!data["sentry-trace"]) return {}; // no active trace on this request, nothing to propagate
  const result: Partial<TraceContext> = { "sentry-trace": data["sentry-trace"] };
  if (data.baggage) result.baggage = data.baggage;
  return result;
}
extractTraceContext takes the metadata coming in from the queue, and hands it to Sentry.continueTrace which wraps the callback so any spans created inside it become children of the original request.
export function extractTraceContext<T>(
  metadata: Record<string, string> | undefined,
  callback: () => T,
): T {
  const sentryTrace = metadata?.["sentry-trace"];
  if (!sentryTrace) return callback(); // upstream didn't propagate, run the task in its own trace
  return Sentry.continueTrace(
    { sentryTrace, baggage: metadata?.baggage },
    callback,
  );
}
Once continueTrace has fired, anything the task does after that, including database queries and outbound HTTP calls, hangs off the web request's trace automatically.
Making it invisible to the task author
From personal experience, developers aren't good at remembering things like tracing when writing new workers, so instead of exposing extractTraceContext directly, we put it alongside our structured logger inside the withTaskLogging wrapper that every task already uses.
return withTaskLogging(ctx, "deployment", "deploy", async ({ event }) => {
  event.set({ deploymentId: input.deploymentId });
  await deployEnvironment(input);
});
Behind the scenes, withTaskLogging handles the extraction before the task body runs:
return extractTraceContext(ctx.additionalMetadata(), () =>
  Sentry.startSpan(
    { name: `task:${workflowName}/${taskName}`, op: "hatchet.task" },
    async () => {
      // task body runs here, as a child of the original web trace
    },
  ),
);
The payoff
We now have a complete trace across synchronous and asynchronous work, and the 30.2 seconds the job spent queued shows up as a visible gap rather than something you'd have to infer from timestamps.
POST /api/deploy                        60ms   web
├─ db.query: select env                 12ms   web
└─ workflow.runNoWait: deploy           18ms   web
        30.2s gap (queued)
task:deployment/deploy                  8.4s   worker
├─ db.query: fetch deployment           45ms   worker
├─ k8s.apply: deployment                3.2s   worker
└─ k8s.wait: service loadbalancer       4.8s   worker
By the way, that gap was the entire mystery of the deploy we opened with: the deploy itself wasn't slow; it was waiting in the queue to execute, and that told us exactly where to look.