우병수

Posted on Jun 6 • Originally published at techdigestor.com

Debugging Async Job Failures Across Multiple Steps Without Losing Your Mind

#ai #programming #machinelearning #webdev

TL;DR: The thing that bites most teams isn't the failure itself. It's discovering that your job completed steps 1 through 3 perfectly, then died somewhere in step 4, and your logging infrastructure handed you exactly one line: Job failed: undefined method 'id' for nil:NilClass. No context. No breadcrumbs.

What's in this article

The Actual Problem: Your Job Failed on Step 4, But You Only Know It Failed
Step 1 — Give Every Job Run a Correlation ID Before You Do Anything Else
Step 2 — Structured Logging at Every Step Boundary, Not Just on Failure
Step 3 — Distributed Tracing for Multi-Step Jobs (OpenTelemetry Is Worth the Setup Pain)
Step 4 — Making Retry Logic Work For You, Not Against You
Step 5 — Error Context: Capture What You Need Before the Exception Bubbles Up
Step 6 — Local Debugging Workflow That Doesn't Require Production Access
Step 7 — Dashboards and Alerting That Surface Problems Before Users Report Them

The Actual Problem: Your Job Failed on Step 4, But You Only Know It Failed

The thing that bites most teams isn't the failure itself — it's discovering that your job completed steps 1 through 3 perfectly, then died somewhere in step 4, and your logging infrastructure handed you exactly one line: Job failed: undefined method 'id' for nil:NilClass. No context. No breadcrumbs. No indication of what the job was even processing when it blew up.

Here's the classic setup I see everywhere: a pipeline that goes fetch → transform → persist → notify. Each step depends on the output of the previous one. The fetch step calls an external API and shapes the response. Transform massages it into your internal schema. Persist writes it to Postgres. Notify fires a webhook to a downstream service. When this works, it's elegant. When it fails, you're staring at a stack trace that starts at the persist step with a nil reference error — but the actual bug was in the transform step, which quietly returned nil instead of raising.

# What your logs look like without deliberate step tracking
[2024-01-15 14:23:01] INFO: Starting InvoiceProcessingJob jid=abc123 invoice_id=9981
[2024-01-15 14:23:04] ERROR: Job failed jid=abc123 error="NoMethodError: undefined method 'amount' for nil"

# What you needed but don't have:
# - Which step was executing at 14:23:04
# - What the API returned in the fetch step
# - Whether transform produced a partial result or nil
# - The invoice_id AND the external API's transaction_id
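
One way to stop the transform-returns-nil case from surfacing two steps later is to check each step's output at the boundary. This is a minimal sketch, assuming the pipeline can be modeled as an ordered hash of step names to callables; run_pipeline and StepOutputError are hypothetical names, not part of any job framework:

```ruby
# Hypothetical error type that names the step which produced the bad output.
class StepOutputError < StandardError; end

# Runs each step in order and fails at the step that produced nil,
# instead of letting the nil travel to a later step.
def run_pipeline(input, steps)
  steps.reduce(input) do |value, (name, callable)|
    result = callable.call(value)
    if result.nil?
      raise StepOutputError, "step '#{name}' returned nil for input #{value.inspect}"
    end
    result
  end
end

# Illustrative steps only; the lambdas stand in for real fetch/transform/persist code.
steps = {
  "fetch"     => ->(invoice_id) { { "invoice_id" => invoice_id, "total" => nil } },
  "transform" => ->(raw) { raw["total"] && { amount: raw["total"] } }, # quietly returns nil
  "persist"   => ->(record) { record.fetch(:amount) }
}

begin
  run_pipeline(9981, steps)
rescue StepOutputError => e
  puts e.message # the message blames transform, not persist
end
```

The error now points at the step that produced the bad value, which is the piece of information the one-line log above is missing.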

Async failures are qualitatively different from synchronous ones for three reasons that compound on each other. First, there's no continuous stack trace — the frame that enqueued the job is long gone by the time the job runs, so you lose the caller context entirely. Second, jobs run without a request context, meaning your usual correlation IDs from HTTP middleware aren't automatically present. Third — and this is the one that really causes confusion — jobs retry on potentially different workers, different times, and in a different application state than when they first ran.

That last point deserves a concrete example. I had a Sidekiq job that processed order fulfillment. First attempt failed with a Net::ReadTimeout hitting a third-party shipping API — totally understandable, transient. Sidekiq retried it 25 minutes later (default exponential backoff). By then, a background data migration had run and soft-deleted the customer record the job was trying to read. Second attempt failed with ActiveRecord::RecordNotFound. Third retry, same error. By the time someone looked at the dead job queue, the error message was RecordNotFound across all three visible retries, and the original ReadTimeout was nowhere in sight — it had been overwritten in the job's exception history. The developer who investigated reasonably assumed the customer record was always missing, spent two hours in the wrong direction, and never found the timeout that started the whole cascade.

Standard application logging fails here for a structural reason: it's designed around request/response cycles. Your logger writes a line, the request ends, someone reads the log. In an async job, the "request" can span minutes across retries, workers, and application states. What you actually need is step-level state persistence — something that survives across retry boundaries and tells you not just that the job failed, but what each individual step produced, when it ran, and what it received as input. That's a different thing from logging, and conflating the two is exactly why post-mortems on async failures take so long.
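
To make "step-level state persistence" concrete, here is a minimal sketch of an append-only step journal. It assumes an in-memory hash as a stand-in for whatever durable store you would really use (Redis, Postgres), and StepJournal is a hypothetical name. The point it illustrates is that every attempt is appended rather than overwritten, so the first failure in a retry cascade stays visible:

```ruby
require "time"

# Hypothetical journal: one entry per step execution, appended, never overwritten.
class StepJournal
  Entry = Struct.new(:attempt, :step, :status, :input, :output, :error, :at, keyword_init: true)

  def initialize
    # In-memory stand-in for a durable store keyed by job ID.
    @entries = Hash.new { |hash, job_id| hash[job_id] = [] }
  end

  # Runs the block as one named step and records what went in and what came out.
  def record(job_id, attempt:, step:, input:)
    output = yield(input)
    append(job_id, attempt: attempt, step: step, status: "completed", input: input, output: output)
    output
  rescue StandardError => e
    append(job_id, attempt: attempt, step: step, status: "failed", input: input,
                   error: "#{e.class}: #{e.message}")
    raise
  end

  def history(job_id)
    @entries[job_id].map(&:to_h)
  end

  # The earliest failure, which is usually the one that started the cascade.
  def first_failure(job_id)
    @entries[job_id].find { |entry| entry.status == "failed" }
  end

  private

  def append(job_id, **fields)
    @entries[job_id] << Entry.new(at: Time.now.getutc.iso8601(3), **fields)
  end
end

journal = StepJournal.new
[[1, IOError, "read timeout"], [2, KeyError, "customer not found"]].each do |attempt, error_class, message|
  begin
    journal.record("abc123", attempt: attempt, step: "fetch", input: { order_id: 4521 }) do
      raise error_class, message
    end
  rescue StandardError
    # a real job framework would schedule the retry here
  end
end

puts journal.first_failure("abc123").error
```

With a journal like this, the investigation in the order-fulfillment story would have started from the timeout on attempt 1 instead of the RecordNotFound on attempt 3.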

Step 1 — Give Every Job Run a Correlation ID Before You Do Anything Else

The single change that made async debugging manageable for me wasn't better logging or fancier tooling — it was making sure every job run has a unique ID that travels with it everywhere. Without this, you're looking at a failure in step 4 of a pipeline and you have zero way to trace it back to the original trigger, the input payload, or what happened in steps 1-3. With it, you grep one ID and the entire execution history falls out.

Why This Is the Highest-ROI Change You'll Make

Most teams already have job IDs somewhere — Sidekiq gives you a jid, BullMQ gives you a job.id. The problem is those IDs die at the job boundary. When job A enqueues job B, job B gets a brand new ID with no link to A. You've now got two disconnected log streams and no way to join them without manually correlating timestamps or input data. A correlation ID (distinct from the job's own ID) is the thread you string through the entire chain — from the original HTTP request or webhook, through every queued step, to the final side effect.

Ruby/Sidekiq: Middleware Is the Right Injection Point

Don't put this in your job base class. Use a server middleware so it runs for every job automatically, even third-party ones. Here's what I actually run:

class CorrelationIdMiddleware
  def call(worker, job, queue)
    # job['correlation_id'] was set when the job was enqueued.
    # Fall back to jid only if nothing was passed — this is the entry point case.
    correlation_id = job['correlation_id'] || job['jid']

    Thread.current[:correlation_id] = correlation_id
    Thread.current[:parent_job_id]  = job['parent_job_id']

    Rails.logger.tagged("cid=#{correlation_id}") { yield }
  ensure
    # Always clear — Sidekiq reuses threads.
    Thread.current[:correlation_id] = nil
    Thread.current[:parent_job_id]  = nil
  end
end

# In your initializer:
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.prepend CorrelationIdMiddleware
  end
end

The ensure block is critical. Sidekiq uses a thread pool, and if you don't clear thread-local state, a future job on the same thread picks up stale IDs. I've seen this cause phantom correlations that took hours to untangle.

Node.js/BullMQ: Thread Locals Don't Exist, So Use the Payload

Node doesn't have thread-local storage the way Ruby does, but AsyncLocalStorage from Node's async_hooks module gives you the same semantics. For BullMQ specifically, I also pass correlationId directly in job.data as a belt-and-suspenders approach — it means the ID survives serialization to Redis and is always inspectable without running code:

import { Worker, Queue } from 'bullmq';
import { AsyncLocalStorage } from 'async_hooks';

export const correlationStore = new AsyncLocalStorage<string>();

const worker = new Worker('pipeline', async (job) => {
  const cid = job.data.correlationId ?? job.id;

  // Wrap the entire processor so every async call inside inherits the context
  return correlationStore.run(cid, async () => {
    logger.info({ cid, step: job.name }, 'job started');
    await processStep(job.data);
  });
}, { connection });

// When reading from anywhere inside processStep:
const cid = correlationStore.getStore(); // always available, no prop-drilling

The Gotcha That Burns Everyone: Child Jobs Don't Inherit Anything

This is the part that isn't in any README. When you enqueue a child job from inside a running job, the new job starts with a completely blank slate. The correlation ID does not carry over — not through Sidekiq's jid, not through BullMQ's job lineage, nothing. You have to pass it explicitly every single time:

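# Ruby: inside a parent job
def perform(order_id)
  # ... do work ...
  # Sidekiq serializes arguments to JSON and does not support keyword
  # arguments, so forward the IDs as top-level job attributes via `set`.
  # That is where CorrelationIdMiddleware reads them (job['correlation_id']).
  # If you skip this, the child job is an orphan in your logs.
  FulfillmentJob
    .set('correlation_id' => Thread.current[:correlation_id], 'parent_job_id' => jid)
    .perform_async(order_id)
end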

// BullMQ — inside a parent processor
await fulfillmentQueue.add('fulfill', {
  orderId: job.data.orderId,
  correlationId: job.data.correlationId, // must be explicit
  parentJobId: job.id,
});

I'd go further: make correlationId a required field in your job schema and let your queue helper throw if it's missing. That eliminates the class of bug where someone adds a new job enqueue deep in the call stack and forgets to thread the ID through.
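
A minimal sketch of such a queue helper, assuming payloads are string-keyed hashes and that the real push is hidden behind a callable. enqueue_with_correlation, MissingCorrelationId, and the enqueuer lambda are hypothetical names for illustration:

```ruby
# Hypothetical error raised when a job would be enqueued without its thread.
class MissingCorrelationId < ArgumentError; end

# `enqueuer` is any callable that pushes to your real queue, for example a
# lambda wrapping a Sidekiq or BullMQ client call.
def enqueue_with_correlation(enqueuer, job_name, payload)
  correlation_id = payload["correlation_id"]
  if correlation_id.nil? || correlation_id.to_s.strip.empty?
    raise MissingCorrelationId, "#{job_name} enqueued without correlation_id"
  end

  enqueuer.call(job_name, payload)
end

pushed = []
fake_enqueuer = ->(name, payload) { pushed << [name, payload] } # stand-in for a real client

enqueue_with_correlation(fake_enqueuer, "fulfill", { "order_id" => 4521, "correlation_id" => "cid-42" })

begin
  enqueue_with_correlation(fake_enqueuer, "fulfill", { "order_id" => 4521 })
rescue MissingCorrelationId => e
  puts e.message
end
```

Failing at enqueue time puts the error in front of the developer who forgot the ID, rather than in front of whoever debugs the orphaned child job later.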

Where to Emit the ID

The correlation ID is only useful if it shows up everywhere a failure might surface. My checklist:

Structured logs: every log line, not just the start/end lines — use Rails.logger.tagged or a Pino child logger with { cid } bound at job start
Error reporters: Sentry.set_tags(correlation_id: cid) before any work runs, which makes Sentry events searchable and filterable by ID
Outbound HTTP: set X-Correlation-ID on every external API call the job makes — when a vendor's support team needs to trace a request, you'll thank yourself
Webhook payloads: if the job sends a webhook on completion or failure, include correlation_id in the body — it lets your customers trace their side of the interaction back to your logs
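
For the outbound HTTP and webhook items in the checklist above, the wiring can be as small as this. It is a sketch that only builds the request (nothing is sent), and build_outbound_request plus the example URL are hypothetical:

```ruby
require "json"
require "net/http"
require "uri"

# Builds a POST that carries the correlation ID in both the header and the body.
def build_outbound_request(url, body, correlation_id)
  request = Net::HTTP::Post.new(URI(url))
  request["Content-Type"] = "application/json"
  request["X-Correlation-ID"] = correlation_id
  request.body = JSON.generate(body.merge("correlation_id" => correlation_id))
  request
end

request = build_outbound_request(
  "https://hooks.example.com/orders", # placeholder endpoint
  { "order_id" => 4521, "status" => "fulfilled" },
  "cid-42"
)
puts request["X-Correlation-ID"]
puts request.body
```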

Step 2 — Structured Logging at Every Step Boundary, Not Just on Failure

The thing that breaks people first isn't a lack of logs — it's that they have plenty of logs but can't reconstruct what actually happened. You have five workers processing jobs concurrently, each spitting out plain-text lines, and now you're trying to grep through 200,000 lines where Processing step 2 appears from 40 different jobs interleaved with each other. Plain text fails you the moment concurrency enters the picture. JSON logs let you filter on job_id instantly, hand them to any log aggregator (Datadog, Loki, CloudWatch) and get a coherent timeline per job without writing regex incantations.

The minimum viable log line isn't just a message and a timestamp. After debugging enough job failures in production, I landed on this field set as the floor, not the ceiling:

timestamp — ISO 8601 with milliseconds, not epoch. Humans read this.
job_id — the queue's internal ID, stable across retries
correlation_id — the ID that ties this job to the upstream HTTP request or event that triggered it. This is the field that saves you at 2am.
step_name — e.g. validate_payload, charge_card, send_confirmation
duration_ms — logged on step exit. If this step usually takes 80ms and now it's 4000ms, that's your signal.
status — started, completed, failed. Not true/false.
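
As a rough illustration of that field set, here is a sketch of a helper that builds one newline-delimited JSON line and rejects statuses outside the three allowed values. step_log_line is a hypothetical helper, not part of semantic_logger or pino:

```ruby
require "json"
require "time"

VALID_STATUSES = %w[started completed failed].freeze

# Builds one JSON log line containing the minimum field set.
def step_log_line(job_id:, correlation_id:, step_name:, status:, duration_ms: nil, now: Time.now)
  unless VALID_STATUSES.include?(status)
    raise ArgumentError, "status must be one of #{VALID_STATUSES.join(', ')}"
  end

  line = {
    timestamp: now.getutc.iso8601(3), # ISO 8601 with milliseconds
    job_id: job_id,
    correlation_id: correlation_id,
    step_name: step_name,
    status: status
  }
  line[:duration_ms] = duration_ms if duration_ms # only known on step exit
  JSON.generate(line)
end

puts step_log_line(
  job_id: "abc123",
  correlation_id: "cid-42",
  step_name: "charge_card",
  status: "completed",
  duration_ms: 83
)
```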

For Sidekiq with semantic_logger, here's the initializer I actually use. The key parts are forcing JSON format in non-dev environments and tagging every log with job context automatically:

# config/initializers/semantic_logger.rb
SemanticLogger.default_level = :info

if Rails.env.production? || Rails.env.staging?
  # JSON appender — pairs with Datadog or any log drain
  SemanticLogger.add_appender(
    io: $stdout,
    formatter: :json
  )
else
  SemanticLogger.add_appender(
    io: $stdout,
    formatter: :color  # readable locally, never in prod
  )
end

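# In your base worker class (subclasses implement #run instead of #perform):
class ApplicationWorker
  include Sidekiq::Worker

  def perform(*args)
    # Only read from the first argument when it is actually a payload hash;
    # calling dig on an Integer or String argument would raise NoMethodError.
    payload = args.first.is_a?(Hash) ? args.first : {}

    # Named tags are stamped onto every log line emitted inside the block.
    SemanticLogger.tagged(
      job_id: jid,
      correlation_id: payload["correlation_id"] || SecureRandom.uuid,
      worker: self.class.name
    ) do
      run(*args)
    end
  end
end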

BullMQ workers using pino need two things that most tutorials get wrong. First, keep pretty-printing out of production: it re-serializes each line, adds latency, and breaks every log aggregator that expects newline-delimited JSON. The old prettyPrint option is deprecated as of pino v7, so load pino-pretty as a transport in development only. Second, set level explicitly and create the root pino instance once at worker startup, then derive a child logger per job inside the handler. Constructing a fresh logger inside the handler loses the base fields bound at startup:

// worker.js
import { Worker } from 'bullmq';
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // prettyPrint is deprecated in pino v7+, use transport only in dev
  transport: process.env.NODE_ENV !== 'production'
    ? { target: 'pino-pretty' }
    : undefined,
  base: { service: 'job-worker', env: process.env.NODE_ENV },
});

const worker = new Worker('email-queue', async (job) => {
  const log = logger.child({
    job_id: job.id,
    correlation_id: job.data.correlationId,
    job_name: job.name,
  });

  const stepStart = Date.now();
  log.info({ step_name: 'validate_payload', status: 'started' }, 'step entry');

  try {
    await validatePayload(job.data);
    log.info({
      step_name: 'validate_payload',
      status: 'completed',
      duration_ms: Date.now() - stepStart,
    }, 'step exit');
  } catch (err) {
    log.error({
      step_name: 'validate_payload',
      status: 'failed',
      duration_ms: Date.now() - stepStart,
      err,
    }, 'step failed');
    throw err; // let BullMQ handle retry
  }
}, { connection: redisConnection });

Log step entry and exit — not just errors. I cannot stress this enough. An error log tells you a step failed. An entry log tells you a step started. An exit log with duration_ms tells you it finished normally and how long it took. Without all three, you can't distinguish between "step never started" and "step started but hung" and "step completed but the next one crashed". Those are three completely different problems requiring three different fixes. The timeline you reconstruct from logs is only as complete as the events you actually emitted.
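
The entry/exit/failure triple is easy to centralize so no step can forget one of the three. A minimal sketch, assuming any IO-like object as the log sink; with_step is a hypothetical helper and the StringIO is only there to make the example self-contained:

```ruby
require "json"
require "stringio"

# Wraps one step: logs entry, then either exit or failure, both with duration_ms.
def with_step(log, step_name, context = {})
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  emit = lambda do |status, extra = {}|
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    fields = context.merge(step_name: step_name, status: status, duration_ms: elapsed_ms)
    log.puts(JSON.generate(fields.merge(extra)))
  end

  log.puts(JSON.generate(context.merge(step_name: step_name, status: "started")))
  result = yield
  emit.call("completed")
  result
rescue StandardError => e
  emit.call("failed", error: "#{e.class}: #{e.message}")
  raise # let the job framework decide whether to retry
end

log = StringIO.new
with_step(log, "validate_payload", job_id: "abc123") { :ok }
begin
  with_step(log, "charge_card", job_id: "abc123") { raise IOError, "gateway timeout" }
rescue IOError
  # the failure is already logged; a real job would now be retried
end
puts log.string
```

A step that hangs shows up as a started line with no matching completed or failed line, which is exactly the case an error-only log cannot represent.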

Most teams I've seen get this backwards: they log INFO every time a small helper function is called (noise), but they don't log at step transitions (signal). You end up with 50 log lines per job that tell you nothing about the sequence of major operations, and zero lines that tell you which step was active when memory spiked. The discipline is: INFO at every step boundary, DEBUG for internal details inside a step, ERROR only when you're actually catching something. If your job produces more than 8–10 INFO lines for a normal successful run, you've drifted into spam territory and you'll start ignoring the logs entirely — which defeats the whole point.

Step 3 — Distributed Tracing for Multi-Step Jobs (OpenTelemetry Is Worth the Setup Pain)

The moment you have six or more steps in a job pipeline — especially when some of those steps fan out into parallel child jobs — logs become an archaeological dig. You're correlating timestamps across three different services, mentally reconstructing what ran before what, and hoping nobody deployed between the job being enqueued and it finally failing at step 4. I've been there, and the fix isn't better log formatting. It's traces.

The core idea with OpenTelemetry in async jobs: create a root span the moment you enqueue the job, not when it starts running. This is the part most tutorials get wrong. If you create the span inside the worker, you lose the enqueue-to-execution gap entirely — which is often where the real problem is hiding. Here's the pattern I use with a Node.js + BullMQ setup:

import { trace, context } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

const tracer = trace.getTracer('job-scheduler');
const propagator = new W3CTraceContextPropagator();

async function enqueueProcessOrderJob(orderId: string) {
  const span = tracer.startSpan('process-order.enqueue');
  const carrier: Record<string, string> = {};

  // Inject traceparent into carrier so the worker can re-attach
  propagator.inject(
    trace.setSpan(context.active(), span),
    carrier,
    { set: (c, k, v) => { c[k] = v; } }
  );

  await orderQueue.add('process-order', {
    orderId,
    _traceContext: carrier, // W3C traceparent travels with the job args
  });

  span.end();
}

Then on the worker side, you extract that context and re-attach before creating child spans for each step:

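import { propagation, context, trace } from '@opentelemetry/api';
import { Worker } from 'bullmq';

// BullMQ processors are passed to the Worker constructor
// (worker.process is the older Bull API).
// 'orders' is a placeholder: use the name of the queue behind orderQueue.
const worker = new Worker('orders', async (job) => {
  const parentCtx = propagation.extract(
    context.active(),
    job.data._traceContext ?? {}
  );

  const tracer = trace.getTracer('order-worker');

  // All child spans now link back to the root enqueue span
  return context.with(parentCtx, async () => {
    const validateSpan = tracer.startSpan('process-order.validate');
    try {
      await validateOrder(job.data.orderId);
    } finally {
      validateSpan.end(); // end the span even when the step throws
    }

    const chargeSpan = tracer.startSpan('process-order.charge');
    try {
      await chargeCustomer(job.data.orderId);
    } finally {
      chargeSpan.end();
    }
  });
}, { connection });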
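
It helps to know what actually travels in that _traceContext carrier: a single traceparent string in the W3C Trace Context format, version-traceid-parentid-flags. The sketch below builds and parses one by hand purely for illustration (in real code the OpenTelemetry propagator does this for you); build_traceparent and parse_traceparent are hypothetical helpers, and the validation is simplified:

```ruby
require "securerandom"

# version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_PATTERN = /\A(\h{2})-(\h{32})-(\h{16})-(\h{2})\z/

def build_traceparent(trace_id: SecureRandom.hex(16), span_id: SecureRandom.hex(8), sampled: true)
  format("00-%s-%s-%s", trace_id, span_id, sampled ? "01" : "00")
end

# Returns nil for anything that does not look like a traceparent.
# Simplified: the spec also rejects all-zero IDs and uppercase hex.
def parse_traceparent(header)
  match = TRACEPARENT_PATTERN.match(header.to_s)
  return nil unless match

  {
    version: match[1],
    trace_id: match[2],
    parent_span_id: match[3],
    sampled: (match[4].to_i(16) & 1) == 1
  }
end

job_args = { "orderId" => "4521", "_traceContext" => { "traceparent" => build_traceparent } }
p parse_traceparent(job_args["_traceContext"]["traceparent"])
```

Because it is just a string in the job payload, you can read the trace ID straight out of Redis for a stuck job and paste it into Jaeger.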

The SDK config for local dev is straightforward — point it at a Jaeger instance running in Docker. Jaeger's all-in-one image works fine and requires zero Kubernetes:

# docker-compose.yml snippet
jaeger:
  image: jaegertracing/all-in-one:1.57
  ports:
    - "16686:16686"  # UI
    - "4317:4317"    # OTLP gRPC receiver
  environment:
    COLLECTOR_OTLP_ENABLED: "true"

// tracing-init.ts — call this before anything else in your worker process
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4317',
  }),
  serviceName: 'order-worker',
});

sdk.start();

For backends beyond local Jaeger, the trade-offs are real. Jaeger self-hosted costs you ops time — storage configuration, retention policies, and scaling the collector when trace volume spikes. Good for teams that already run their own infra. Honeycomb has a free tier capped at 20 million events/month, and their query interface for finding traces by arbitrary field values is genuinely better than Jaeger's. Grafana Tempo makes sense if you're already deep in the Grafana stack — it's cheap to run (object storage backend like S3), but querying it without Grafana Loki for correlated logs feels incomplete. My honest take: use Jaeger locally, Honeycomb in production until your volume forces a conversation about cost.

The platform-level difference that surprised me: Temporal has first-class OpenTelemetry support baked in. You configure an interceptor and every activity execution automatically creates spans with workflow and activity names as attributes. Sidekiq itself ships no OTel integration as of 7.x, so you either wrap perform methods yourself or use middleware maintained outside Sidekiq, such as the opentelemetry-instrumentation-sidekiq gem from the OpenTelemetry Ruby contrib project. That's not a dealbreaker, but budget a couple of hours to wire it up properly and test that context actually propagates across retries.

Step 4 — Making Retry Logic Work For You, Not Against You

The thing that bites most teams isn't that retries happen — it's that they didn't think through what "retry from the beginning" actually means when three steps already ran. I've seen jobs send the same Stripe charge three times because someone assumed the framework was smarter than it is. It's not. You have to make it smart.

Default Retry Behavior: Where Each Framework Will Surprise You

Sidekiq retries failed jobs up to 25 times by default, with exponential backoff that stretches those attempts over roughly 21 days. The surprise: it retries the entire job from the top of perform, every time. No state is preserved between attempts. BullMQ's default is 0 retries: attempts defaults to 1, so your job fails once and goes straight to the failed set unless you explicitly configure attempts. I've watched teams run BullMQ in production for months thinking their jobs were succeeding because nothing was erroring loudly. The jobs were failing on the first transient error and nobody was looking. Temporal is the odd one out: it retries at the activity level, not the workflow level, which means a workflow with five activities will only re-execute the failed one. That's the model you actually want for multi-step jobs, and it's why Temporal's retry semantics feel dramatically different from queue-based systems once you've worked with both.

Idempotency Is Not Optional — Here's What Actually Breaks

Step 3 runs, succeeds, commits to your database. Step 4 throws. The job retries. Step 3 runs again. If step 3 is "create a user record" or "charge a card," you now have a duplicate. The fix isn't clever error handling. It's designing every step to be safe to run twice. That means upserts instead of inserts, idempotency keys on payment APIs, and checking state before writing it. Stripe's API accepts an Idempotency-Key header exactly for this reason (the Ruby client exposes it as the idempotency_key request option). Pass a deterministic key based on your job ID and step name, and a retried charge returns the original result instead of charging twice.

# Idempotency key scoped to job + step
idempotency_key = "job_#{job_id}_step_charge_user"

Stripe::Charge.create(
  { amount: 5000, currency: "usd", customer: customer_id },
  { idempotency_key: idempotency_key }
)

Checkpointing With Redis So Retries Resume, Not Restart

Instead of re-running every step on retry, record which steps finished. Redis HSET is perfect here — it's fast, atomic, and you can expire the key after 24 hours so you're not leaking state forever. Before each step runs, check the hash. If the field is already set, skip and move on.

# Before each step
def step_completed?(job_id, step_name)
  $redis.hget("job:#{job_id}:steps", step_name) == "done"
end

# After each step succeeds
def mark_step_complete(job_id, step_name)
  $redis.hset("job:#{job_id}:steps", step_name, "done")
  $redis.expire("job:#{job_id}:steps", 86400) # 24hr TTL
end

# Inside your job (jid is Sidekiq's job ID, stable across retries)
def perform(user_id)
  unless step_completed?(jid, "create_account")
    create_account(user_id)
    mark_step_complete(jid, "create_account")
  end

  unless step_completed?(jid, "send_welcome_email")
    send_welcome_email(user_id)
    mark_step_complete(jid, "send_welcome_email")
  end
end

The catch: job_id in Sidekiq is jid, available on the job instance. In BullMQ it's job.id. Make sure you're passing a stable identifier — if your job framework generates a new ID on each retry (some do), this whole pattern breaks. Verify that first.
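
To see the resume-not-restart behavior end to end without Redis, here is a small simulation. It assumes an in-memory hash in place of the Redis hash, and Checkpoints, run_once, and the fake steps are hypothetical names used only for this sketch:

```ruby
# In-memory stand-in for the Redis hash; swap in HGET/HSET calls for real use.
class Checkpoints
  def initialize
    @steps = Hash.new { |hash, job_id| hash[job_id] = {} }
  end

  # Runs the block unless this job already finished the named step.
  def run_once(job_id, step_name)
    return :skipped if @steps[job_id][step_name] == "done"

    yield
    @steps[job_id][step_name] = "done" # only reached when the block did not raise
    :ran
  end
end

checkpoints = Checkpoints.new
calls = []
email_service_up = false

attempt = lambda do
  checkpoints.run_once("jid-1", "create_account") { calls << "create_account" }
  checkpoints.run_once("jid-1", "send_welcome_email") do
    raise IOError, "smtp unavailable" unless email_service_up
    calls << "send_welcome_email"
  end
end

begin
  attempt.call # first attempt: account created, email step raises
rescue IOError
  email_service_up = true
end
attempt.call   # retry: create_account is skipped, email is sent

p calls # => ["create_account", "send_welcome_email"]
```

Note the gap this does not close: a crash after a step finishes but before its checkpoint is written still replays that step, so the idempotency advice above applies even with checkpoints in place.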

Exponential Backoff With Jitter — The Jitter Part Is Not Optional

Exponential backoff without jitter causes thundering herd. Imagine 500 jobs fail simultaneously because a downstream API returned 503. Pure exponential means they all retry at T+2s, then T+4s, then T+8s — in lockstep. You're sending synchronized bursts to an already-struggling service. Jitter breaks up that synchronization by adding randomness. The formula I use is base_delay * 2^attempt + rand(0..base_delay). That spreads 500 concurrent retries across a window instead of hammering the same second.

// BullMQ backoff config: built-in exponential, no jitter
const queue = new Queue("sync-jobs", {
  defaultJobOptions: {
    attempts: 8,
    backoff: {
      type: "exponential",
      delay: 1000, // 1s base, doubles each attempt
    },
  },
});

// The built-in exponential strategy as configured above adds no jitter.
// To roll your own, define a custom strategy on the worker and opt jobs
// into it with backoff: { type: "custom" }. Jobs left on "exponential"
// never call this function.
const worker = new Worker("sync-jobs", processor, {
  settings: {
    backoffStrategy: (attemptsMade) => {
      const base = 1000 * Math.pow(2, attemptsMade);
      const jitter = Math.random() * 1000;
      return Math.min(base + jitter, 30000); // cap at 30s
    },
  },
});
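
The formula itself, base_delay * 2^attempt + rand(0..base_delay), is small enough to check in isolation. A sketch, with the 30-second cap borrowed from the BullMQ snippet above rather than from the formula; backoff_with_jitter is a hypothetical helper:

```ruby
# delay = base_delay * 2^attempt + rand(0..base_delay), capped at max_delay (seconds).
def backoff_with_jitter(attempt, base_delay: 1.0, max_delay: 30.0, rng: Random.new)
  exponential = base_delay * (2**attempt)
  jitter = rng.rand(0.0..base_delay)
  [exponential + jitter, max_delay].min
end

rng = Random.new(42) # seeded only so the demo is repeatable
delays = Array.new(500) { backoff_with_jitter(3, rng: rng) }
puts "500 retries spread between #{delays.min.round(2)}s and #{delays.max.round(2)}s"
```

Without the jitter term all 500 jobs would land on exactly 8.0 seconds. With it they are spread across a one-second window, and a larger jitter range widens that window further if the downstream service needs more breathing room.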

Dead Letter Queues: Failed BullMQ Jobs Are Easy to Lose Track Of

BullMQ keeps failed jobs in the failed set by default (removeOnFail is off unless you set it), but nothing tells you they are there. And a common memory-saving tweak, removeOnFail: true, deletes a job the moment it exhausts its retry attempts. No trace. No alert. You find out something's broken when a customer emails you. Set removeOnFail: false explicitly when adding jobs so a shared default can't change it silently, and query the failed set so someone actually sees what lands there.

await queue.add("process-payment", payload, {
  attempts: 5,
  removeOnFail: false, // keeps the job in the "failed" set
  backoff: { type: "exponential", delay: 2000 },
});

// Later, query failed jobs for alerting or manual retry
const failedJobs = await queue.getFailed(0, 100);

On the Sidekiq side, jobs that exhaust retries land in the Dead set automatically, no config needed. The set is bounded: the defaults for dead_max_jobs and dead_timeout_in_seconds are 10,000 jobs and 6 months respectively, and entries beyond those limits are pruned. That is probably more than you want sitting in Redis, and in a failure storm the pruning can push out the very jobs you need to inspect, so set both values explicitly in your Sidekiq config.

Sidekiq Pro's Reliability Mode: When to Pay For It

Sidekiq Pro's reliability mode ($179/month as of 2024, single app license) solves a specific problem: if your Ruby process dies mid-job (OOM kill, hardware failure, a deploy that doesn't shut down cleanly), the job is lost. Standard Sidekiq fetches with BRPOP, so once a job is popped it exists only in the worker's memory and nothing re-queues it after a crash. Reliability mode (super_fetch) uses a two-queue approach instead: each job is moved into a private working queue in Redis and recovered if the process dies. Whether that's worth paying for depends entirely on your job type. If your jobs are idempotent and short (under 30 seconds), standard Sidekiq is usually fine, since a graceful shutdown pushes unfinished jobs back onto the queue during deploys. If you're running 10-minute jobs that move money, pay for it. Rolling your own reliable fetch with Lua scripts and Redis transactions is doable, but you'll spend a week getting it right and another week debugging edge cases. I've done it. Pay the $179.
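
The difference between the two fetch strategies is easier to see in a toy model. This sketch uses plain arrays in place of Redis lists and is not how Sidekiq or Sidekiq Pro is implemented internally; ToyQueue and its method names are hypothetical:

```ruby
# Toy model of basic fetch versus a two-queue reliable fetch.
class ToyQueue
  attr_reader :pending

  def initialize(jobs)
    @pending = jobs.dup
    @working = Hash.new { |hash, worker_id| hash[worker_id] = [] }
  end

  # Basic fetch: the job leaves the queue and lives only in the worker's memory.
  def basic_fetch
    @pending.shift
  end

  # Reliable fetch: the job moves to a per-worker working list instead.
  def reliable_fetch(worker_id)
    job = @pending.shift
    @working[worker_id] << job if job
    job
  end

  # Called when the worker is done with the job.
  def ack(worker_id, job)
    @working[worker_id].delete(job)
  end

  # Called on boot to re-queue whatever a dead worker left behind.
  def recover(worker_id)
    orphans = @working.delete(worker_id) || []
    @pending.unshift(*orphans)
    orphans
  end
end

basic = ToyQueue.new(["charge-1"])
basic.basic_fetch                   # worker crashes here: nothing left to recover
p basic.pending                     # => []

reliable = ToyQueue.new(["charge-1"])
reliable.reliable_fetch("worker-a") # worker-a crashes before ack
reliable.recover("worker-a")
p reliable.pending                  # => ["charge-1"]
```

The hard parts a real implementation has to solve, and the reason rolling your own takes weeks, are the ones this model skips: making the move atomic, detecting that a worker is actually dead, and not recovering a job twice.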

Step 5 — Error Context: Capture What You Need Before the Exception Bubbles Up

The frustrating thing about async job failures in Sentry isn't that the error shows up — it's that when it does, you're staring at a stack trace with zero context about which user triggered it, what arguments the job was running with, or which step in a multi-stage pipeline actually blew up. By the time the exception bubbles to your error handler, the context that would've made this a 2-minute fix is gone. You're left guessing.

The fix is dead simple but the ordering matters more than most docs admit: set your scope before the try/rescue block, not inside the rescue. If you put context-setting inside the rescue, you're still flying blind for errors that don't raise — timeouts that swallow exceptions, external services that return 200 with an error body, that kind of thing. Attach the context at job entry, unconditionally.

Here's what that looks like in Ruby with Sidekiq:

class ProcessPaymentJob
  include Sidekiq::Job

  def perform(user_id, order_id, step_name)
    # Set scope BEFORE any business logic touches this job
    Sentry.configure_scope do |scope|
      scope.set_user(id: user_id)
      scope.set_tags(
        job_class: self.class.name,
        step_name: step_name,
        correlation_id: Thread.current[:correlation_id]
      )
      scope.set_context("job_args", {
        order_id: order_id,
        step_name: step_name
        # deliberately omitting: api_key, card_token, raw_email
      })
    end

    # Now do the actual work
    run_step(step_name, user_id, order_id)
  end
end

The TypeScript/BullMQ equivalent uses Sentry.withScope, which takes a callback so the scope is properly isolated per-worker invocation — important when you're running concurrent workers in the same process:

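import * as Sentry from '@sentry/node';
import { Worker } from 'bullmq';

// BullMQ processors are passed to the Worker constructor
// (worker.process is the older Bull API). 'payments' is a placeholder queue name.
const worker = new Worker('payments', async (job) => {
  await Sentry.withScope(async (scope) => {
    scope.setUser({ id: job.data.userId });
    scope.setTag('job_name', job.name);
    scope.setTag('step_name', job.data.stepName);
    scope.setTag('correlation_id', job.data.correlationId);
    scope.setContext('job_args', {
      orderId: job.data.orderId,
      stepName: job.data.stepName,
      // strip anything sensitive before it reaches here
    });

    try {
      await runStep(job.data);
    } catch (err) {
      // Capture while this scope is active, then rethrow so BullMQ
      // records the failure and applies its retry policy.
      Sentry.captureException(err);
      throw err;
    }
  });
}, { connection });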

On redacting secrets: don't get clever here. Maintain an explicit allowlist of fields you'll log, not a blocklist of fields you'll strip. Blocklists rot — someone adds stripe_secret to the job payload six months later and forgets to add it to the blocklist. An allowlist means { order_id, step_name, retry_count } is all that ever hits your error tracker, full stop. This matters especially for GDPR and SOC 2 — your error tracker vendor stores this data, and you're responsible for what's in it.
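
An allowlist filter is only a few lines. This is a sketch, assuming job payloads arrive as hashes with string or symbol keys; ALLOWED_JOB_FIELDS and safe_job_context are hypothetical names:

```ruby
# The only fields that may ever reach the error tracker.
ALLOWED_JOB_FIELDS = %w[order_id step_name retry_count].freeze

# Keeps allowlisted keys and silently drops everything else,
# including fields added to the payload after this code was written.
def safe_job_context(payload)
  payload.transform_keys(&:to_s).slice(*ALLOWED_JOB_FIELDS)
end

payload = {
  order_id: 4521,
  step_name: "charge_card",
  retry_count: 2,
  stripe_secret: "placeholder-secret", # added six months later, never allowlisted
  "raw_email" => "customer@example.com"
}

p safe_job_context(payload)
```

Pass the result to scope.set_context (or setContext) instead of the raw job arguments, and a new sensitive field stays out of the error tracker by default.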

The thing the docs bury: Sentry ships a first-party Sidekiq integration (the sentry-sidekiq gem) that auto-captures job class, queue name, and retry count without any manual instrumentation. I skipped it for a while because I didn't realize it existed and wrote a bunch of middleware by hand. Don't do that. Install sentry-sidekiq, add require 'sentry-sidekiq' after your Sentry init, and you immediately get structured job breadcrumbs and automatic scope cleanup between jobs. For BullMQ, check what your @sentry/node version instruments out of the box before assuming queue-level metadata is captured; either way you still need withScope for per-job user context. Use both layers together: the integration gives you the free stuff, your manual scope adds the business context only you know about.

Step 6 — Local Debugging Workflow That Doesn't Require Production Access

The most painful debugging sessions I've had were ones where I had to reproduce a failure by guessing — because the local environment was too stripped-down to replay anything meaningfully. The fix isn't "just add more logging in prod." The fix is making your local environment close enough to production that you can replay the exact failing job with the exact args it had when it broke. This section is about that.

Replaying a Failed Job From Its Stored Args

Sidekiq stores failed jobs in the dead set with full argument payloads. From the Rails console, you can pull a failed job and re-enqueue it directly against your local worker:

# Run this in a console connected to the Redis that holds the dead job
# (your local Redis, or a tunnel to staging)
require 'sidekiq/api'

dead = Sidekiq::DeadSet.new
job = dead.find_job('abc123deadjobid')  # grab the jid from Sidekiq Web UI

# Inspect what args it had
p job.args  # => ["order_id", 4521, { "retry_count" => 2 }]

# Re-enqueue it. Pick ONE of these, or the job runs twice:
job.retry                            # moves it out of the dead set, same jid
# MyWorker.perform_async(*job.args)  # fresh job with a new jid

For BullMQ, the CLI equivalent is cleaner. Install bull-repl globally and point it at your local Redis (command names below follow the bull-repl README as I remember it; run help inside the REPL to confirm them against your installed version):

npm install -g bull-repl
bull-repl

# Inside the REPL:
connect --uri redis://localhost:6379 myJobQueue

# List failed jobs
failed

# Get the specific job data
get 42

# Re-run it
retry 42

The thing that catches people out: if your job does external HTTP calls, you'll want a local WireMock or msw server recording those responses in staging first, then replaying them locally. Otherwise you're not debugging the failure — you're debugging a completely different execution path.

Temporal's Workflow Replay Is Genuinely Underused

Temporal saves complete event histories for every workflow execution. This means you can take the event history from a failed production workflow and replay it locally, deterministically, against your current code. I cannot overstate how useful this is for debugging multi-step job failures where the bug is in step 4 but the side effects from steps 1–3 already happened.

# Pull the event history from your Temporal server (or Temporal Cloud)
temporal workflow show \
  --workflow-id "order-fulfillment-4521" \
  --namespace production \
  --output json > /tmp/failed_workflow.json

# The CLI has no replay subcommand. Replay runs inside your SDK's replayer,
# which loads the JSON file exported above (typically from a test in the
# worker's own codebase).

Replay does not re-execute any activity code. It only validates that your workflow code produces the same decisions as it did originally. If your recent code changes would have caused a non-determinism error (a common Temporal gotcha), the replay catches it immediately. The Go SDK exposes this as worker.NewWorkflowReplayer() and its ReplayWorkflowHistoryFromJSONFile method, and the other SDKs ship equivalent replayers, so the check can run in an ordinary test.
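
If the replay mechanism feels abstract, the following toy model may help. It is not Temporal's API: `ReplayContext`, `NonDeterminismError`, and the history format are invented for illustration. It only shows the core idea, which is that recorded results are handed back instead of running activities, and any mismatch between what the code asks for and what the history contains is flagged.

```ruby
# Toy model of deterministic replay. NOT Temporal's API; all names are hypothetical.
class NonDeterminismError < StandardError; end

class ReplayContext
  def initialize(history)
    @history = history # ordered list of { activity:, result: } from the original run
    @position = 0
  end

  # Workflow code calls this where it would schedule an activity.
  # During replay nothing is executed; the recorded result is returned.
  def activity(name)
    recorded = @history[@position]
    if recorded.nil? || recorded[:activity] != name
      raise NonDeterminismError,
            "step #{@position}: history has #{recorded&.dig(:activity).inspect}, code asked for #{name.inspect}"
    end
    @position += 1
    recorded[:result]
  end
end

def replay(history, workflow)
  workflow.call(ReplayContext.new(history))
  :ok
rescue NonDeterminismError => e
  e.message
end

history = [
  { activity: 'fetch',     result: { 'order_id' => 4521 } },
  { activity: 'transform', result: { 'sku' => 'A-1' } },
  { activity: 'persist',   result: true }
]

# The code as it was when the history was recorded.
original = lambda do |ctx|
  ctx.activity('fetch')
  ctx.activity('transform')
  ctx.activity('persist')
end

# A later refactor that swapped two steps.
reordered = lambda do |ctx|
  ctx.activity('fetch')
  ctx.activity('persist')
  ctx.activity('transform')
end

puts replay(history, original)
puts replay(history, reordered)
```

The second call reports a mismatch at step 1, which is the kind of signal a real replayer gives you before the changed code ever reaches production.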

Docker Compose Setup That Actually Mirrors Reality

Running just Redis locally and expecting to catch distributed job failures is wishful thinking. You need the full stack: Redis, the worker process, the job dashboard, and a trace collector. This docker-compose.yml gets you Sidekiq Web UI, Redis, and Jaeger in under 10 minutes:

version: '3.8'

services:
  redis:
    image: redis:7.2-alpine
    ports:
      - "6379:6379"

  worker:
    build: .
    command: bundle exec sidekiq -c 2 -q default -q critical
    environment:
      REDIS_URL: redis://redis:6379/0
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4318
    depends_on:
      - redis
    volumes:
      - .:/app  # live code reload during debugging

  sidekiq_web:
    build: .
    # Rack app that mounts Sidekiq::Web — no separate gem needed
    command: bundle exec rackup sidekiq_web.ru -p 9292 -o 0.0.0.0
    ports:
      - "9292:9292"
    environment:
      REDIS_URL: redis://redis:6379/0
    depends_on:
      - redis

  jaeger:
    image: jaegertracing/all-in-one:1.57
    ports:
      - "16686:16686"   # UI
      - "4318:4318"     # OTLP HTTP receiver

The sidekiq_web.ru file is three lines:

# sidekiq_web.ru
require 'sidekiq/web'
Sidekiq::Web.use Rack::Session::Cookie, secret: 'local_debug_secret'
run Sidekiq::Web

After docker compose up, hit http://localhost:9292 for the job dashboard and http://localhost:16686 for traces. Having both open side-by-side while triggering a failure is the fastest way I know to correlate "job failed at step X" with "span showed 5s timeout on the DB call."

Injecting a Failure at a Specific Step Without Mocking the Whole Job

Full mocking wipes out the realistic execution path. Instead, use a targeted fault injector that only blows up at one step. I do this with a simple module that keeps an in-process registry of faults:

# lib/fault_injector.rb (only loaded in test/dev)
module FaultInjector
  FAULTS = {}

  def self.register(step_name, error_class: RuntimeError, message: "Injected failure")
    FAULTS[step_name.to_sym] = { error: error_class, message: message }
  end

  def self.clear!
    FAULTS.clear
  end

  def self.check!(step_name)
    if (fault = FAULTS[step_name.to_sym])
      raise fault[:error], fault[:message]
    end
  end
end

# In your job step:
def process_payment(order)
  # Guarded so the call is a no-op in prod, where this file is never loaded
  FaultInjector.check!(:process_payment) if defined?(FaultInjector)
  PaymentGateway.charge(order)
end

# In your test or local console:
FaultInjector.register(:process_payment, error_class: PaymentGateway::TimeoutError)
# Run inline: FAULTS lives in this process's memory, so a job picked up by a
# separate Sidekiq process would never see the registered fault
MyWorker.new.perform(4521)
FaultInjector.clear!

This approach lets you test exactly what happens when step 3 times out after steps 1 and 2 have already committed, without touching the logic of the other steps. The key constraint: never require fault_injector.rb in production. Gate it with if Rails.env.local? (Rails 7.1+) or an explicit require in your test helper, and keep the call site guarded so it doesn't raise NameError where the module is absent. I've seen people accidentally ship fault injection code to prod and spend a confusing hour wondering why jobs were randomly dying.
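
One variation worth considering is a block-scoped injector, so a forgotten cleanup can't leak a fault into the next test. The sketch below is self-contained and uses hypothetical names (`ScopedFaults`, `with_fault`, and the three step names); it is an alternative shape for the same idea, not a drop-in replacement.

```ruby
require 'timeout'

# Hypothetical block-scoped fault registry: the fault exists only inside the block.
module ScopedFaults
  @faults = {}

  class << self
    def with_fault(step, error_class: RuntimeError, message: 'Injected failure')
      @faults[step.to_sym] = [error_class, message]
      yield
    ensure
      @faults.delete(step.to_sym) # always cleaned up, even when the block raises
    end

    def check!(step)
      error_class, message = @faults[step.to_sym]
      raise error_class, message if error_class
    end
  end
end

# A stand-in pipeline: each step "commits" by appending its name.
committed = []
pipeline = lambda do
  %i[reserve_inventory create_invoice process_payment].each do |step|
    ScopedFaults.check!(step)
    committed << step
  end
end

begin
  ScopedFaults.with_fault(:process_payment, error_class: Timeout::Error) { pipeline.call }
rescue Timeout::Error => e
  puts "failed after #{committed.inspect}: #{e.message}"
end
```

The registry is a plain module-level hash, so it is shared across threads; that is fine for a console or a single-threaded test, but it would need a mutex or thread-local storage under a concurrent test runner.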

Step 7 — Dashboards and Alerting That Surface Problems Before Users Report Them

The Metrics That Actually Tell You Something Is Wrong

Most teams instrument their job systems and then watch the wrong numbers. CPU usage and memory are fine but they're trailing indicators — you already have a problem by the time those spike. The four metrics I've found worth building dashboards around are: job duration by step (not aggregate duration — per step, so you can see which stage is slow), failure rate per queue (not global failure rate, which masks a single bad queue drowning everything else), retry count distribution (a job being retried 4 times before succeeding is almost as bad as one that fails — you're just not seeing the pain), and dead letter queue depth. That last one should be zero. Always. An alert on DLQ depth > 0 is the single highest-signal alert you can set up for async job infrastructure.

Exporting Sidekiq Metrics to Prometheus

The sidekiq-prometheus-exporter gem gives you a /metrics endpoint that Prometheus can scrape. Setup is about 10 minutes:

# Gemfile
gem 'sidekiq-prometheus-exporter', '~> 0.1'

# config/initializers/sidekiq.rb
require 'sidekiq/web'
require 'sidekiq/prometheus/exporter'

Sidekiq::Web.register(Sidekiq::Prometheus::Exporter)

Mount Sidekiq::Web in your routes if you haven't already, then point Prometheus at https://yourapp.com/sidekiq/metrics. The exporter surfaces queue latency, processed count, failed count, and retry set size out of the box. What it doesn't give you without custom instrumentation is per-step duration inside a single job — for that you need to manually call ActiveSupport::Notifications or a StatsD client at each step boundary. The gem is solid for queue-level health; it won't help you figure out which of your 6 processing steps is eating 80% of job time.
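
For the per-step timing the exporter does not cover, a small wrapper at each step boundary is usually enough. This is a minimal sketch under assumptions: `StepTimer`, its `measure` method, and the `sink` callable are hypothetical, and in a real app the sink would forward to ActiveSupport::Notifications or a StatsD client instead of collecting samples in memory.

```ruby
# Hypothetical per-step timer. Records duration and outcome for each step.
class StepTimer
  attr_reader :samples

  def initialize(sink: nil)
    @samples = []
    @sink = sink # optional callable, e.g. a StatsD or Notifications adapter
  end

  def measure(job:, step:)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status = 'success'
    yield
  rescue StandardError
    status = 'failure'
    raise # never swallow the error; the job's retry logic still needs it
  ensure
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(2)
    sample = { job: job, step: step, status: status, duration_ms: elapsed_ms }
    @samples << sample
    @sink&.call(sample)
  end
end

timer = StepTimer.new
timer.measure(job: 'invoice', step: 'fetch') { sleep 0.01 }
begin
  timer.measure(job: 'invoice', step: 'persist') { raise IOError, 'db down' }
rescue IOError
  # expected in this demo; the sample is still recorded
end

timer.samples.each { |s| puts s }
```

The monotonic clock is used rather than Time.now so that NTP adjustments on the host cannot produce negative or inflated durations.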

BullMQ + bull-board: Good Ops Dashboard, Weak Debugging Tool

If you're on Node with BullMQ, bull-board spins up a web dashboard in about five minutes:

npm install express @bull-board/express @bull-board/api bullmq

// server.js
import express from 'express';
import { createBullBoard } from '@bull-board/api';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { ExpressAdapter } from '@bull-board/express';
import { Queue } from 'bullmq';

const app = express();

const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath('/admin/queues');

createBullBoard({
  queues: [
    // Queue connects to Redis on localhost:6379 unless you pass a connection option
    new BullMQAdapter(new Queue('email')),
    new BullMQAdapter(new Queue('image-processing')),
  ],
  serverAdapter,
});

app.use('/admin/queues', serverAdapter.getRouter());
app.listen(3000);

You get queue depth, active/waiting/failed job counts, and the ability to retry failed jobs manually. The honest limitation: bull-board is an ops tool. It tells you that something failed and lets you poke at individual job payloads, but it has no concept of multi-step job flows, no timeline view, and no way to correlate a failure in job B with the output of job A that triggered it. For a team that just needs "show me what's stuck in the queue" at 2am, it's perfect. For debugging why a 7-step pipeline occasionally produces corrupt output, it's nearly useless.

The DLQ Alert You Need Before Your Users Become Your Monitoring System

The alert that's saved me more times than anything else is dead letter queue depth greater than zero. By the time a user files a ticket saying "my emails aren't sending," you've likely had jobs silently exhausting retries and landing in the DLQ for 30+ minutes. Here's a Prometheus alerting rule that pages once the dead set has been non-empty for a minute:

# prometheus/alerts.yml
groups:
  - name: sidekiq
    rules:
      - alert: DeadLetterQueueNotEmpty
        expr: sidekiq_dead_jobs > 0   # metric name as exposed by sidekiq-prometheus-exporter; confirm against your /metrics output
        for: 1m   # don't alert on a transient single failure, but 1 min is plenty
        labels:
          severity: critical
        annotations:
          summary: "Jobs in dead letter queue ({{ $value }} jobs)"
          description: "Sidekiq dead queue has {{ $value }} jobs. Check /sidekiq for details."

The for: 1m buffer absorbs a scrape blip or a dead job that someone retries right away, but anything sitting dead for over a minute is a real problem. For BullMQ, failed jobs live in a Redis sorted set rather than a list, so the count is ZCARD bull:email:failed (or queue.getFailedCount() from code). You can export this with a custom Prometheus exporter or just a cron that pushes a gauge. The point is that DLQ depth should never be a background number you review weekly; it should demand immediate attention every single time.

Temporal's UI vs tctl — Know Which One to Reach For

Temporal ships with a web UI that renders the full event history of any workflow execution as a timeline. When I'm investigating a failure I can see exactly which activity failed, what input it received, what exception it threw, and how many times it was retried — all without touching a terminal. For most debugging sessions, especially when you're sharing context with someone who isn't deep in the CLI, the UI wins.

Where tctl beats the UI is bulk operations and scripting. Searching for all failed workflows of a specific type in a time range:

# tctl v1.x syntax — note: tctl is being replaced by 'temporal' CLI in newer versions
tctl --namespace default workflow list \
  --query 'WorkflowType="OrderFulfillment" AND ExecutionStatus="Failed"' \
  --fields WorkflowId,StartTime,CloseTime,ExecutionStatus

# Inspect full event history for a specific run
# (showid is the positional shortcut: tctl workflow showid <workflow_id> <run_id>)
tctl --namespace default workflow show \
  --workflow_id order-8821 \
  --run_id abc123-...

The UI is slow when a workflow has thousands of events. I've seen it hang on workflows with 5,000+ history events (Temporal's hard limit is 51,200 events or 50 MB of history per workflow execution, and you do not want to hit that). tctl pages through those without flinching. My rule: use the UI to understand a single failure, use the CLI to find patterns across failures or automate remediation.

When to Pick Temporal vs Sidekiq vs BullMQ for Multi-Step Jobs

The most common mistake I see teams make is reaching for Temporal because it sounds like the right solution for complex workflows, then spending two months on infrastructure before their first job runs in production. Picking the wrong orchestrator doesn't just cost you engineering time — it shapes how you debug failures for the lifetime of the system. So here's my honest take after running all three in production.

Temporal: Worth It, But Earn It First

Temporal earns its complexity when your workflow has branching logic that depends on external state, human approval steps that might wait 72 hours, or retries that need to survive a full server restart. The debugging experience genuinely is better: you get a full event history for every workflow execution, you can see exactly which activity failed and what the input was, and you can replay executions locally against production history. That last one is a superpower for debugging multi-step failures. But you're also operating a Temporal cluster (or paying for Temporal Cloud, which bills per action plus storage, with list prices on the order of tens of dollars per million actions; check the current pricing page), writing workers in their SDK idiom, and teaching your team a new mental model for how code executes.

# Temporal Cloud cost check: actions add up fast in loops
# A workflow that polls every 30s for 24h = ~2,880 actions (24 h x 120 polls/h)
# That is well under a dollar per workflow, but signal-heavy workflows with many activities hit differently
temporal workflow show --workflow-id order-approval-xyz --namespace prod

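The gotcha nobody warns you about: Temporal's determinism requirement will break you the first time you put a raw random-number or wall-clock call (rand.Intn() or time.Now() in Go, for example) directly in a workflow definition. On replay the code produces different commands than the original run recorded, and the workflow fails with a non-determinism error that is confusing the first time you see it. Non-deterministic calls belong in Activities or behind the SDK's deterministic helpers (workflow.Now() in Go), not in plain workflow code. The TypeScript SDK is the exception, because its workflow sandbox swaps Math.random() and Date for deterministic versions. None of this is obvious from the quick-start docs.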

Sidekiq: The Boring Right Answer for Ruby

If your team is already in Rails, Sidekiq Pro (a paid commercial license; check sidekiq.org for current pricing) gets you batches, reliable scheduling, and a Web UI that non-engineers can actually read. The ecosystem around it (sidekiq-unique-jobs, sidekiq-failures, structured logging via Sidekiq::Middleware) covers most of what people bolt Temporal onto. The one thing you must add yourself is a correlation ID that flows from enqueue through every retry:

# config/initializers/sidekiq.rb
Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add CorrelationIdMiddleware  # stamps job_id + trace_id on every job
  end
end

# In your worker — log every step with this context
def perform(order_id)
  logger.info({ event: "payment.started", order_id:, trace_id: jid })
  PaymentService.charge(order_id)
  logger.info({ event: "payment.succeeded", order_id:, trace_id: jid })
end
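
The CorrelationIdMiddleware referenced above is not shown in the article, so here is one plausible shape for it. Treat it as a sketch under assumptions: the `correlation_id` payload key and the `Thread.current[:correlation_id]` lookup are my choices, not a Sidekiq convention. The `call(job_class, job, queue, redis_pool)` signature with a `yield` is the standard Sidekiq client middleware contract, and the block below simulates it without loading Sidekiq.

```ruby
require 'securerandom'

# Sketch of a Sidekiq client middleware that stamps a correlation ID on the
# job payload at enqueue time. The 'correlation_id' key is an assumption.
class CorrelationIdMiddleware
  def call(_job_class, job, _queue, _redis_pool)
    # Reuse an ID already on the payload (retries), inherit one from the
    # enqueuing request or parent job if present, otherwise mint a new one.
    job['correlation_id'] ||= Thread.current[:correlation_id] || SecureRandom.uuid
    yield
  end
end

middleware = CorrelationIdMiddleware.new

# First enqueue: no ID yet, so one is generated.
job = { 'class' => 'OrderJob', 'args' => [4521] }
middleware.call('OrderJob', job, 'default', nil) { job }
first_id = job['correlation_id']

# A retry pushes the same payload again: the ID must not change.
middleware.call('OrderJob', job, 'default', nil) { job }

# A child job enqueued while a parent is running inherits the parent's ID.
Thread.current[:correlation_id] = first_id
child = { 'class' => 'NotifyJob', 'args' => [4521] }
middleware.call('NotifyJob', child, 'default', nil) { child }
Thread.current[:correlation_id] = nil

puts first_id
```

A matching server middleware would copy `job['correlation_id']` into the thread-local before the job runs, so that logging inside `perform` and any jobs it enqueues pick it up.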

Multi-step orchestration in Sidekiq means chaining jobs — each job enqueues the next on success, and you handle failures per-step with sidekiq_retries_exhausted. It's not as elegant as Temporal's event history, but if you ship structured logs to Datadog or Loki and filter by trace_id, you reconstruct the full execution path just fine. The limitation shows when you need step-level rollback or conditional branching more than 2 levels deep — that's when the job-chain pattern turns into spaghetti.
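
The job-chain pattern can be modelled in a few lines without Sidekiq, which makes it easier to see where the trace_id travels and where the chain stops. `JobChain` and the step lambdas are hypothetical; the in-memory queue stands in for Redis, and the rescue branch marks the spot where Sidekiq's retries and sidekiq_retries_exhausted would take over.

```ruby
# In-memory model of job chaining: each step enqueues the next on success.
class JobChain
  attr_reader :log

  def initialize(steps)
    @steps = steps       # ordered Hash: step name => callable taking a context Hash
    @order = steps.keys
    @queue = []
    @log = []
  end

  def enqueue(step, ctx)
    @queue << [step, ctx]
  end

  def work_off
    until @queue.empty?
      step, ctx = @queue.shift
      @log << { event: "#{step}.started", trace_id: ctx['trace_id'] }
      begin
        result = @steps.fetch(step).call(ctx)
      rescue StandardError => e
        @log << { event: "#{step}.failed", trace_id: ctx['trace_id'], error: e.class.name }
        next # chain stops here; retries / sidekiq_retries_exhausted would handle this step
      end
      @log << { event: "#{step}.succeeded", trace_id: ctx['trace_id'] }
      following = @order[@order.index(step) + 1]
      enqueue(following, result) if following
    end
  end
end

steps = {
  'fetch'     => ->(ctx) { ctx.merge('payload' => { 'id' => ctx['order_id'] }) },
  'transform' => ->(ctx) { ctx.merge('row' => nil) }, # quietly produces nil instead of raising
  'persist'   => ->(ctx) { ctx.fetch('row').fetch('order_id'); ctx }
}

chain = JobChain.new(steps)
chain.enqueue('fetch', { 'trace_id' => 't-1', 'order_id' => 4521 })
chain.work_off
chain.log.each { |entry| puts entry }
```

Filtering the log by trace_id shows transform "succeeding" right before persist fails, which is the breadcrumb that points at the real culprit.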

BullMQ: The Node Default With Real Redis Gotchas

BullMQ is the correct starting point for any Node.js shop. The API is clean, the queue/worker separation maps well to microservices, and the built-in job progress tracking helps when debugging long-running steps. Where teams get burned is Redis connection management under load. Each BullMQ Worker holds a dedicated blocking Redis connection on top of its regular one, so with 10 queues and 5 worker processes you're at 50 blocking connections (and roughly double that in total) before you've handled a single job. On Redis Cloud's free tier (30MB, 30 connections limit), that ceiling arrives fast.

// bullmq worker: BullMQ talks to Redis through ioredis, not the node-redis client
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null, // required by BullMQ for worker connections
  retryStrategy: (times) => Math.min(times * 100, 3000), // don't hammer Redis on reconnect
});

const worker = new Worker('invoice-pipeline', async (job) => {
  // job.log() writes to BullMQ's built-in job log — queryable from Bull Board
  await job.log(`Step 1 started: fetching invoice ${job.data.invoiceId}`);
  // ...
}, {
  connection,
  concurrency: 5, // tune this per worker, not globally
});

The other gotcha: when a BullMQ worker crashes mid-job, its lock expires and the stall checker (stalledInterval, default 30 seconds) marks the job as stalled and puts it back in the queue. It only moves to failed after exceeding maxStalledCount. If you're debugging why a job "ran but nothing happened," check for stalls first, because a job that stalled and was then picked up again never shows as failed, so your error alerting won't fire. Listen for the stalled event on QueueEvents and log it with the job ID. Bull Board is free and quick to wire in; there's no reason to run BullMQ without it.

The Honest Middle Ground

For most product teams — not infrastructure teams building platforms — Sidekiq or BullMQ with structured logging, a correlation ID, and a tracing integration like OpenTelemetry gets you 90% of Temporal's debuggability at a fraction of the operational cost. The 10% you give up is durable execution across multi-day waits and built-in workflow visualization. If your jobs complete in under an hour and don't branch on human input, you don't need Temporal yet. Start simple, add observability aggressively, and you'll know when you've actually outgrown it.

| Feature | Temporal | Sidekiq Pro | BullMQ |
| --- | --- | --- | --- |
| Step orchestration | First-class (Workflow + Activity model) | Manual job chaining | |
| Built-in UI | Temporal Web (needs separate deploy) | Sidekiq Web (Rack middleware, 5 min) | Bull Board (Express middleware, 20 min) |
| Tracing support | Native OTel via SDK interceptors | Manual middleware + OTel gem | Manual middleware + OTel SDK |
| Language ecosystem | Go, Java, Python, TypeScript, PHP | Ruby only | Node.js / TypeScript only |
| Self-host complexity | High (Temporal server + DB + Elasticsearch optional) | Low (Redis + your app) | |

The Debugging Checklist I Actually Run Through When a Job Fails

The thing that saves me the most time isn't a clever tool — it's the order I do things. Most devs jump straight to the code when a job fails. That's usually wrong. The failure is almost never where you think it is, and if you start there, you'll spend 45 minutes reading perfectly fine code while the real culprit was a flaky third-party API that was down for 8 minutes at 3 AM.

Step 1: Find the correlation ID

Every alert, every error report, every Slack notification should have a correlation ID (also called trace ID or job ID depending on your stack). If it doesn't, that's a separate problem to fix immediately. In my setup, the correlation ID is stamped at job enqueue time and propagates through every log line and HTTP header. Without it, you're not debugging — you're guessing. Check the error report body, the PagerDuty alert payload, or your Sentry breadcrumbs. It'll look something like job_id=a3f8c1d2-99b4-4e7a-bc11-000f72a1e890. Write it down. Everything from here flows from it.

Step 2: Pull all structured logs for that correlation ID in chronological order

This is where structured logging pays back everything you invested in it. In Datadog or Loki, this is a one-liner:

# Loki via logcli (--forward returns oldest first; the default is newest first)
logcli query '{app="worker"} |= "a3f8c1d2-99b4-4e7a-bc11-000f72a1e890"' --from="2024-01-15T02:50:00Z" --to="2024-01-15T03:10:00Z" --limit=500 --forward

# Or in Datadog
@correlation_id:a3f8c1d2-99b4-4e7a-bc11-000f72a1e890 service:job-worker

Sort ascending by timestamp, not by severity. You want the story in order, not grouped by what looked bad. I've been burned before by jumping to ERROR lines and missing a WARNING three steps earlier that was the actual root cause.

Step 3: Find the last successful step log entry

Async jobs that span multiple steps should log completion of each step explicitly — something like step=payment_capture status=success duration_ms=143. Scan the chronological log and find the last line where a step completed successfully. Everything after that is your blast radius. This is the boundary that tells you which step to read the code for, which downstream calls to audit, and which team to loop in. If the last successful step was step=inventory_reserve and then nothing, you're looking at whatever happens between reservation and the next step.
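
When the log volume is too large to scan by eye, the same check is easy to script. The sketch below assumes logfmt-style lines with `ts`, `step`, and `status` fields, matching the example format above; the `ts` field and the `shipment_create` step are assumptions added for the demo.

```ruby
# Parse "key=value" pairs from a logfmt-style line into a Hash.
def parse_fields(line)
  line.scan(/(\w+)=(\S+)/).to_h
end

# Return the most recent entry where a step reported success.
# Assumes every entry carries an ISO 8601 UTC 'ts', which sorts correctly as a string.
def last_successful_step(lines)
  lines.map { |line| parse_fields(line) }
       .select { |entry| entry['step'] && entry['status'] == 'success' }
       .max_by { |entry| entry['ts'] }
end

lines = [
  'ts=2024-01-15T03:04:01Z step=payment_capture status=success duration_ms=143',
  'ts=2024-01-15T03:04:09Z step=shipment_create status=error error=Timeout',
  'ts=2024-01-15T03:04:02Z step=inventory_reserve status=success duration_ms=88'
]

boundary = last_successful_step(lines)
puts "last good step: #{boundary['step']} at #{boundary['ts']}"
```

Sorting by timestamp inside the function means the input does not have to arrive in order, which matters when logs from several workers are merged.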

Step 4: Check the trace for a span that never closed

If you're on Jaeger, Honeycomb, or Tempo, open the trace for that correlation ID and sort spans by duration descending. A span that never closed — meaning it has no end timestamp or shows an absurdly long duration like 30 seconds when everything else is under 200ms — is almost always the exact line of code that hung or threw. In Honeycomb this is particularly obvious because incomplete spans render visually differently. The thing that caught me off guard the first time: a span can show as "completed" if your instrumentation swallows exceptions without re-raising them. Always check the span's status code, not just whether it closed.
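
If you export a trace as JSON, the three checks described here (no end timestamp, error status, duration far outside the norm) can be automated. This is a rough sketch over a simplified span shape; the `:start_ms`, `:end_ms`, and `:status` keys and the 10x-median threshold are assumptions, not any tracing backend's schema.

```ruby
# Flag spans that never closed, closed with an error status, or ran far
# longer than the median closed span.
def suspicious_spans(spans, slow_factor: 10)
  durations = spans.select { |s| s[:end_ms] }
                   .map { |s| s[:end_ms] - s[:start_ms] }
                   .sort
  median = durations.empty? ? 0 : durations[durations.size / 2]

  spans.filter_map do |span|
    reason =
      if span[:end_ms].nil?
        'never closed'
      elsif span[:status] == 'ERROR'
        'closed with error status'
      elsif median.positive? && (span[:end_ms] - span[:start_ms]) > median * slow_factor
        'duration outlier'
      end
    [span[:name], reason] if reason
  end
end

spans = [
  { name: 'fetch',     start_ms: 0,      end_ms: 120,    status: 'OK' },
  { name: 'transform', start_ms: 120,    end_ms: 150,    status: 'OK' },
  { name: 'persist',   start_ms: 150,    end_ms: 30_150, status: 'OK' },
  { name: 'notify',    start_ms: 30_150, end_ms: nil,    status: 'UNSET' },
  { name: 'audit',     start_ms: 100,    end_ms: 180,    status: 'ERROR' }
]

suspicious_spans(spans).each { |name, reason| puts "#{name}: #{reason}" }
```

Checking the status separately from the end timestamp covers the case mentioned above, where instrumentation swallows an exception and the span still looks complete.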

Step 5: Check the retry count and whether the error changed

A job that failed the same way on all 3 retries is a deterministic bug — bad input, broken code path, a record that doesn't exist. A job where the error changed between retries is almost always an infrastructure or concurrency problem. Retry 1 fails with connection timeout, retry 2 fails with duplicate key violation, retry 3 fails with record not found — that's a race condition or a non-idempotent step that partially mutated state on the first attempt. These are the nastiest bugs. The retry count also tells you how much clock time passed — if retries are spaced 30 seconds apart and you have 5 of them, that's 2+ minutes of a potentially corrupt intermediate state sitting in your DB.
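
The same-error versus changing-error distinction is simple enough to encode, which helps if you want to tag incidents automatically. The function names and the shape of the attempt records below are hypothetical; the arithmetic assumes a fixed retry interval, whereas Sidekiq's default backoff grows with each attempt.

```ruby
# Classify a job's retry history by whether the error class stayed the same.
def classify_retries(attempts)
  classes = attempts.map { |a| a[:error_class] }.uniq
  return :no_failures if classes.empty?

  classes.size == 1 ? :deterministic : :changing
end

# Lower bound on how long intermediate state may have been exposed,
# assuming a fixed interval between attempts.
def retry_window_seconds(attempt_count, interval_seconds)
  attempt_count * interval_seconds
end

same_error = [
  { attempt: 1, error_class: 'NoMethodError' },
  { attempt: 2, error_class: 'NoMethodError' },
  { attempt: 3, error_class: 'NoMethodError' }
]

shifting_error = [
  { attempt: 1, error_class: 'Timeout::Error' },
  { attempt: 2, error_class: 'DuplicateKeyError' },
  { attempt: 3, error_class: 'RecordNotFound' }
]

puts classify_retries(same_error)
puts classify_retries(shifting_error)
puts retry_window_seconds(5, 30)
```

A `:changing` result is the cue to look for a non-idempotent step before reading any business logic.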

Step 6: Check if a downstream service had an outage in that window

Before touching any code, go check your DB metrics, your external API's status page, and your internal service dashboards for the exact window when the job ran. Stripe has a public status page. AWS RDS publishes CloudWatch metrics you can query. I've spent hours chasing "bugs" that were a Postgres instance that hit connection pool limits at 3:04 AM and recovered by 3:07 AM. If you have a correlation between the failure timestamp and a spike in DB connection wait time, you're done — the fix is circuit breaker configuration or retry policy, not application code.
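
Correlating a failure timestamp with known incident windows is a one-function job once the windows are written down. The incident list below is made up for the demo, and the 60-second slack is an arbitrary allowance for clock skew between systems.

```ruby
require 'time'

# Return incidents whose window (padded by slack_seconds) contains the failure time.
def overlapping_incidents(failure_at, incidents, slack_seconds: 60)
  failed = Time.iso8601(failure_at)
  incidents.select do |incident|
    window_start = Time.iso8601(incident[:from]) - slack_seconds
    window_end   = Time.iso8601(incident[:to]) + slack_seconds
    window_start <= failed && failed <= window_end
  end
end

# Hypothetical incident windows gathered from dashboards and status pages.
incidents = [
  { source: 'postgres connection pool', from: '2024-01-15T03:04:00Z', to: '2024-01-15T03:07:00Z' },
  { source: 'payment provider',         from: '2024-01-15T01:00:00Z', to: '2024-01-15T01:20:00Z' }
]

matches = overlapping_incidents('2024-01-15T03:05:12Z', incidents)
matches.each { |m| puts "failure overlaps: #{m[:source]}" }
```

An overlap is only a strong hint, not proof, but it is usually enough to decide whether to start with retry policy or with application code.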

Step 7: Reproduce locally with the original job args before touching production

Only after all of the above do I write any code. And the first thing I write is a reproduction script using the exact original arguments the job received — not approximations, not similar data. Pull the original payload from your job queue's dead letter queue or from the structured log that shows job enqueue:

# Example: re-enqueue a failed job from Sidekiq's dead set using original args
require 'sidekiq/api'

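dead = Sidekiq::DeadSet.new
job = dead.find { |j| j.jid == 'a3f8c1d2' }  # substitute the job's full jid; a prefix will not match
raise 'job not found in dead set' unless job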

# Print the original args before you touch anything
puts job.args.inspect
# => [{"order_id"=>"ORD-9921", "user_id"=>44821, "amount_cents"=>4999}]

# Run directly in a local console — not re-enqueue, not production
OrderFulfillmentJob.new.perform(*job.args)

Running against production data directly to "just check something quickly" is how you create the second incident. Reproduce locally, confirm the failure, then fix it. The reproduction also becomes your regression test — wrap it in a spec and you never have to debug that exact failure path again.

