Imagine that your Rails application is running smoothly. Users are satisfied, payments are processed, emails are sent out, and reports are produced. Sidekiq feels like magic.
Then one Friday evening (because production issues always happen on Fridays), you check the Sidekiq dashboard 💥
Jobs are piling up faster than your interns can say, “Did we forget to add sidekiq_options retry: false again?”
Welcome to Sidekiq at scale—where background jobs stop being background and start being your entire life.
Here are some lessons I've learned from implementing Sidekiq at scale in production, along with metrics and real-world examples.
1. Keep Jobs Small and Fast
Jobs should be atomic and quick. A 5–10 second job is already too long in most production systems.
Bad example:

class ReportJob < ApplicationJob
  def perform(user_id)
    user = User.find(user_id)
    generate_pdf(user)   # CPU-heavy
    upload_to_s3(user)   # I/O-heavy
    send_email(user)     # External API
  end
end
If this fails halfway, the entire job retries.
✅ Better:
class GeneratePdfJob < ApplicationJob
  def perform(user_id)
    PdfGenerator.call(User.find(user_id))
  end
end

class UploadReportJob < ApplicationJob
  def perform(file_path)
    S3Uploader.call(file_path)
  end
end

class SendReportJob < ApplicationJob
  def perform(user_id, report_id)
    # report_email is your mailer action
    ReportMailer.with(user_id:, report_id:).report_email.deliver_now
  end
end
Chain jobs by enqueuing the next step from each job, or use Sidekiq Pro batches.
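The chaining idea itself needs no Sidekiq at all. Here's a minimal sketch where an in-memory array stands in for the Redis queue and each step enqueues its successor (the `QUEUE` constant, step classes, and fake paths are all illustrative, not Sidekiq API):

```ruby
# Stand-in for Sidekiq's queue: an array of [job_class, args] pairs.
QUEUE = []

class GeneratePdfStep
  def self.perform(user_id)
    path = "/tmp/report-#{user_id}.pdf"  # pretend we rendered a PDF
    QUEUE << [UploadReportStep, [path]]  # chain: enqueue the next step
    path
  end
end

class UploadReportStep
  def self.perform(file_path)
    url = "s3://reports#{file_path}"     # pretend we uploaded it
    QUEUE << [SendReportStep, [url]]     # chain: enqueue the final step
    url
  end
end

class SendReportStep
  def self.perform(url)
    "emailed link to #{url}"             # pretend we sent the email
  end
end

# A toy worker loop: pop jobs until the queue drains.
QUEUE << [GeneratePdfStep, [42]]
result = nil
until QUEUE.empty?
  job, args = QUEUE.shift
  result = job.perform(*args)
end
puts result  # => "emailed link to s3://reports/tmp/report-42.pdf"
```

If the upload step fails here, only that step retries; the PDF is never regenerated and the email is never sent twice.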
2. Idempotency is Non-Negotiable
Jobs will run at least once, sometimes more. If jobs aren’t idempotent, you’ll double-charge customers or send duplicate emails.
Instead of:
order.mark_as_paid!
send_receipt(order)
Do:

order.update!(status: "paid") unless order.paid?
ReceiptMailer.receipt(order).deliver_later unless order.receipt_sent?

(Note: ReceiptMailer.send would collide with Ruby's Object#send; call a named mailer action instead.)
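Guard clauses help, but a retry can still race a concurrent run of the same job. A stronger pattern is an idempotency key claimed atomically before doing the work. A minimal sketch, using a plain Ruby Set as a stand-in for Redis SETNX (the `PROCESSED` constant and `charge!` method are illustrative):

```ruby
require "set"

# Stand-in for Redis: in production, claim the key with SETNX plus a TTL.
PROCESSED = Set.new

def charge!(order_id)
  key = "charge:order:#{order_id}"
  # Set#add? returns nil if the key already exists — i.e. we already charged.
  return :already_done unless PROCESSED.add?(key)
  # ... actually charge the customer here ...
  :charged
end

charge!(1001)  # => :charged
charge!(1001)  # retry of the same job => :already_done
```

The claim happens before the side effect, so even an at-least-once delivery charges the customer exactly once.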
3. Use Queues Strategically
Don’t dump everything into default. Separate workloads:

- critical → user-facing (emails, webhooks, payments)
- default → normal jobs (notifications, syncs)
- low → heavy, non-urgent (reports, exports, ETL)
sidekiq.yml:
:queues:
  - [critical, 5]
  - [default, 3]
  - [low, 1]
4. Monitor Everything
Visibility is critical at scale:
- Sidekiq Web UI for queue depth & retries
Note that Sidekiq stores jobs in Redis, not your SQL database, so monitor through the Sidekiq API:

require "sidekiq/api"

Sidekiq::Queue.new("default").size     # jobs waiting in the queue
Sidekiq::Queue.new("default").latency  # seconds the oldest job has waited
Sidekiq::RetrySet.new.size             # jobs scheduled for retry
5. Control Concurrency
More threads ≠ better throughput. High concurrency can overwhelm Postgres or external APIs.
✅ Start with concurrency = 2–5x your CPU cores.
:concurrency: 15
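One gotcha worth calling out: every Sidekiq thread needs its own database connection, so the Rails connection pool must be at least as large as Sidekiq's concurrency or threads will stall waiting for connections. A sketch of the matching database.yml entry, assuming the concurrency of 15 above:

```yaml
# config/database.yml
production:
  # pool >= Sidekiq concurrency, or threads block on connection checkout
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 15) %>
```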
6. Redis is Your Heartbeat
Redis is the single point of truth for Sidekiq.
- Use dedicated Redis (not shared with cache/session store)
- Monitor memory & latency
7. Smart Retries Only
Default retries (up to 25) can flood your system. Customize them.
class MyJob
  include Sidekiq::Worker
  sidekiq_options retry: 3

  def perform
    ExternalService.call!
  rescue ExternalService::InvalidCredentials => e
    # Don't retry — bad credentials won't fix themselves.
    # Rescuing without re-raising marks the job as done, so Sidekiq won't retry it.
    Rails.logger.error("Credentials rejected: #{e.message}")
  end
end
8. Deploy with Process Separation
Don’t run one mega Sidekiq instance. Separate by queue type.
bundle exec sidekiq -q critical,5 -q default,2
bundle exec sidekiq -q low
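On Heroku-style platforms, those commands map naturally onto separate process types, which you can then scale independently. A sketch of a Procfile (the process names are illustrative):

```
worker_critical: bundle exec sidekiq -q critical,5 -q default,2
worker_low: bundle exec sidekiq -q low
```

Now a flood of low-priority exports can never starve payment webhooks of worker threads.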
9. Dead Jobs are Zombie Failures
Dead jobs = silent failures.
Inspect and clear them through the Sidekiq API (sidekiqctl only manages processes; it can't clean queues):

require "sidekiq/api"

dead = Sidekiq::DeadSet.new
dead.size                      # jobs that died after exhausting retries
dead.each { |job| job.retry }  # re-enqueue, or job.delete to discard
dead.clear                     # wipe the whole set
10. Autoscale or Suffer
In Kubernetes, ECS, or Heroku, autoscale workers on queue latency:
- Scale up if latency > 30s
- Scale down if latency < 5s
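The decision logic itself is simple enough to sketch. `desired_workers` below is a hypothetical helper you'd feed into your autoscaler, using the thresholds from the bullets above:

```ruby
# Returns the new worker count given current queue latency in seconds.
# Scale up above 30s of latency, down below 5s, otherwise hold steady.
def desired_workers(current:, latency:, min: 1, max: 20)
  if latency > 30
    [current + 1, max].min
  elsif latency < 5
    [current - 1, min].max
  else
    current
  end
end

desired_workers(current: 3, latency: 45)  # => 4 (queue is falling behind)
desired_workers(current: 3, latency: 2)   # => 2 (drain the idle workers)
desired_workers(current: 3, latency: 12)  # => 3 (within the comfort band)
```

Scaling on latency rather than raw queue depth is the key: a deep queue of millisecond jobs is fine, while a shallow queue that sits untouched for a minute is not.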
11. Chain Jobs for External API Pagination
Sometimes, the firehose comes from outside your app. A classic case: syncing a huge dataset from a third-party API that only gives you paginated results.
If you naively fetch all pages in one job, you’ll hit rate limits, blow up memory, and make retries a nightmare.
✅ Better: chain jobs page by page. Each job handles one page, then enqueues the next, until the API says you’re done. That way:

- Jobs stay tiny and idempotent.
- Failures retry gracefully without restarting the whole sync.
- You respect API rate limits with backoff and jitter.
- You can fan out each page’s items to dedicated jobs for massive throughput.
Example:
class SyncExternalPage
  include Sidekiq::Worker
  sidekiq_options queue: :sync, retry: 10

  def perform(cursor = nil)
    response = ExternalClient.fetch_page(cursor: cursor)

    response[:items].each do |raw|
      UpsertExternalItem.perform_async(raw) # idempotent per item
    end

    if response[:next_cursor]
      # Chain the next page, throttled by a 1-second delay
      self.class.perform_in(1, response[:next_cursor])
    else
      Rails.logger.info("Sync complete ✅")
    end
  end
end
class UpsertExternalItem
  include Sidekiq::Worker

  def perform(raw)
    ExternalRecord.upsert(
      { external_id: raw["id"], name: raw["name"] },
      unique_by: :index_external_records_on_external_id
    )
  end
end
This pattern lets you chew through millions of external records without drowning your system in retries or hitting API bans.
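The example above waits a fixed second between pages; the "backoff and jitter" mentioned earlier can be sketched as a small helper (hypothetical, not part of Sidekiq) whose result you'd pass to perform_in when the API starts returning 429s:

```ruby
# Exponential backoff with full jitter: the ceiling doubles per attempt,
# and randomization spreads retries out so they don't stampede the API.
def backoff_with_jitter(attempt, base: 1, cap: 300)
  ceiling = [base * (2**attempt), cap].min
  rand(0.0..ceiling.to_f)
end

delay = backoff_with_jitter(3)  # somewhere in 0..8 seconds
# SyncExternalPage.perform_in(delay, next_cursor)
```

Full jitter (random between zero and the exponential ceiling) beats a fixed exponential delay here, because a hundred retrying page jobs won't all hit the API in the same instant.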
Conclusion
Sidekiq is incredibly robust, but at scale, you need production-grade practices:
- Keep jobs small, fast, idempotent
- Prioritize with queues
- Monitor latency & retries
- Tune concurrency & Redis
- Use autoscaling to survive traffic spikes
- Chain jobs for external API pagination to handle massive datasets gracefully
With these practices, you can confidently handle millions of jobs per day without breaking a sweat.
✍️ Your turn: what’s your worst Sidekiq horror story? Did you also accidentally email your entire user base at 3am?