Imagine that your Rails application is running smoothly. Users are satisfied, payments are processed, emails are sent out, and reports are produced. Sidekiq feels like magic.
Then one Friday evening (because production issues always happen on Fridays), you check the Sidekiq dashboard 💥
Jobs are piling up faster than your interns can say, “Did we forget to add sidekiq_options retry: false again?”
Welcome to Sidekiq at scale—where background jobs stop being background and start being your entire life.
Here are some lessons I've learned from implementing Sidekiq at scale in production, along with metrics and real-world examples.
1. Keep Jobs Small and Fast
Jobs should be atomic and quick. A 5–10 second job is already too long in most production systems.
Bad example:

class ReportJob < ApplicationJob
  def perform(user_id)
    user = User.find(user_id)
    generate_pdf(user)   # CPU-heavy
    upload_to_s3(user)   # I/O-heavy
    send_email(user)     # External API
  end
end
If this fails halfway, the entire job retries.
✅ Better:
class GeneratePdfJob < ApplicationJob
  def perform(user_id)
    PdfGenerator.call(User.find(user_id))
  end
end

class UploadReportJob < ApplicationJob
  def perform(file_path)
    S3Uploader.call(file_path)
  end
end

class SendReportJob < ApplicationJob
  def perform(user_id, report_id)
    # report_email is your mailer action
    ReportMailer.with(user_id:, report_id:).report_email.deliver_now
  end
end
Chain jobs by enqueuing the next step from each job, or use Sidekiq Pro batches.
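The chaining idea itself needs no Sidekiq at all. Here's a minimal sketch where an in-memory array stands in for the Redis queue and each step enqueues its successor (the `QUEUE` constant, step classes, and fake paths are all illustrative, not Sidekiq API):

```ruby
# Stand-in for Sidekiq's queue: an array of [job_class, args] pairs.
QUEUE = []

class GeneratePdfStep
  def self.perform(user_id)
    path = "/tmp/report-#{user_id}.pdf"  # pretend we rendered a PDF
    QUEUE << [UploadReportStep, [path]]  # chain: enqueue the next step
    path
  end
end

class UploadReportStep
  def self.perform(file_path)
    url = "s3://reports#{file_path}"     # pretend we uploaded it
    QUEUE << [SendReportStep, [url]]     # chain: enqueue the final step
    url
  end
end

class SendReportStep
  def self.perform(url)
    "emailed link to #{url}"             # pretend we sent the email
  end
end

# A toy worker loop: pop jobs until the queue drains.
QUEUE << [GeneratePdfStep, [42]]
result = nil
until QUEUE.empty?
  job, args = QUEUE.shift
  result = job.perform(*args)
end
puts result  # => "emailed link to s3://reports/tmp/report-42.pdf"
```

If the upload step fails here, only that step retries; the PDF is never regenerated and the email is never sent twice.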
2. Idempotency is Non-Negotiable
Jobs will run at least once, sometimes more. If jobs aren’t idempotent, you’ll double-charge customers or send duplicate emails.
Instead of:
order.mark_as_paid!
send_receipt(order)
Do:

order.update!(status: "paid") unless order.paid?
ReceiptMailer.receipt(order).deliver_later unless order.receipt_sent?

(Note: ReceiptMailer.send would collide with Ruby's Object#send; call a named mailer action instead.)
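Guard clauses help, but a retry can still race a concurrent run of the same job. A stronger pattern is an idempotency key claimed atomically before doing the work. A minimal sketch, using a plain Ruby Set as a stand-in for Redis SETNX (the `PROCESSED` constant and `charge!` method are illustrative):

```ruby
require "set"

# Stand-in for Redis: in production, claim the key with SETNX plus a TTL.
PROCESSED = Set.new

def charge!(order_id)
  key = "charge:order:#{order_id}"
  # Set#add? returns nil if the key already exists — i.e. we already charged.
  return :already_done unless PROCESSED.add?(key)
  # ... actually charge the customer here ...
  :charged
end

charge!(1001)  # => :charged
charge!(1001)  # retry of the same job => :already_done
```

The claim happens before the side effect, so even an at-least-once delivery charges the customer exactly once.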
3. Use Queues Strategically
Don’t dump everything into default. Separate workloads:

- critical → user-facing (emails, webhooks, payments)
- default → normal jobs (notifications, syncs)
- low → heavy, non-urgent (reports, exports, ETL)
sidekiq.yml:
:queues:
  - [critical, 5]
  - [default, 3]
  - [low, 1]
4. Monitor Everything
Visibility is critical at scale:
- Sidekiq Web UI for queue depth & retries
Note that Sidekiq stores jobs in Redis, not your SQL database, so monitor through the Sidekiq API:

require "sidekiq/api"

Sidekiq::Queue.new("default").size     # jobs waiting in the queue
Sidekiq::Queue.new("default").latency  # seconds the oldest job has waited
Sidekiq::RetrySet.new.size             # jobs scheduled for retry
5. Control Concurrency
More threads ≠ better throughput. High concurrency can overwhelm Postgres or external APIs.
✅ Start with concurrency = 2–5x your CPU cores.
:concurrency: 15
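One gotcha worth calling out: every Sidekiq thread needs its own database connection, so the Rails connection pool must be at least as large as Sidekiq's concurrency or threads will stall waiting for connections. A sketch of the matching database.yml entry, assuming the concurrency of 15 above:

```yaml
# config/database.yml
production:
  # pool >= Sidekiq concurrency, or threads block on connection checkout
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 15) %>
```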
6. Redis is Your Heartbeat
Redis is the single point of truth for Sidekiq.
- Use dedicated Redis (not shared with cache/session store)
- Monitor memory & latency
7. Smart Retries Only
Default retries (up to 25) can flood your system. Customize them.
class MyJob
  include Sidekiq::Worker
  sidekiq_options retry: 3

  def perform
    ExternalService.call!
  rescue ExternalService::InvalidCredentials => e
    # Don't retry — bad credentials won't fix themselves.
    # Rescuing without re-raising marks the job as done, so Sidekiq won't retry it.
    Rails.logger.error("Credentials rejected: #{e.message}")
  end
end
8. Deploy with Process Separation
Don’t run one mega Sidekiq instance. Separate by queue type.
bundle exec sidekiq -q critical,5 -q default,2
bundle exec sidekiq -q low
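On Heroku-style platforms, those commands map naturally onto separate process types, which you can then scale independently. A sketch of a Procfile (the process names are illustrative):

```
worker_critical: bundle exec sidekiq -q critical,5 -q default,2
worker_low: bundle exec sidekiq -q low
```

Now a flood of low-priority exports can never starve payment webhooks of worker threads.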
9. Dead Jobs are Zombie Failures
Dead jobs = silent failures.
Inspect and clear them through the Sidekiq API (sidekiqctl only manages processes; it can't clean queues):

require "sidekiq/api"

dead = Sidekiq::DeadSet.new
dead.size                      # jobs that died after exhausting retries
dead.each { |job| job.retry }  # re-enqueue, or job.delete to discard
dead.clear                     # wipe the whole set
10. Autoscale or Suffer
In Kubernetes, ECS, or Heroku, autoscale workers on queue latency:
- Scale up if latency > 30s
- Scale down if latency < 5s
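The decision logic itself is simple enough to sketch. `desired_workers` below is a hypothetical helper you'd feed into your autoscaler, using the thresholds from the bullets above:

```ruby
# Returns the new worker count given current queue latency in seconds.
# Scale up above 30s of latency, down below 5s, otherwise hold steady.
def desired_workers(current:, latency:, min: 1, max: 20)
  if latency > 30
    [current + 1, max].min
  elsif latency < 5
    [current - 1, min].max
  else
    current
  end
end

desired_workers(current: 3, latency: 45)  # => 4 (queue is falling behind)
desired_workers(current: 3, latency: 2)   # => 2 (drain the idle workers)
desired_workers(current: 3, latency: 12)  # => 3 (within the comfort band)
```

Scaling on latency rather than raw queue depth is the key: a deep queue of millisecond jobs is fine, while a shallow queue that sits untouched for a minute is not.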
11. Chain Jobs for External API Pagination
Sometimes, the firehose comes from outside your app. A classic case: syncing a huge dataset from a third-party API that only gives you paginated results.
If you naively fetch all pages in one job, you’ll hit rate limits, blow up memory, and make retries a nightmare.
✅ Better: chain jobs page by page. Each job handles one page, then enqueues the next, until the API says you’re done. That way:

- Jobs stay tiny and idempotent.
- Failures retry gracefully without restarting the whole sync.
- You respect API rate limits with backoff and jitter.
- You can fan out each page’s items to dedicated jobs for massive throughput.
Example:
class SyncExternalPage
  include Sidekiq::Worker
  sidekiq_options queue: :sync, retry: 10

  def perform(cursor = nil)
    response = ExternalClient.fetch_page(cursor: cursor)

    response[:items].each do |raw|
      UpsertExternalItem.perform_async(raw) # idempotent per item
    end

    if response[:next_cursor]
      # Chain the next page, throttled by a 1-second delay
      self.class.perform_in(1, response[:next_cursor])
    else
      Rails.logger.info("Sync complete ✅")
    end
  end
end
class UpsertExternalItem
  include Sidekiq::Worker

  def perform(raw)
    ExternalRecord.upsert(
      { external_id: raw["id"], name: raw["name"] },
      unique_by: :index_external_records_on_external_id
    )
  end
end
This pattern lets you chew through millions of external records without drowning your system in retries or hitting API bans.
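The example above waits a fixed second between pages; the "backoff and jitter" mentioned earlier can be sketched as a small helper (hypothetical, not part of Sidekiq) whose result you'd pass to perform_in when the API starts returning 429s:

```ruby
# Exponential backoff with full jitter: the ceiling doubles per attempt,
# and randomization spreads retries out so they don't stampede the API.
def backoff_with_jitter(attempt, base: 1, cap: 300)
  ceiling = [base * (2**attempt), cap].min
  rand(0.0..ceiling.to_f)
end

delay = backoff_with_jitter(3)  # somewhere in 0..8 seconds
# SyncExternalPage.perform_in(delay, next_cursor)
```

Full jitter (random between zero and the exponential ceiling) beats a fixed exponential delay here, because a hundred retrying page jobs won't all hit the API in the same instant.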
Conclusion
Sidekiq is incredibly robust, but at scale, you need production-grade practices:
- Keep jobs small, fast, idempotent
- Prioritize with queues
- Monitor latency & retries
- Tune concurrency & Redis
- Use autoscaling to survive traffic spikes
- Chain jobs for external API pagination to handle massive datasets gracefully
With these practices, you can confidently handle millions of jobs per day without breaking a sweat.
✍️ Your turn: what’s your worst Sidekiq horror story? Did you also accidentally email your entire user base at 3am?