This post originally appeared on Arjun Rajkumar's blog. Arjun is a web developer based in Bangalore, India.
I'm building a web crawler using Sidekiq and Redis, and callbacks are perfect for keeping the jobs lean.
Callbacks were especially useful in this scenario: every time I start crawling a new website, I need to first crawl the home page, get all the links on the home page, then in turn crawl those links and find new links to crawl... and continue until there are no more links left to crawl.
Initially I did this using a do loop, and kept running that loop until all the pages were crawled. Each time a page was crawled, I set its crawled_marker to true, and kept adding new pages to the loop with crawled_marker set to false. The loop ran until there were no more pages with crawled_marker set to false.
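A minimal sketch of that loop (the Website/Page models, home_url, and the crawl_page_and_save_links helper here are illustrative, not the actual code):

def crawl_whole_site(website)
  # Seed the loop with the home page, marked as not yet crawled.
  website.pages.find_or_create_by!(url: website.home_url) { |p| p.crawled_marker = false }

  # Keep going while any page is still marked crawled_marker: false.
  while website.pages.where(crawled_marker: false).exists?
    website.pages.where(crawled_marker: false).find_each do |page|
      # Fetch the page and pull out its links (helper not shown here).
      crawl_page_and_save_links(page).each do |url|
        # New links enter the loop with crawled_marker set to false.
        website.pages.find_or_create_by!(url: url) { |p| p.crawled_marker = false }
      end
      page.update!(crawled_marker: true)
    end
  end
end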
On pushing this to Sidekiq, I needed to break the work down into smaller jobs: one main job to start the process, and one smaller job for each page. But the problem was that the do loop ran too fast and pushed duplicates of the same smaller job to the queue. I did add conditions to skip a page that was already crawled, but that didn't stop Sidekiq from adding the jobs to the queue in the first place.
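Roughly, the race looks like this (Crawler::CrawlPageWorker is a stand-in name): the loop keeps re-querying for crawled_marker: false, but the enqueued workers haven't run and flipped the marker yet, so the same pages match again and duplicate jobs get pushed.

while website.pages.where(crawled_marker: false).exists?
  website.pages.where(crawled_marker: false).find_each do |page|
    # crawled_marker only flips to true once a worker finishes, so this
    # same page can be enqueued again on the next pass of the loop.
    Crawler::CrawlPageWorker.perform_async(account.id, website.id, page.id)
  end
end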
This is where callbacks were really helpful. I removed the do loop completely and instead started using callbacks to check the crawled_marker status of the pages before deciding whether to kick off another round:
class Crawler::Callbacks
  def start_crawl_loop(status, options)
    account_id = options['account_id']
    website_id = options['website_id']
    process = Sidekiq::Batch.new(status.parent_bid)
    process.jobs do
      new_job = Sidekiq::Batch.new
      new_job.on(:success, 'Crawler::Callbacks#check_marked_status',
                 'account_id' => account_id, 'website_id' => website_id)
      new_job.jobs do
        Crawler::CrawlWebsiteGetLinksWorker.perform_async(account_id, website_id, status.parent_bid)
      end
    end
  end

  def check_marked_status(status, options)
    account = Account.find_by id: options['account_id']
    website = account.websites.find_by id: options['website_id']
    if account
      if website.pages.where(crawled_marker: false).count != 0
        Crawler::Callbacks.redo_loop_again(status, options)
      else
        Crawler::Callbacks.move_forward_loop_done(status, options)
      end
    end
  end
end
check_marked_status() checked whether there were any pages that had not yet been crawled. If there were, it restarted start_crawl_loop; otherwise, it moved on.
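redo_loop_again and move_forward_loop_done aren't shown above; roughly, the first just re-enters start_crawl_loop and the second kicks off whatever comes after crawling (Crawler::PostCrawlWorker below is a placeholder, not a real class):

class Crawler::Callbacks
  # Sketch only: re-enter the crawl loop for another round.
  def self.redo_loop_again(status, options)
    new.start_crawl_loop(status, options)
  end

  # Sketch only: every page is crawled, so move on to the next stage.
  def self.move_forward_loop_done(status, options)
    Crawler::PostCrawlWorker.perform_async(options['account_id'], options['website_id'])
  end
end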
I also broke the main website-crawling job into smaller parts (I think this can still be improved further):
class Crawler::CrawlWebsiteGetLinksWorker
  include Sidekiq::Worker
  sidekiq_options :queue => 'critical', :retry => false

  def perform(account_id, website_id, bid)
    account = Account.find_by id: account_id
    website = account.websites.find_by id: website_id
    if account
      # Crawl one uncrawled page per run; the batch's success callback decides
      # whether another round is needed.
      page = website.pages.where(crawled_marker: false).first
      one_page_batch = Sidekiq::Batch.new
      one_page_batch.on(:success, 'Crawler::Callbacks#check_marked_status',
                        'account_id' => account_id, 'website_id' => website_id)
      one_page_batch.jobs do
        Crawler::CrawlPageLoop.new(account_id, website_id, page.id, bid).crawl_each_page_and_add_links
      end
    end
  end
end
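I haven't included Crawler::CrawlPageLoop here. As a rough sketch (assuming open-uri and Nokogiri for fetching and parsing, and an assumed home_url attribute; the real class may differ), crawl_each_page_and_add_links fetches the page, saves any new on-site links as uncrawled pages, and flips the page's crawled_marker:

require 'open-uri'
require 'nokogiri'

class Crawler::CrawlPageLoop
  def initialize(account_id, website_id, page_id, bid)
    @account = Account.find(account_id)
    @website = @account.websites.find(website_id)
    @page    = @website.pages.find(page_id)
    @bid     = bid # batch id, unused in this sketch
  end

  def crawl_each_page_and_add_links
    html = URI.parse(@page.url).open.read
    Nokogiri::HTML(html).css('a[href]').each do |link|
      url = URI.join(@page.url, link['href']).to_s
      next unless url.start_with?(@website.home_url) # stay on the same site
      # New links join the queue as uncrawled pages.
      @website.pages.find_or_create_by!(url: url) { |p| p.crawled_marker = false }
    end
    @page.update!(crawled_marker: true)
  end
end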
Instead of using one_page_batch to ensure that each worker crawls just one page at a time, I did try calling another worker this way:
website.pages.where(crawled_marker: false).in_batches do |pages|
  pages.find_each(batch_size: 100) do |page|
    Crawler::CrawlEachPageGetLinksWorker.perform_async(account_id, website_id, page.id, status.parent_bid)
  end
end
This did seem like the better approach, but I'm getting some strange errors and need to look into it more.
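For what it's worth, one way to keep the same callback with this fan-out would be to wrap it in its own batch, so check_marked_status still fires only once the whole round has finished. A rough, untested sketch (the batch wiring here is an assumption, not working code):

round = Sidekiq::Batch.new
round.on(:success, 'Crawler::Callbacks#check_marked_status',
         'account_id' => account_id, 'website_id' => website_id)
round.jobs do
  # One job per uncrawled page; the callback fires after the whole round.
  website.pages.where(crawled_marker: false).find_each(batch_size: 100) do |page|
    Crawler::CrawlEachPageGetLinksWorker.perform_async(account_id, website_id, page.id, round.bid)
  end
end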
But on the whole, each Sidekiq job is now really small - I'm running many small jobs instead of one big one, with each job adding more jobs to the queue as needed.
If you have any suggestions on how I can improve this, do let me know.