DEV Community

[Comment from a deleted post]
 
CHADDA Chakib

Hi!

Thank you for your post.

I'd advise this approach to make your system more resilient and prevent duplicated work:

  1. Each URL should have a state enum [waiting, processing, terminated]

  2. The worker should read and update the state inside a lock; the lock prevents two or more workers from taking the same job when they all see it as available at the same time.
    In Rails you can do it like this:

 # Claim the page: re-check and update the state while holding the lock
 page = website.pages.where(state: :waiting).first
 page.with_lock do
   return unless page.waiting?
   page.processing!
 end
 # process the page ...
 page.terminated!

  3. Have a timeout on the processing state: if a URL stays in processing for too long, the worker or the server probably crashed unexpectedly, so a watcher should reset the task back to waiting.
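To see why step 2 works, here is a minimal plain-Ruby sketch (no Rails) of the claim-inside-a-lock pattern: the state is re-checked and updated while the lock is held, so only one of several concurrent workers can win the claim. The `Page` class, its states, and `claim!` are illustrative stand-ins for the ActiveRecord model, not real API.

```ruby
# In-memory stand-in for the ActiveRecord page; a Mutex plays the
# role of the row lock taken by with_lock.
class Page
  attr_reader :state

  def initialize
    @state = :waiting
    @lock = Mutex.new
  end

  # Returns true only for the single worker that wins the claim:
  # the state is re-checked and updated while the lock is held.
  def claim!
    @lock.synchronize do
      return false unless @state == :waiting
      @state = :processing
      true
    end
  end
end

page = Page.new
claims = []
threads = 4.times.map do
  Thread.new { claims << page.claim! }
end
threads.each(&:join)

# Exactly one worker claimed the page, no matter the thread ordering.
puts claims.count(true)  # => 1
```

Without the lock, two workers could both read `:waiting` before either writes `:processing`, and both would process the same page.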
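The watcher in step 3 could look like the following plain-Ruby sketch (no Rails): any page stuck in `:processing` longer than a timeout is assumed to belong to a crashed worker and is reset to `:waiting`. The names (`reset_stuck_pages`, `processing_started_at`, the 10-minute timeout) are illustrative assumptions, not part of the original post.

```ruby
# Assumed timeout: pages in :processing longer than this are considered stuck.
TIMEOUT = 10 * 60 # seconds

# Lightweight stand-in for the ActiveRecord page record.
Page = Struct.new(:state, :processing_started_at)

# Reset every page that has been :processing past the timeout back to :waiting.
def reset_stuck_pages(pages, now: Time.now)
  pages.each do |page|
    next unless page.state == :processing
    next if now - page.processing_started_at < TIMEOUT
    page.state = :waiting
    page.processing_started_at = nil
  end
end

pages = [
  Page.new(:processing, Time.now - 3600), # stuck for an hour -> reset
  Page.new(:processing, Time.now - 60),   # still within the timeout
  Page.new(:terminated, nil)
]
reset_stuck_pages(pages)
puts pages.map(&:state).inspect  # => [:waiting, :processing, :terminated]
```

In a real Rails app this would be a periodic job running the equivalent `where` query; the key design point is that a reset page simply re-enters the waiting pool and gets claimed again via the lock in step 2.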
 
Arjun Rajkumar

Thanks 🙏 This is brilliant!