Ahsan Nabi Dar

Background Job Processing in Ruby without external libraries and dependencies

Originally posted at https://darnahsan.medium.com/background-job-processing-in-ruby-without-external-libraries-and-dependencies-ccdfe1dc5855 on June 5, 2020

Ruby gets hammered a lot for its green threads and lack of real concurrency, yet Ruby libraries such as Resque, Delayed Job and Sidekiq are some of the most popular choices for running background jobs in the industry. When you have a huge project with millions of requests and hundreds of thousands of operations to perform, reach for one of these solutions: they are feature complete, and a job processor is a complex piece of software that takes thousands of human hours to implement, so there is no need to reinvent the wheel. But while they help you scale, sometimes they are not what you are looking for or need. They are meant for a scale of tens of thousands of jobs per second, while what you need may be background processing for a few jobs that can be handled in memory, without a dependency on a queue such as Redis. If you look at the benchmark numbers from Sidekiq, they are for 100K jobs, because that is the scale it is meant for, though one can also use it for running a few hundred complex jobs where you require queue management and supervision.

I am sure you get the point, so without further ado let’s dive into what to do if you don’t want to use one of these external dependencies. I wrote this background job processing code (naive and basic) close to 4 years ago, and it has been running in production ever since without a glitch, fulfilling its use case. It lives in a task server, where a failed task simply has to be restarted, so we didn’t need complex queue management; and since it is a rake task, adding Redis as a dependency just to run the part of the task that required background job support was overkill. The task was written for processing terabytes of data for Trustyou Metareviews, which you can read about here.

So let’s talk about the use case and how I went about solving it. We were required to download 16 sets of data, with each data set having 16 files (it’s a hexadecimal dataset). These datasets are stored in AWS S3, so we made the tasks distributed: one task per dataset, with each task running through its own dataset of 16 files. Downloading each file in sequence didn’t make any sense, since AWS S3 allows you to download multiple files at once; when the files are large enough, downloading them in parallel makes the total time equal to the slowest single download, whereas downloading them sequentially takes the aggregate of all of them. (This is now doable with the recursive flag on many tools.) Yet this background job implementation isn’t about S3 downloads, so don’t worry: you will learn a thing or two about implementing your own queue for any long-running tasks, as even downloading with a single command in the background requires some job management and offloading of that process ;)

To download the files I used s3cmd. It’s an amazing tool, and it handles writing the downloads to files on its own, compared to using the AWS Ruby gem and writing the download from memory to file yourself. Enough talk, let’s look at the code and then go through it.
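The snippet below is a minimal sketch of that approach rather than the original code: Kernel#system stands in for Fury, and the bucket path and file names are illustrative.

# Sketch: download a 16-file hexadecimal dataset in parallel.
# system stands in for Fury; the S3 paths are made up for illustration.
files = (0..15).map { |i| format("%x.gz", i) }

jobs = files.map do |file|
  pid = Process.fork do
    # Each download runs in its own subprocess.
    system("s3cmd get s3://my-bucket/dataset/#{file} /tmp/#{file}")
  end
  # Detach the child so it is reaped on exit and never lingers as a zombie.
  Process.detach(pid)
  pid
end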

That is all the code you need to run and monitor your running jobs. First some self-promotion: if you haven’t noticed, it uses Fury to run shell commands for the jobs. Beyond that, all that is needed is Ruby’s Process module. Process has a method called fork that creates a subprocess and can take a block to run; the subprocess terminates with a status. The secret sauce is the detach method, which takes the PID that fork returns and makes sure the process doesn’t keep hanging around as a zombie if our script were to fail during execution. It brings in some resource safety. Once you have started all your background jobs, you will want to monitor their progress, be it completed or failed. This is where all the PIDs come in handy, with some shell magic:

Fury.run_now("ps ho pid,state -p #{job}")

Because we have detached the processes, once each job finishes it doesn’t hang around as alive; it gracefully terminates and its PID is released.
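Putting it together, a polling loop along these lines (a sketch, assuming the jobs array of PIDs from the earlier snippet and backticks in place of Fury) can wait until every job has finished:

# A detached PID disappears from ps output once its job exits,
# so an empty result means that job is done.
running = jobs.dup
until running.empty?
  running.reject! { |pid| `ps ho pid,state -p #{pid}`.strip.empty? }
  sleep 1
end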

That is all you need; it’s really easy to implement an in-memory queue and monitor your jobs. These are the basic building blocks for creating a background job queue. It comes in handy when you want to run a few hundred tasks on your OS and do the work in phases, such as resizing local images and then uploading them: pass in a list of images to resize in the background, and once that finishes, give it the list of images to upload in parallel, and you have a background job queue with job completion in place. You can even use it to run multiple rake tasks, monitor them, and build on their execution. The possibilities are countless once you understand the basics of how to handle jobs in the background and how to monitor them. A sketch of that phased pattern follows.
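Here is a hedged sketch of that phased pattern; the run_batch helper and the convert/s3cmd commands are my own illustration, not code from the original task.

# Illustrative helper: fork, detach and poll one batch of commands at a time.
def run_batch(commands)
  pids = commands.map do |cmd|
    pid = Process.fork { system(cmd) }
    Process.detach(pid)
    pid
  end
  # Block until every PID has vanished from the process table.
  sleep 1 until pids.all? { |pid| `ps ho pid,state -p #{pid}`.strip.empty? }
end

images = Dir.glob("images/*.jpg")
run_batch(images.map { |img| "convert #{img} -resize 50% resized/#{File.basename(img)}" })
run_batch(Dir.glob("resized/*.jpg").map { |img| "s3cmd put #{img} s3://my-bucket/resized/" })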

I hope this helped you pick up some pointers, just as implementing it long ago did for me.
