Cynthia Coan

Posted on Sep 25, 2018

Showdev: Coworker - A New Job Queue for the JVM

#showdev #kotlin #java

If you've ever written a Ruby on Rails app, you'll know that a very common pattern is to have a job runner such as Delayed Job, Sidekiq, or a similar framework. However, you may be surprised that there aren't that many in the Java ecosystem! In fact, one of the only standard solutions was Jesque, until now.

Mythra / Coworker

A reasonably-good delayed job queue for the JVM.

ARCHIVAL NOTICE

Coworker was originally open-sourced from an internal project to showcase the ideas we were testing that worked so well for us. It's fallen significantly out of date, and because of when it was written, documentation on Kotlin Coroutines wasn't great. As a result, I've noticed a few bugs and inefficiencies. I plan to rewrite these at some point, but haven't had the time yet. If you have any questions, feel free to reach out and ask about how the SQL or code works (the SQL is the same as the internal version and can be trusted to be the most efficient).

Coworker

Coworker is a delayed work runner built for JVM-based languages, written in Kotlin. Coworker started off as an experiment in bringing coroutine ideas to background work, allowing you to work on something else if you're, say, waiting for an external system.

Specifically Coworker introduces…


Context

For those of you who haven't dealt with a job queue before, you may be curious why one is needed. What problems does it solve?

Imagine you're developing a web application, and you need to do something that can take a long time. For example, you make a call out to another application that may fail, you may want to retry it, and altogether it ends up taking a couple of minutes. You may start worrying about HTTP timeouts here, where your server takes too long to respond. If so, it might make sense to refactor that work out into a job, and then give the user a job ID they can use to check for completion status. That way you don't hit HTTP timeouts, and the calls still happen in the background. Some examples of these longer tasks that should be run in the background include:

sending newsletters to massive amounts of people
image resizing
batch importing
checking for spam

I'm sure you can imagine more for your specific application. However, if you go looking for job queues in Java you'll find there aren't many solutions! Many people recommend the following things from what I've seen:

Building your own - While this certainly works, it does mean you have to build and maintain a solution yourself.
Using a local thread pool - This also works, but it presents a problem if a web server goes down while it still has a queue of jobs it hasn't worked yet, leading to a loss of jobs. Not to mention it makes it hard to check the status of a job if you're behind a load balancer that distributes requests across a series of hosts.
Using a Job Scheduler - Some others recommend using a job scheduler like Quartz. While Quartz is great, using it as a job queue is specifically marked as an anti-pattern, and for good reason. Running hundreds of ad-hoc image resizing jobs at once with Quartz performs very slowly because Quartz simply isn't built for that.

Coworker to the Rescue

Coworker joins Jesque as an actual job queue for Java. However, Coworker has been built from the ground up to process real-life jobs more efficiently, by taking ideas forged in the fire of production from the Ruby library InstJobs and building on ideas from local execution like Fibers. With these features combined, we have the power.

That power is executing hundreds of jobs a second on a single box. Specifically, some of the elevator-pitch features of Coworker include:

Self-Healing: Stick your job nodes in an AutoScaling group, and don't worry about it! Jobs will automatically be rescheduled if an underlying node dies.
Failed Work Table: Allowing you to track the total number of failed jobs.
N-Strands: Now you can limit your jobs based on ad-hoc tags, so you can ensure that one really big customer doesn't starve your entire job queue.
Yielding: That's right, the famous keyword used all the time in Fibers/Coroutines is here for your jobs too. Now your jobs can yield to higher-priority work if it comes into the queue.

We bring you all these features (and more) without compromising the most important thing: being nice to your database. We've implemented multiple safeguards to make sure we take up as little room in your database as possible. We know your jobs might be sharing a database with your customer data, and we don't want to slow you down.

Performance Testing

If you look at the README for Coworker, one of the first things you'll see is our area around performance testing. Specifically, you'll see a graph that looks something like this:

In that graph, Coworker is the line labeled "1", and Jesque is the line labeled "0". Now you're probably thinking to yourself, "a performance test graph on GitHub, this is probably built to show Coworker in a light where it will pretty much always win". This is technically true, but only because Coworker is built to run real-life job scenarios much faster!

Specifically, this performance test is as follows:

Queue 200 jobs:
- 100 jobs just echo to standard out and complete immediately after. These are meant to map to "quick running jobs" or jobs that you queue that do little work, and then exit out. I'm sure you'll have some of these in your application.
- 100 jobs that make an HTTP call to google.com, and wait a minute before exiting. These are meant to represent long-running jobs, or jobs where you want to wait a bit for something to complete.
Count how many seconds elapse before all jobs are marked as completed.

Although this isn't a perfect representation (and we could never perfectly mirror your application), it is a much better representation of your actual workload than just queueing 100,000 jobs that exit immediately and seeing how long they take to complete, since that is not the common use case for jobs. (However, it should be noted that if your job queue does look like that, Jesque is much faster than Coworker in that kind of test. This is because Jesque uses Redis as its store, which can query for jobs much faster than SQL ever could.)

Notice that in this case Coworker completes in 70 seconds, compared to Jesque's nearly 600-second completion time! So why is Coworker so much faster here? Well, it's simple: yielding! Coworker has the ability to yield during the HTTP call job, whereas Jesque does not. So Jesque has all 10 worker threads blocked, waiting for these long jobs to complete. Meanwhile, Coworker can yield to other work, allowing more work to actually get done.
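A rough back-of-the-envelope check of those numbers can be written as a small Java sketch. It assumes the 10 worker threads mentioned above, ignores the 100 quick jobs and the HTTP round trip itself, and treats yielding as ideal (every wait overlaps). The class and method names are made up for illustration; this is not Coworker or Jesque code.

```java
final class YieldEstimate {
    // Lower bound when each waiting job keeps its thread blocked:
    // the long jobs run in "rounds" of workerThreads jobs at a time.
    static long blockingSeconds(int longJobs, int workerThreads, long waitSeconds) {
        long rounds = (longJobs + workerThreads - 1) / workerThreads; // ceiling division
        return rounds * waitSeconds;
    }

    // Idealized lower bound when every waiting job yields its thread,
    // so all of the waits overlap with each other.
    static long yieldingSeconds(int longJobs, long waitSeconds) {
        return longJobs == 0 ? 0 : waitSeconds;
    }

    public static void main(String[] args) {
        System.out.println(blockingSeconds(100, 10, 60)); // prints 600
        System.out.println(yieldingSeconds(100, 60));     // prints 60
    }
}
```

Under these assumptions the blocked pool needs 10 rounds of 60 seconds, which lines up with the roughly 600 seconds measured for Jesque, while the ideal yielding figure of 60 seconds sits just under the 70 seconds measured for Coworker; the gap is presumably scheduling and HTTP overhead.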

NStrands

The next really big thing that Coworker brings to the table is NStrands! NStrands, as mentioned above, are a really great way of ensuring your job pool doesn't get starved in various scenarios. If you've ever run a large app for a business, you'll know some clients are bigger than others. This can lead to a problem if you use a lot of delayed jobs! Specifically, your bigger customers can end up starving your work pool. By that I mean the big customers queue so many jobs that all the workers end up doing the work for that one client, while everyone else has to wait for their work to be completed.

The big customer probably won't notice they're taking up all your resources. After all, they're just using the system. However, your lower use customers certainly will notice as jobs take longer than before to complete.

This doesn't always have to be a customer thing, though. Perhaps you have one job that's very common and gets queued a lot. You could run into a similar problem with this job: there are so many instances of it to complete that they starve your workers, not allowing much other useful work to be done.

Using the client example above we could whip up an NStrand rule that looks like this: account:.* -> 5. Next what we'll do is, every time we queue a job we'll "Tag" it (or set its strand) to the value of: account:${account_id_that_queued_this_job}. So for example, if you belong to account 1, and you queue a job it'll be tagged: account:1; if you belong to account 2, it will be tagged: account:2, and so on.

Next, in our hypothetical scenario, let's say account "1" is our big client, and accounts "2", "3", and "4" are our small clients. Let's say we queue a job every time a user visits a page, to go update our analytics in the background. Since account "1" is so much bigger, we'll say it queues 10 jobs, and accounts "2", "3", and "4" queue 1 job each. We have one job box with 10 worker threads. How will Coworker schedule this work?

Coworker will actually start 8 jobs at once and leave 5 in the queue! Specifically, it will run 5 "account:1" jobs plus the one job each from accounts "2", "3", and "4", because you've set a rule that at most 5 jobs from any one account may run at a time. Then, once one of the "account:1" jobs finishes, Coworker will see that only 4 jobs are running for that strand and start the next "account:1" job. This way your big account didn't starve out your one job box.
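The scenario above can be replayed with a small Java sketch. This is only a model of the rule `account:.* -> 5` applied to an in-memory list; it is not how Coworker implements NStrands (Coworker does this work in SQL), and `StrandSketch` and `pickRunnable` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class StrandSketch {
    // Walks the queue in order and returns the strands of the jobs that may start,
    // honoring both the per-strand cap and the number of worker threads.
    static List<String> pickRunnable(List<String> queuedStrands, Pattern rule,
                                     int limitPerStrand, int workerThreads) {
        Map<String, Integer> runningPerStrand = new HashMap<>();
        List<String> started = new ArrayList<>();
        for (String strand : queuedStrands) {
            if (started.size() == workerThreads) {
                break; // every worker thread is busy
            }
            int running = runningPerStrand.getOrDefault(strand, 0);
            boolean limited = rule.matcher(strand).matches();
            if (limited && running >= limitPerStrand) {
                continue; // strand is at its cap, so this job stays queued
            }
            runningPerStrand.put(strand, running + 1);
            started.add(strand);
        }
        return started;
    }

    public static void main(String[] args) {
        List<String> queue = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            queue.add("account:1"); // the big client queues 10 jobs
        }
        queue.add("account:2");
        queue.add("account:3");
        queue.add("account:4");

        List<String> started = pickRunnable(queue, Pattern.compile("account:.*"), 5, 10);
        System.out.println(started.size());                 // prints 8
        System.out.println(queue.size() - started.size());  // prints 5
    }
}
```

Even though the big account's jobs sit at the front of the queue, only 5 of them start, which leaves room on the 10 worker threads for the three small accounts.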

LISTEN/NOTIFY

Lastly, Coworker takes advantage of lesser-used features of PostgreSQL in order to execute jobs faster (and with less load) than most SQL-based queues. Specifically, it uses LISTEN/NOTIFY, which lets us find out when new work has entered the queue without needing to poll for it. (It should be noted, however, that LISTEN/NOTIFY is completely ephemeral. We still need to scan the table occasionally for work that hasn't been locked and that we don't know about. However, we can do it a lot less often.)

However, we also worry about building up memory in the database! Although LISTEN/NOTIFY lives outside of a table, so checking for notifications that have arrived doesn't slow down queries, these notifications do build up in the database's memory over time. To fix this, we run a sidecar thread that continuously checks for notifications in order to get them out of the database's memory and onto the worker node, where we then push them into a channel. That way the delayed worker can check them whenever it gets to it, and our database doesn't just use all of its memory.
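The shape of that sidecar can be sketched in plain Java. This is an illustration of the pattern rather than Coworker's code: the `Supplier` is a hypothetical stand-in for whatever reads pending notifications off the PostgreSQL connection (with the PostgreSQL JDBC driver that would be something like `PGConnection.getNotifications()`), and a `BlockingQueue` stands in for the Kotlin channel.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

final class NotificationSidecar {
    // Hypothetical stand-in for reading pending notifications from the database.
    private final Supplier<List<String>> fetchPending;
    // Local buffer on the worker node; plays the role of the channel.
    private final BlockingQueue<String> channel = new LinkedBlockingQueue<>();

    NotificationSidecar(Supplier<List<String>> fetchPending) {
        this.fetchPending = fetchPending;
    }

    // Moves everything currently pending into the local channel
    // and reports how many notifications were moved.
    int drainOnce() {
        List<String> pending = fetchPending.get();
        channel.addAll(pending);
        return pending.size();
    }

    // Keeps draining on a daemon thread until that thread is interrupted.
    Thread start(long pauseMillis) {
        Thread sidecar = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                drainOnce();
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop the loop
                }
            }
        }, "notification-sidecar");
        sidecar.setDaemon(true);
        sidecar.start();
        return sidecar;
    }

    // Called by the worker whenever it gets to it; null means nothing new.
    String poll() {
        return channel.poll();
    }
}
```

The point of the split is that the draining loop never waits on the worker: notifications leave the database as soon as the sidecar sees them, and the worker consumes them from local memory at its own pace.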

And Much More...

Coworker has some other features that we haven't talked about, like being able to queue static functors to run in the background, not locking you into a specific web framework, and more. However, we've left those as an exercise for you, the reader. If you're building a web app in Kotlin/Java, and you've needed a delayed job runner, Coworker might be for you!

DEV Community