Offloading tasks for background processing is a powerful technique in software engineering, and a good job queue abstracts this complex setup in a safe and maintainable way.
But as the number of jobs and the volume of data around the queue grow, we might wonder how much visibility we actually have into what is going on inside the queue workers.
- How many jobs ran last hour?
- What were the longest-running jobs during that period?
- When did that job stop processing?
- What's the job run rate (how many per hour), and what are its average, maximum, and minimum run times?
These questions tend to surface during performance incidents, system malfunctions, data loss, and the like, which is exactly when an informed dataset about system health proves most useful.
For the Laravel Queue there is the excellent Horizon dashboard, which is free and integrates readily with Redis queues. But if you need customized dashboards, or want to slice the data to answer more questions, then custom instrumentation comes in handy.
For in-house instrumentation, where we consume the data ourselves, I've been aiming to gather, initially:
- name of the procedure (the Job class name in this case)
- time spent processing
- procedure params
- caught exception description
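The idea behind gathering those four fields can be sketched in plain PHP, without any framework dependency. In this minimal sketch, `instrumentedRun` and `SendEmailJob` are illustrative names of my own, not Laravel APIs; a real Laravel setup would hook into queue events instead.

```php
<?php

// Minimal sketch: wrap a job's execution and build a structured payload
// with the job class name, its params, the time spent, and any caught
// exception description. (Illustrative names, not Laravel APIs.)
function instrumentedRun(object $job, array $params): array
{
    $payload = [
        'job' => get_class($job),   // name of the procedure
        'params' => $params,        // procedure params
        'exception' => null,        // caught exception description, if any
    ];

    $start = microtime(true);
    try {
        $job->handle(...array_values($params));
    } catch (Throwable $e) {
        $payload['exception'] = $e->getMessage();
    }
    // time spent processing, in milliseconds
    $payload['duration_ms'] = round((microtime(true) - $start) * 1000, 2);

    // In a real setup this would go to a logger feeding your ingestion pipeline.
    echo json_encode($payload), PHP_EOL;

    return $payload;
}

// A hypothetical job to exercise the wrapper.
class SendEmailJob
{
    public function handle(string $recipient): void
    {
        usleep(10_000); // simulate 10 ms of work
    }
}

instrumentedRun(new SendEmailJob(), ['recipient' => 'user@example.com']);
```

In Laravel itself, the same payload could be assembled inside the job's `handle` method or via the queue's event listeners, but the shape of the data stays the same.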
There is a tendency to log out as much information as possible. For optimal instrumentation, don't do that. Include in the instrumentation payload only what's most relevant to the system's operation.
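As a rough illustration of what a lean payload might look like (the field names here are my own, not a prescribed schema), a single log entry could carry just the essentials and nothing more:

```json
{
  "job": "App\\Jobs\\SendEmailJob",
  "duration_ms": 152.4,
  "params": {"order_id": 4821},
  "exception": null
}
```

Anything beyond this, such as full model dumps or request bodies, tends to inflate storage costs without answering the operational questions above.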
Instrumentation data only becomes really useful when paired with log ingestion and visualization tools. There are many options on the market, such as the open-source Elastic Stack (Logstash + Kibana) and the paid service LogDNA.
Over the past year, I've resolved many performance and data-loss incidents with a similar approach. Above I'm sharing a gist with sample Laravel Queue instrumentation.
How about your experience: have you ever instrumented system runtime? What tools and processes does your team use?