One of our applications has a feature where a user can toggle an option which sends an email out to notify someone else. The email is delayed by 30 minutes so that the user can change their mind (or realise they toggled the wrong thing in the first place...). All well and good.
A while ago there was a small change to make to the code and email template, so we pushed out a new version. That push is via deployer which does 'zero downtime' deploys by symlinking a release directory to a 'current' one which the app is actually served from. It keeps the previous five releases by default so you can roll back if needed. All well and good.
The app uses a background queue to send out this particular email - the job is delayed by 30 minutes and before it finally sends it checks if the toggle is still in the 'yes - do that' mode. When we deployed the mentioned change we manually restarted the whole queue just to make sure it was in sync with the new code. Again, all was well and good - everything was working fine. We pushed out another handful of small updates over the course of the day.
Six weeks later...
We had an email from one of the office administrators saying they didn't think emails were going out from the system any more. We'd had a fairly brutal power-cut just a week or so before and my first thought was 'oh, damn - something hasn't restarted properly'. Although there hadn't been any errors caught by Sentry - our log monitor hadn't spotted anything either.
Checking the queue status showed that it was marked as 'paused'. I'd had that before where two apps were sharing a Redis instance and they got confused about which jobs they should be processing - sure enough there was another app on the server and it was pointing at the same Redis db. So we switched the db value up one on the problematic app and restarted the queue - which were now showing as healthy and a test message went out ok. "Phew!" I thought - nice easy fix and made a mental note to finally get round to making sure all the apps had a designated db so they never conflicted.
Then I started to think - 'hang on, that other app has Redis configured - but it doesn't actually use the queues. It was just a default we've got into the habit of including as "we'll probably use Redis at some point, include the config by default" kinda deal. So did a little more digging...
Checking the apps own error log showed nothing of interest. So checked the syslog on the server itself and saw lots of 'Could not open file /path/to/release/65/' messages from the queue worker before I had restarted them. Which was... odd. Especially as our log monitor hadn't picked it up.
The queue workers are started via systemd and point to the
/path/to/current/ so that they should be using the latest release when they are run. But un-noticed by me, they actually see the symlink and resolve it to the 'real' directory so had gone to '/path/to/release/65/' which was where 'current' was pointing at the time.
Since the last manual 'restart the whole queue process' there had been five other small changes pushed out - each of those should have restarted the queue workers so it shouldn't have been a problem. But, again, un-noticed by me - the command used to restart the queues doesn't restart the master process - just tells the master process to restart the workers based on it's own idea of the path. So after the fifth push the master process was now pointing at a directory which the
deployer command had removed as being the sixth-oldest release...
Sentry hadn't picked it up as it was never able to load enough of the code to register the handler. It also turned out that when the server had been built the person hadn't run Puppet on it so it hadn't installed the log forwarding/monitor setup. And the systemd process handler thought all was well as the master process continued to run quite happily as it had loaded all of it's code into RAM.
So all in all - the system hadn't sent out a copy of the email for almost six weeks. Thankfully it was a very quiet time of year for that particular feature so only a handful of these notifications had gone awol. But it was an interesting morning tracing back the errors, why all the layers of monitoring hadn't noticed, how the symlinking caught us out and piecing together the history of the server itself.
And now in addition to the existing monitoring, I've taken a bit of the code from the front-end that showed us that the queues were 'paused' to begin with and used it to make a little cron job that does alert Sentry if all isn't well. Probably... ;-)