During university I took on freelance jobs on PeoplePerHour to pay rent. I had a client for whom I developed a piece of software. This software, now referred to as 'the script', scrapes a series of websites looking for new job adverts and regexes out the good stuff. Once the script completes, I hit the MailGun API with a zipped archive of CSVs. Run-of-the-mill, nothing-complex, easyjob10minutes.
Nothing's ever easy though. Running a ~20min Python script (concurrent requests) and manually making the ZIP was one thing, so automating that process shouldn't be a problem, right? Wrong. Beyond the fact that the sites naturally change the markup of their job adverts, which requires rewriting the script each time, I just could not figure out why cron wasn't reliable. There's a .sh to kick off the script and wait for completion... technically, each site is a different Python script, written in a micro-framework style to speed up development of any future scraping work that might come my way. Broadly it looks something like:
python3 scrapers/job1.py && python3 scrapers/job2.py && ...
zip -r results.zip results/
Not elegant, not clever, sequential - but this runs once a week, and a failure in one step fails the whole thing and alerts me, at which point I can debug and fix. Each job came to me sequentially, so that structure made sense when it was one, and two, and I've just never refactored it. If it ain't broke. Recently I interviewed at a bank that gave me pretty much the same assignment as a code challenge, and my proposed solution was a little tighter, to say the least.
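For what it's worth, a tighter version of the same wrapper might look something like this - a sketch only, with the scraper layout and filenames assumed from the description above:

```shell
# sketch of a tighter thescript.sh: loop instead of an && chain,
# fail fast, and say which step died (layout and paths illustrative)
run_scrapers() {
    for f in scrapers/*.py; do
        echo "running $f"
        python3 "$f" || { echo "FAILED: $f" >&2; return 1; }
    done
    zip -qr results.zip results/
}
```

The real script would just call `run_scrapers` at the end; adding a scraper is then a new file in `scrapers/` rather than another clause in the chain.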
In my mind, running crontab -e and adding the following should mean that I can sip back with a cup of tea:
30 23 * * 7 cd /root && sh thescript.sh
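For reference - since I get the field order wrong more often than I'd like - a user crontab line takes five time fields before the command, and day-of-week 7 is Sunday:

```
# m  h   dom  mon  dow   command
30   23  *    *    7     cd /root && sh thescript.sh
```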
Alas, upon :wq and a service cron restart, the BCC'd email did not greet me first thing Monday morning. Why, crond? Have I not satisfactorily commanded you? No but really, what the hell. That script is being run, right? I'm just going to confirm locally that I understand the core concepts here.
* * * * * echo "hi" > /var/log/thescript
:wq, service cron restart, wait a minute...
cat /var/log/thescript
cat: /var/log/thescript: No such file or directory
echo "sanity" && service cron status
● cron.service - Regular background program processing daemon
Loaded: loaded (/lib/systemd/system/cron.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-01-03 16:27:59 GMT; 1h 43min ago
Still nothing in that log. Cron doesn't run? The crontab is active, loaded, and the daemon is started and happy. Why am I not being greeted? It's great that this is locally reproducible, but still, I swear this is how it works, right? At this point we're not even running scripts, there's no environment necessary, and init is root so it has write perms on /var/log/*, what the hell is going on.
echo "hi" > /var/log/thescript
bash: /var/log/thescript: Permission denied
Yep. Nice one psedge, you've been using Linux as a daily driver for what, 8 years? cron can write to /tmp, owned by root:root, but not to /var/log/, owned by root:syslog. Then again, if it were init running the cronjob as root, it should indeed be able to write there.
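The directory modes tell the story - on a stock Ubuntu box they look roughly like this (modes from memory, so double-check on your own machine):

```shell
# /tmp is world-writable with the sticky bit; /var/log is not:
ls -ld /tmp /var/log
# typically drwxrwxrwt root root   /tmp
#       and drwxrwxr-x root syslog /var/log
```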
* * * * * whoami > /tmp/who
watch -n0.1 cat /tmp/who
Oh. Each user gets their own crontab, which generates a user-specific file in /var/spool/cron/crontabs - and the cronjob gets executed as that user. I feel like cron should have some log though, right? I remember something like /var/log/cron.log; I'm not insane. If echo (as peter) exited with a non-zero exit code, why didn't /var/log/cron.log get created?
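Incidentally, that per-user behaviour is exactly why the system-wide /etc/crontab carries an extra user column that crontab -e files lack - on Debian/Ubuntu, putting the job there (or in root's own crontab via sudo crontab -e) would have run it as root:

```
# /etc/crontab format: m h dom mon dow USER command
30 23 * * 7 root cd /root && sh thescript.sh
```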
/var/log/cron is indeed the default cron log path... on CentOS. Which is incidentally where I've done most of my sysadmin tasks, just because that was what was used by the companies I'd worked at. But for personal tasks I choose Ubuntu, especially now that they have their Minimal image for container contexts. In Ubuntu, cronjob executions get logged to /var/log/syslog by default.
grep -E "cron|CRON" /var/log/syslog
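Each execution shows up as a line like the following (hostname, PID and timestamp illustrative):

```
Jan  5 23:30:01 myhost CRON[2317]: (peter) CMD (cd /root && sh thescript.sh)
```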
No exit codes, so those will have to be managed by the script being executed. The lesson I'm taking from this is also to wrap all the logic in the shellscript being executed, rather than chaining multiple commands in the cronjob itself.
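In that spirit, one option is a small helper inside the shellscript that records each step's exit status itself - a sketch, with the log path and naming my own invention:

```shell
# hypothetical helper for thescript.sh: run a step, log its exit code,
# and pass the failure on so the whole job still aborts
run_step() {
    "$@" >> /tmp/thescript.log 2>&1
    status=$?
    echo "$(date -u) step '$*' exited $status" >> /tmp/thescript.log
    return $status
}
```

The cron entry then stays a single `sh thescript.sh`, and everything worth knowing ends up in one log.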
This is kind of what I enjoy about Linux in general: you can use it daily for a set of tasks and still find core gaps in your knowledge when you need to get something like this done. Although basic, it's good to go back and solidify your understanding, rather than only knowing the likes of AWS CloudWatch Events or GCP Cloud Functions cron.
update: since writing this I've had one or two more infuriating crond experiences and tend to feel like I'd want a different tool. Maybe one with better logging, or some kind of dry-run output. I'm not the only one, it seems 1