We have a RabbitMQ service which goes down from time to time.
So we need to:
- restart it if it exits with a failure
- send an email notification
Let’s do it via RabbitMQ’s systemd service (there are other options, e.g. monit; see the Monit: monitoring and restarting NGINX post).
We will use two options here:
- RestartSec=: the delay before a restart, to give any pending disk I/O operations a chance to finish, just in case
- Restart=: the condition under which the service is restarted
Available conditions for Restart= are:
Table 2. Exit causes and the effect of the Restart= settings on them
| Restart settings/Exit causes | no | always | on-success | on-failure | on-abnormal | on-abort | on-watchdog |
|---|---|---|---|---|---|---|---|
| Clean exit code or signal | | X | X | | | | |
| Unclean exit code | | X | | X | | | |
| Unclean signal | | X | | X | X | X | |
| Timeout | | X | | X | X | | |
| Watchdog | | X | | X | X | | X |
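The table can also be read as a lookup: given a Restart= setting and the way the process ended, does systemd restart the unit? Here is a minimal shell sketch of that lookup. The function name would_restart and the cause labels (clean, unclean-exit, unclean-signal, timeout, watchdog) are made up for this illustration; they are not systemd identifiers.

```shell
# would_restart SETTING CAUSE
# Exit status 0 if the table says the service is restarted, 1 otherwise.
# CAUSE is one of these hypothetical labels for the table rows:
#   clean, unclean-exit, unclean-signal, timeout, watchdog
would_restart() {
    case "$1:$2" in
        always:clean|always:unclean-exit|always:unclean-signal|always:timeout|always:watchdog) return 0 ;;
        on-success:clean) return 0 ;;
        on-failure:unclean-exit|on-failure:unclean-signal|on-failure:timeout|on-failure:watchdog) return 0 ;;
        on-abnormal:unclean-signal|on-abnormal:timeout|on-abnormal:watchdog) return 0 ;;
        on-abort:unclean-signal) return 0 ;;
        on-watchdog:watchdog) return 0 ;;
        *) return 1 ;;   # Restart=no and every empty cell of the table
    esac
}

# A process killed with SIGKILL ends with an unclean signal,
# so Restart=on-failure covers that case.
if would_restart on-failure unclean-signal; then
    echo "restart"
else
    echo "no restart"
fi
```

A clean stop, on the other hand, is not restarted with on-failure, which is the behavior we want for a service that is sometimes stopped on purpose.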
Editing systemd unit files
The default RabbitMQ unit file is /lib/systemd/system/rabbitmq-server.service.
You can view it with systemctl cat:
$ admin@bttrm-production-console:~$ systemctl cat rabbitmq-server.service
/lib/systemd/system/rabbitmq-server.service
[Unit]
Description=RabbitMQ Messaging Server
After=network.target
[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop
[Install]
WantedBy=multi-user.target
Do not edit it, or any other file in /lib/systemd/system/, directly: it will be overwritten by the next upgrade of the rabbitmq-server package.
When you need to change a service’s default behavior, put your new files in the /etc/systemd/system directory.
To edit an existing service, use systemctl edit foo.service with the --full option:
# root@bttrm-dev-console:/home/admin# systemctl edit --full rabbitmq-server.service
This opens a temporary file like /etc/systemd/system/rabbitmq-server.service.d/.#override.conf6a0bfbaa5ed8b8d8, pre-filled with the current content of /lib/systemd/system/rabbitmq-server.service, which you can then update.
Restart on failure
Add both options here, Restart=on-failure and RestartSec=60s:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target
[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop
Restart=on-failure
RestartSec=60s
[Install]
WantedBy=multi-user.target
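As an alternative to copying the whole unit with --full, systemd also reads drop-in files from a rabbitmq-server.service.d directory, so only the two new options need to be stored. The sketch below writes such a drop-in; the function name write_restart_dropin is hypothetical, and the directory is passed as an argument so the function can be tried on a scratch directory first.

```shell
# write_restart_dropin DIR
# Creates DIR/override.conf containing only the restart settings.
# (Function name is an illustration, not part of systemd.)
write_restart_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=60s
EOF
}

# On a real host, as root, this would be followed by a reload:
#   write_restart_dropin /etc/systemd/system/rabbitmq-server.service.d
#   systemctl daemon-reload
```

A drop-in keeps the packaged unit in effect for everything else, so later package updates to ExecStart= and friends are still picked up.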
Reload systemd’s configuration:
# root@bttrm-dev-console:/home/admin# systemctl daemon-reload
systemctl edit saves the new content as /etc/systemd/system/rabbitmq-server.service, which takes precedence over the packaged unit in /lib/systemd/system/.
Now get the RabbitMQ server’s PID:
# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service | grep PID
Main PID: 14668 (rabbitmq-server)
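To avoid copying the PID by hand, the number can be cut out of the status output. This is a small sketch; the function name main_pid_from_status is made up here, and it only relies on the "Main PID:" line having the shape shown above.

```shell
# Reads `systemctl status` output on stdin and prints the number
# that follows "Main PID:" (first match only).
main_pid_from_status() {
    awk '/Main PID:/ { print $3; exit }'
}

# Demonstration on the line from the output above:
printf '%s\n' ' Main PID: 14668 (rabbitmq-server)' | main_pid_from_status

# On a real host:
#   systemctl status rabbitmq-server.service | main_pid_from_status
```

On systemd versions that support the --value flag, systemctl show --property MainPID --value rabbitmq-server.service should print the same number without any parsing.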
Kill it with SIGKILL (see the Linux & FreeBSD: the kill and nohup commands, signals and process management post) so that the on-failure condition is triggered:
# root@bttrm-dev-console:/home/admin# kill -9 14668
Check its status now:
# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: signal) since Thu 2019-02-28 12:08:32 EET; 4s ago
Process: 7093 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
Main PID: 14668 (code=killed, signal=KILL)
Logs:
...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Mar 01 13:26:00 bttrm-dev-console rabbitmq[27392]: Stopping and halting node 'rabbit@bttrm-dev-console'
...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
...
And after one minute:
# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (start-post) since Thu 2019-02-28 12:09:33 EET; 2s ago
...
Feb 28 12:09:33 bttrm-stage-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Starting RabbitMQ Messaging Server
...
Logs again:
Mar 01 13:27:01 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server
...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: Waiting for 'rabbit@bttrm-dev-console'
...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: pid is 27533 ...
Mar 01 13:27:04 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
...
The “Service hold-off time over, scheduling restart” line marks the end of our 60-second delay.
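The delay can be checked against the log timestamps: the main process exited at 13:26:00 and the hold-off ended at 13:27:01. A minimal sketch of that arithmetic follows; the helper names to_seconds and seconds_between are made up, and both timestamps are assumed to fall on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# seconds_between START END (same day only)
seconds_between() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# "Main process exited" vs "Service hold-off time over" from the logs above
seconds_between 13:26:00 13:27:01   # prints 61
```

The extra second over RestartSec=60s is within the one-second resolution of these log timestamps.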
Email notification
Now let’s add an email notification to be sent when RabbitMQ goes down with an error.
Send a test email first:
# root@bttrm-dev-console:/home/admin# echo "Stage RabbitMQ restarted on failure!" | mailx -s "RabbitMQ failure notice" admin@example.com
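Now you can use either ExecStopPost= or OnFailure=. OnFailure= looks better, since it fires only when the unit enters the failed state, while ExecStopPost= runs after every stop, clean or not. Let’s use it.
Create the /etc/systemd/system/rabbitmq-notify-email@.service file: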
[Unit]
Description=%i failure email notification
[Service]
Type=oneshot
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" admin@example.com'
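The ExecStart= one-liner can be tried outside systemd before wiring it into the unit. Below is a sketch of the same pipeline as a shell function. The names notify_failure, STATUS_CMD and MAIL_CMD are hypothetical; the two variables exist only so the commands can be swapped for harmless stand-ins during a dry run.

```shell
# notify_failure UNIT RECIPIENT
# Pipes the unit's status into a mail command, like the ExecStart= line above.
# STATUS_CMD and MAIL_CMD (hypothetical overrides) default to the real tools.
notify_failure() {
    unit="$1"
    recipient="$2"
    # Left unquoted on purpose, so that "systemctl status" splits into two words.
    ${STATUS_CMD:-systemctl status} "$unit" 2>&1 |
        ${MAIL_CMD:-mailx} -s "[$unit] failure notification" "$recipient"
}

# Dry run that prints instead of sending:
#   STATUS_CMD="echo status-of"; MAIL_CMD=cat
#   notify_failure rabbitmq-server.service admin@example.com
```

Keeping the logic in a function (or a small script) also makes it easier to add options to the mail command later without fighting the quoting inside ExecStart=.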
Add the OnFailure option to the [Unit] block of rabbitmq-server.service using systemctl edit:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target
OnFailure=rabbitmq-notify-email@%i.service
...
Do not forget to reload systemd’s configuration:
# root@bttrm-dev-console:/home/admin# systemctl daemon-reload
Kill RabbitMQ again:
# root@bttrm-dev-console:/home/admin# kill -9 29970
Check logs:
...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Feb 28 13:55:33 bttrm-dev-console rabbitmq[30476]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Triggering OnFailure= dependencies.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting rabbitmq-server failure email notification...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Started rabbitmq-server failure email notification.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server
...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: Waiting for 'rabbit@bttrm-dev-console'
...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: pid is 30625 ...
Feb 28 13:55:37 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
...
Note these two lines:
- Triggering OnFailure= dependencies.
- Started rabbitmq-server failure email notification.
Okay, everything works.
Mail logs:
# root@bttrm-dev-console:/home/admin# tail /var/log/exim4/mainlog
2019-02-28 13:48:58 1gzK7S-0007Td-Bt H=alt2.aspmx.l.google.com [2a00:1450:400b:c01::1b] Network is unreachable
2019-02-28 13:51:09 1gzK7S-0007Td-Bt H=alt1.aspmx.l.google.com [172.217.192.27] Connection timed out
2019-02-28 13:51:42 1gzK7S-0007Td-Bt => admin@example.com R=dnslookup T=remote_smtp H=alt2.aspmx.l.google.com [74.125.193.27] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1551354702 x34si4667116edb.147 - gsmtp"
2019-02-28 13:51:42 1gzK7S-0007Td-Bt Completed
2019-02-28 13:53:53 1gzK16-0006pp-NU H=alt2.aspmx.l.google.com [74.125.193.27] Connection timed out
2019-02-28 13:53:53 1gzK16-0006pp-NU H=aspmx2.googlemail.com [2800:3f0:4003:c02::1a] Network is unreachable
2019-02-28 13:54:59 1gzK16-0006pp-NU => admin@example.com R=dnslookup T=remote_smtp H=aspmx3.googlemail.com [74.125.193.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1551354899 s45si1200185edm.357 - gsmtp"
2019-02-28 13:54:59 1gzK16-0006pp-NU Completed
2019-02-28 13:54:59 End queue run: pid=29201
2019-02-28 13:55:33 1gzKHl-0007xl-Lm <= root@dev.backend-console-internal.example.com U=root P=local S=1331
If you didn’t get the email, check exim’s queue:
# root@bttrm-dev-console:/home/admin# exim -bp
0m 1.2K 1gzL3R-0000dn-5h <root@dev.backend-console-internal.example.com>
admin@example.com
The message is stuck in the queue.
Run the queue manually:
# root@bttrm-dev-console:/home/admin# runq
Check logs again:
# root@bttrm-dev-console:/home/admin# cat /var/log/exim4/mainlog | grep 1gzL3R-0000dn-5h
2019-02-28 14:44:49 1gzL3R-0000dn-5h <= root@dev.backend-console-internal.example.com U=root P=local S=1241
2019-02-28 14:46:48 1gzL3R-0000dn-5h H=aspmx.l.google.com [2607:f8b0:400d:c0f::1a] Network is unreachable
2019-02-28 14:46:49 1gzL3R-0000dn-5h => admin@example.com R=dnslookup T=remote_smtp H=aspmx.l.google.com [173.194.68.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1551358009 w11si208223qvc.68 - gsmtp"
2019-02-28 14:46:49 1gzL3R-0000dn-5h Completed
And the email arrives.
To work around the delivery issue (not sure why exim won’t send the messages on its own), add a dirty “hack” to /etc/systemd/system/rabbitmq-notify-email@.service, the ExecStartPost option:
...
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" admin@example.com'
ExecStartPost=runq
...
To remove old messages from the queue, use their IDs:
# root@bttrm-dev-console:/home/admin# exim -Mrm 1gzVar-0003oO-Rf
Message 1gzVar-0003oO-Rf has been removed
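When several messages are stuck, the IDs can be pulled out of the exim -bp listing instead of being copied one by one. This is a minimal sketch; the function name queued_message_ids is made up, and it assumes the listing format shown earlier, where the ID is the third field of each entry's first line.

```shell
# Reads `exim -bp` output on stdin and prints one message ID per line.
# An ID is taken to be a third field made of three dash-separated
# alphanumeric groups, as in the listing above.
queued_message_ids() {
    awk '$3 ~ /^[0-9A-Za-z]+-[0-9A-Za-z]+-[0-9A-Za-z]+$/ { print $3 }'
}

# Demonstration on the queue entry shown earlier:
printf '%s\n' \
    ' 0m  1.2K 1gzL3R-0000dn-5h <root@dev.backend-console-internal.example.com>' \
    '          admin@example.com' | queued_message_ids

# On a real host, review the list before removing anything:
#   exim -bp | queued_message_ids
```

Each printed ID can then be passed to exim -Mrm as in the command above.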
Done.
Top comments (6)
I just implemented something very similar and struggled with the email not being sent.
After some investigation, I found out that the cause was that mailx is sending messages asynchronously by default, which is not compatible with the way systemd works (see the last post here). Therefore the following option to mail is necessary: sendwait, i.e. your full mail command would be: /usr/bin/mailx -Ssendwait -s "[%i] failure notification" admin@example.com
Thanks, @jmon, it's interesting.
Although I'm using mailx via a local ssmtp on my Arch Linux workstations (both at work and at home) for this, I didn't notice such issues.
I'm not at all an expert in that matter, and won't investigate further. It can probably help others in my situation.
However if you're interested, I can provide you some configuration information if you guide me a bit.
Anyway thank you for your very nice work!
What is the meaning of "%i"?
it will pass the service name to the message, for watching a single service it's not actually necessary.
the idea is that you could use the same OnFailure script on multiple services
Helpful, thanks!