You join a new team and are added to their mailing list and their alerts Slack channel.
The next day, you happen to be the first to log in:
15 notifications: a couple of applications in production are throwing errors!
You are not yet familiar with the codebase and the infrastructure. You try to follow the breadcrumbs, navigating the repo, searching in AWS CloudWatch or filtering logs in Kibana, without making much sense of any of it.
And most of all, you have no idea what to do!
Should you restart the server? Should you clear some cache? Should you inform someone?
An hour or more later, a colleague comes in and you explain the issue, with some apprehension and frustration because you were not able to identify and fix the problem.
Then he says, shrugging his shoulders:
Oh.. don't waste time looking at that. It's an error, but it's normal.
How normal?!? Why normal?!?
We received an Alert on our Slack channel warning about a critical production error and you are telling me I should not care because this is normal and it is not a worrying sign?
Yes, it's not a problem unless it happens TOO frequently.
Ah ok. And what does too frequently mean: 50 errors in an hour, in a day, or maybe 20% of the endpoint invocations, or 100%?
Really, don't worry, I checked the logs here and there, and nothing suspicious is going on. As I said, everything is fine.
Did this ever happen to you?
Unfortunately this is pretty common (and unfortunately, it also happens with unit tests or build processes: you are new to a project, you try to run it and see plenty of errors in the terminal, you ask your colleagues, worried that there is something wrong in your code or setup, and everybody rolls their eyes, telling you to just ignore them and go on).
This might not officially be what some call Alarm (or Alert) Fatigue, where developers on call have to wake up every night because of problems with the live apps.
But even without considering lack of sleep, extra hours and burnout, I find it very worrying when team members have to waste time and energy on notifications and alarms that are not relevant.
It is worrying because, when flaky tests or CI pipelines fail too often, at some point nobody cares anymore whether they are failing. And then people don't trust CI and tests anymore, and either we stop writing tests or we keep doing lots of things manually.
When there are too many alarms, and most of them just notify that something "could" be wrong or broken, at some point we stop being alarmed by that Slack notification or email, and nobody rushes to the dashboard and logs to see what happened.
Until something bad happens and nobody reacts.
Error handling, logging, tracing, metrics, alarms: all of these are often overlooked when you are rushing to implement and deliver your application.
We focus so much on implementing the features that we leave error handling as a last step. We add more and more error catches as we find them, as they occur, during testing or even when already in production. And since we often cannot know if we found them all, we end up with very broad and sensitive alarm systems that are triggered on every occasion.
This is not so bad per se, but the team needs to make the extra effort and commit to properly managing the situation as soon as possible (but again, when an error happens in production, everyone rushes into bugfixing, then the emergency is over and... "we will think about alarms next time, let's get back to our sprint!").
Finding the right balance between letting a critical error slip through and getting flooded by alarms is not that simple.
First we need to decide what an error is (an exception in our code, or something related to business logic) and how we want to handle it. Then we need to prioritize.
What is critical? What is an error that requires monitoring and possibly action? What is just a warning? What can we just log and review every now and then?
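To make that prioritization concrete, here is a minimal sketch of a severity table in Python. The error names and the levels assigned to them are invented for illustration; the only point is that the decision is written down once instead of living in someone's head.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # needs a human right now
    ERROR = "error"        # monitor, act if it keeps happening
    WARNING = "warning"    # worth knowing, no action expected
    INFO = "info"          # just log it and review now and then


# Hypothetical error names, made up for this example.
SEVERITY_BY_ERROR = {
    "PaymentProviderDown": Severity.CRITICAL,
    "UnhandledException": Severity.ERROR,
    "InvalidCustomerInput": Severity.WARNING,
    "CacheMiss": Severity.INFO,
}


def classify(error_name: str) -> Severity:
    """Return the agreed severity; unknown errors default to ERROR."""
    return SEVERITY_BY_ERROR.get(error_name, Severity.ERROR)


def should_page(error_name: str) -> bool:
    """Only critical errors should interrupt someone."""
    return classify(error_name) is Severity.CRITICAL
```

Defaulting unknown errors to ERROR is a design choice: they stay visible without waking anyone up, until the team decides where they belong.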
- Set up a dashboard that allows a quick overview of the health of your application, plus any stat that could be interesting from a business perspective: invocations, duration, soft errors (errors from a business logic perspective that are not crashes or exceptions in your code), etc.
- Set up alarms for those metrics that require immediate intervention, human intervention.
Most of our projects are based on AWS and mostly serverless.
This is great for us because all services already come with lots of metrics, and it is relatively simple to add alarms (as well as to create custom metrics).
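As an illustration of a custom metric for soft errors, here is a minimal sketch that builds a log line in the CloudWatch Embedded Metric Format. The namespace `MyApp`, the dimension `Service` and the metric name `SoftErrors` are assumptions made up for this example, not names from our projects.

```python
import json
import time


def soft_error_metric(service: str, count: int = 1) -> str:
    """Build one Embedded Metric Format log line for a soft error count."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": "MyApp",  # hypothetical namespace
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": "SoftErrors", "Unit": "Count"}],
                }
            ],
        },
        # Values for the dimension and the metric declared above.
        "Service": service,
        "SoftErrors": count,
    }
    return json.dumps(record)


# Printing the line is all this sketch does; in a Lambda function,
# stdout ends up in CloudWatch Logs.
print(soft_error_metric("checkout"))
```

The appeal of this approach is that a soft error becomes a metric you can put on a dashboard or alarm on, without raising an exception in the code.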
You can use static thresholds, or anomaly detection if the absolute numbers could be misleading, especially if your application has very different usage at different times of the day. (A static threshold of 5 errors in the last 30 minutes might never be met during the midday high traffic, but could trigger the on-call duty overnight, when requests are very few but all failing.)
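A small sketch of why absolute numbers can mislead: the same 5 errors mean very different things depending on traffic. The thresholds and request counts below are invented for illustration, and the rate check is a simple stand-in, not CloudWatch's anomaly detection.

```python
def static_alarm(errors: int, threshold: int = 5) -> bool:
    """Fire when the absolute error count reaches the threshold."""
    return errors >= threshold


def rate_alarm(errors: int, requests: int, max_rate: float = 0.2) -> bool:
    """Fire when the share of failing requests reaches max_rate."""
    if requests == 0:
        return False  # no traffic, nothing to judge
    return errors / requests >= max_rate


# Hypothetical 30-minute windows: (errors, requests)
midday = (5, 10_000)  # 0.05% of requests failing
night = (5, 5)        # every single request failing

for name, (errors, requests) in {"midday": midday, "night": night}.items():
    print(name, static_alarm(errors), rate_alarm(errors, requests))
```

The static check cannot tell the two windows apart, while the rate check only fires for the night window, where the service is effectively down.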
If you are using Slack to deliver alerts and alarms, try to reduce the noise.
- Use different channels for different projects, so that team members can silence the notifications of the apps they are not "taking care of".
- Use different colors/levels of notification, so that even if you receive 20 alerts you can immediately spot that red critical one.
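Both ideas can be sketched in a few lines of Python. The channel naming scheme and the hex colors are assumptions for this example; the payload relies on the `color` field of Slack message attachments, and actually sending it (for instance through a webhook) is left out.

```python
# Hypothetical mapping; the hex values are arbitrary choices.
COLOR_BY_LEVEL = {
    "critical": "#d00000",  # red
    "warning": "#f2c744",   # yellow
    "info": "#36a64f",      # green
}


def channel_for(project: str) -> str:
    """One alerts channel per project, so people can mute what they don't own."""
    return f"#alerts-{project}"


def build_payload(project: str, level: str, text: str) -> dict:
    """Build a Slack message whose colored bar reflects the alert level."""
    color = COLOR_BY_LEVEL.get(level, "#808080")  # grey for unknown levels
    return {
        "text": f"[{level.upper()}] {project}",
        "attachments": [{"color": color, "text": text}],
    }


print(channel_for("checkout"))
print(build_payload("checkout", "critical", "Error rate above 20%"))
```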
If you are using email, a nice trick we have also used is email filters.
Gmail, for example, has an interesting feature: everything you add to your email address after a + is basically not considered part of the address, but it can be used in your filters, for example firstname.lastname@example.org or email@example.com. You can therefore create Gmail filters based on that recipient address and aggregate or filter out different types of notifications.
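To show what such a filter keys on, here is a minimal sketch that splits a plus-address into the real mailbox and the tag. The address and the tag `alerts-critical` are made up for this example.

```python
def split_plus_address(address: str) -> tuple[str, str]:
    """Return (base_address, tag); tag is '' when there is no '+' part."""
    local, _, domain = address.rpartition("@")
    base, _, tag = local.partition("+")
    return f"{base}@{domain}", tag


# Hypothetical recipient used when subscribing to critical alarms.
print(split_plus_address("jane.doe+alerts-critical@example.com"))
```

Mail sent to the tagged address still lands in the base mailbox, so one subscription per tag gives you one filter (and one label) per type of notification.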
Who has to look at errors, when, and how often?
Does a service require on-call duty?
If something can wait, who will check alarms and errors first thing in the morning?
You can have all the dashboards you like, you can pile up notifications and emails, but none of this is of any help if no one is actively checking them and taking care of them.
Before the pandemic, when we were working in the office, we had a huge screen showing all our dashboards, so if something was RED, literally anyone could notice and shout it out.
Alone at home, everyone sees the exclamation mark or the number on Slack or in their inbox, but busy as we always are, we try not to get distracted and we lie to ourselves:
Should I have a look at them? Naaa, I am pretty sure someone else is having a look.
So.. who should keep an eye on those alarms? Who should investigate and react if something is going on?
Try to be clear about that and be fair. Otherwise you will end up with some problems going unnoticed, with some very dedicated team members burning out because they have to work on their tasks and also spend time on the alarms (time not counted in estimates and planning), or with two (or more) people wasting time on the same alarm and doing the same investigation.
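One simple way to be both clear and fair is a weekly rotation. This is only a sketch of the idea, with made-up names; it ignores holidays, swaps and time zones.

```python
from datetime import date


def on_triage(members: list[str], day: date) -> str:
    """Pick one member per Monday-to-Sunday week, cycling through the list."""
    week_index = (day.toordinal() - 1) // 7  # ordinal 1 is a Monday
    return members[week_index % len(members)]


team = ["Ada", "Grace", "Linus"]  # hypothetical team members
print(on_triage(team, date.today()))
```

Whoever is returned for the current week owns the alerts channel, and that time can be counted in planning instead of being invisible extra work.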
What is a runbook? It is a detailed list of the steps that need to be taken whenever an error occurs or an alarm is triggered.
- What is that error? (explanation and hints of possible causes)
- Where does it come from? (links to logs, dashboards and repo)
- Where should I start investigating? (links to documentation or areas of code/configuration known to be problematic)
- What can I do to fix it?
- What should I do to mitigate the problem, if I am not able to fix it?
- Who should I contact to escalate the problem, or if I realise that it is not in my domain (maybe a microservice from another department started crashing and our app can't function without it)?
Having such a list is of great help because it standardises the approach to problems, it avoids having "experts" in the team who are the only ones able to investigate and fix issues, and it frees every team member from the stress of not clearly knowing what to do when something bad happens (and we are alone in the office, or at home).
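The questions above map naturally onto a small data structure. Here is a minimal sketch, assuming you keep runbooks next to the code; the field names and the example alarm are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    alarm: str                    # name of the alarm this runbook covers
    what: str                     # explanation and hints of possible causes
    sources: list[str] = field(default_factory=list)  # logs, dashboards, repo
    start_here: str = ""          # where to start investigating
    fix: str = ""                 # what can I do to fix it
    mitigation: str = ""          # what to do if I cannot fix it
    escalate_to: str = ""         # who to contact

    def missing_sections(self) -> list[str]:
        """List the sections nobody has filled in yet."""
        return [name for name, value in vars(self).items() if not value]


# Hypothetical, half-written runbook.
rb = Runbook(
    alarm="checkout-5xx",
    what="The checkout API returns 5xx responses.",
    sources=["<link to dashboard>"],
)
print(rb.missing_sections())
```

A check like `missing_sections` could run in CI, so that an alarm without an answer to "what should I do?" is noticed before the night it fires.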
- Don't overlook/underestimate error handling, metrics and monitoring. Allocate time for that.
- Define priorities and responsibilities
- Plan ahead
- Don't let errors go unnoticed
- Stay calm
How do you usually tackle Alerts and Errors in production?
What tools do you use? Do you have any tips to share?