No engineer or dev always writes error-free code. In this line of business (software engineering), errors and bugs will always be around and can definitely be a huge pain in the a** 🤬. Errors can appear at any stage, from implementation through sandbox environments to production and already-shipped products. The main aim of this article is to help you prevent, mitigate and handle errors properly. OK, let's go!
The main things this article will focus on are:
- Detailed, self-explanatory errors
- Proper notifications management
- Quick response
- Thorough root cause analysis (RCA)
- Long term fixes
Errors can be difficult and extremely cumbersome to find. The last thing you want after combing through lines of code or log statements is an error message so vague that you wish you hadn't found it. It gets worse when more than one service is involved.
At the very least, be clear and unambiguous with your error responses and their related status codes.
For example, let's say a user who does not have access to a particular resource tries to access it, and your service keeps responding with a 204 status code and a "no content" message. Any new dev, or perhaps your future self, who does not know that the user lacks permission may start debugging why the resource isn't being sent back, because the response is misleading. A more appropriate response would be a 403 status code with a message such as "forbidden: you do not have permission to access this resource" (or a 401 if the user has not authenticated at all).
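To make that concrete, here is a minimal sketch of mapping access outcomes to explicit status codes and messages. It assumes the common HTTP convention that 401 signals missing or invalid credentials while 403 signals an authenticated user who lacks permission. The `AccessResult` type, the `toHttpResponse` function and the message wording are all hypothetical names made up for illustration, not part of any framework.

```typescript
// Hypothetical outcome of an access check; not tied to any framework.
type AccessResult =
  | { kind: "ok"; data: string }
  | { kind: "unauthenticated" }
  | { kind: "forbidden"; resource: string }
  | { kind: "notFound"; resource: string };

interface HttpResponse {
  status: number;
  body: { message: string } | { data: string };
}

// Map each outcome to a status code and a message that says what went wrong.
function toHttpResponse(result: AccessResult): HttpResponse {
  switch (result.kind) {
    case "ok":
      return { status: 200, body: { data: result.data } };
    case "unauthenticated":
      // The caller has not proven who they are.
      return { status: 401, body: { message: "authentication required" } };
    case "forbidden":
      // The caller is known but is not allowed to see this resource.
      return {
        status: 403,
        body: { message: `you do not have permission to access ${result.resource}` },
      };
    case "notFound":
      return { status: 404, body: { message: `${result.resource} was not found` } };
  }
}
```

With something like this in place, a denied request never masquerades as an empty success, so whoever reads the response knows where to start looking.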
Logs are essential for tracing the inner workings of an application, especially during investigations. Many of us have found ourselves in situations where the production server is down and we have to rummage through the logs for the cause. That alone is reason enough to make sure logs are clear and cover most functions of your application, so that they provide an adequate trail of previous actions. Thankfully, there are many well-known packages that already handle most of the heavy lifting for us. Winston is a good example of such a logger. You may also want to consider the various log levels used throughout the application:
- Info level logs - these usually record user-driven actions that happen while the program runs. They can also serve as early hints that point to potential causes of errors.
- Notice level logs - use these when something notable happens, such as an error that deserves a second look.
- Warning level logs - these highlight events that may drive the application into a failed state. For example, warning logs could be set up to trigger alerts when a notable error such as a status 500 happens more than 10 times in 5 minutes, or when cache storage is nearing its capacity.
- Error level logs - At this level, anything that is regarded as an error must be logged. This may be an internal error or an error response.
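The levels above can be sketched as a tiny leveled logger. This is only an illustration under stated assumptions: the `LeveledLogger` class and the severity ordering (info < notice < warning < error, borrowed from syslog-style severities) are hypothetical, and in a real project a package such as Winston would take the place of this hand-rolled code.

```typescript
type Level = "info" | "notice" | "warning" | "error";

// Higher number = more severe. The ordering is an assumption for this sketch.
const SEVERITY: Record<Level, number> = { info: 0, notice: 1, warning: 2, error: 3 };

interface LogEntry {
  level: Level;
  message: string;
  timestamp: string;
}

class LeveledLogger {
  // Kept in memory so the sketch is easy to inspect; a real logger would write elsewhere.
  readonly entries: LogEntry[] = [];
  private readonly minLevel: Level;

  constructor(minLevel: Level) {
    this.minLevel = minLevel;
  }

  log(level: Level, message: string): void {
    // Drop anything less severe than the configured minimum.
    if (SEVERITY[level] < SEVERITY[this.minLevel]) return;
    const entry: LogEntry = { level, message, timestamp: new Date().toISOString() };
    this.entries.push(entry);
    console.log(`${entry.timestamp} [${level.toUpperCase()}] ${message}`);
  }
}

// Usage: production might only keep "notice" and above.
const logger = new LeveledLogger("notice");
logger.log("info", "user opened the dashboard"); // dropped
logger.log("warning", "cache storage is nearing capacity"); // kept
logger.log("error", "upstream service responded with status 500"); // kept
```

Filtering by a minimum level lets you keep chatty info logs in a sandbox while production only records what is worth investigating.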
It is also important to add context to your log messages. Contextless error messages are basically noise, since they contribute little to investigation or troubleshooting efforts. You can add context to a log message by briefly stating the actions or processes surrounding the error. For example, imagine working on a banking application that throws an error saying only that a transaction failed.
In many ways this error is very vague, because many types of transactions and routines happen in a banking application. Noting the transaction ID and the type of transaction gives the log much greater meaning, for example:
userOnlinePayment Transaction 394fs948dne7ndue84nd74: failed
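A small helper can keep that context consistent across the codebase. The sketch below is an assumption-laden illustration: the `TransactionLogContext` shape, the `formatTransactionLog` function and the optional `reason` field are hypothetical, and only the message format itself comes from the example above.

```typescript
// Hypothetical context that every transaction log line should carry.
interface TransactionLogContext {
  transactionType: string;
  transactionId: string;
  outcome: "succeeded" | "failed";
  reason?: string; // optional extra detail, an addition made for this sketch
}

// Build a log line in the "<type> Transaction <id>: <outcome>" format.
function formatTransactionLog(ctx: TransactionLogContext): string {
  const base = `${ctx.transactionType} Transaction ${ctx.transactionId}: ${ctx.outcome}`;
  return ctx.reason ? `${base} (${ctx.reason})` : base;
}

const line = formatTransactionLog({
  transactionType: "userOnlinePayment",
  transactionId: "394fs948dne7ndue84nd74",
  outcome: "failed",
});
console.log(line); // userOnlinePayment Transaction 394fs948dne7ndue84nd74: failed
```

Forcing callers to pass the type and ID means nobody can log a bare failure by accident.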
As said earlier, errors are part of the development process and can't be entirely removed. Once they start occurring at an alarming rate, you need a proper structure in place to inform all the relevant devs. In many ways this is similar to warning and notice logs. Various cloud providers, such as Google Cloud Platform, provide out-of-the-box solutions such as monitoring tools and notification channels to inform engineers whenever necessary.
If left uncontrolled, however, this can lead to notification overload. In a DevOps setup, it is therefore important to map each piece of infrastructure to the right application. This not only relieves some devs of excessive notifications, but also ensures that notifications reach the right devs. Managing the frequency of notifications also reduces desensitization to them.
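One way to combine an alert rule like "more than 10 errors in 5 minutes" with a cap on how often devs get pinged is a sliding window plus a cooldown. The sketch below is a minimal illustration, not how any monitoring product works internally. The `ErrorRateAlerter` class and its parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```typescript
class ErrorRateAlerter {
  private timestamps: number[] = [];
  private lastAlertAt: number | null = null;
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, windowMs: number, cooldownMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  // Record one error at time `nowMs`; returns true if a notification should be sent.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    // Keep only the errors that fall inside the sliding window.
    const windowStart = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > windowStart);

    // Alert only when the count exceeds the threshold.
    if (this.timestamps.length <= this.threshold) return false;

    // Stay quiet during the cooldown so devs are not flooded.
    if (this.lastAlertAt !== null && nowMs - this.lastAlertAt < this.cooldownMs) {
      return false;
    }
    this.lastAlertAt = nowMs;
    return true;
  }
}

// Usage: more than 10 errors in 5 minutes, at most one notification every 15 minutes.
const alerter = new ErrorRateAlerter(10, 5 * 60 * 1000, 15 * 60 * 1000);
const decisions: boolean[] = [];
for (let i = 0; i < 12; i++) {
  decisions.push(alerter.record(i * 1000)); // one error per second
}
// Only the 11th error triggers a notification; the 12th falls inside the cooldown.
```

The cooldown value is a judgment call: too short and people start ignoring the channel, too long and a second incident may go unnoticed.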
For most companies, server or service downtime means financial loss, which is bad for everyone. Reacting quickly in such moments is therefore of utmost importance. All the earlier points lead up to this one because they feed into troubleshooting. Troubleshooting matters not only for finding the error, but also for assessing its severity and how long it will take to solve. This information could be the deciding factor between a rollback and a total shutdown. In some cases a rollback is the better option, because it buys time to dive into the RCA and produce long-term fixes as quickly as possible.
After a fix has been deployed or a rollback done, the next course of action is to prevent the error from happening again. RCA discussions are key to understanding the context around the problem while pinning down the point of failure, along with any others that exist. RCAs are better done as a group, since the group brings more diverse perspectives on the problem. An RCA also means providing long-term solutions, such as identifying all contributing factors, including devs who may need extra training or mentorship.
Unlike short term fixes that handle the immediate problem, long term fixes are set in place to make sure errors do not recur while strengthening various failure points.
An example of the difference between a short-term and a long-term fix is as follows. Let's say two helper methods call one particular model method to insert data into the DB. However, only one of these helper methods implements a mandatory data mutation before the data is stored. This creates data inconsistency in the DB, which affects the endpoints that retrieve this information. The short-term fix would be to implement the data mutation in the helper that lacks it, to stop dirty data from entering the DB. The long-term fix, on the other hand, would be to move the data mutation logic into the model method so that the issue can never happen again, and then write a script to cleanse the already inserted data if that data is frequently used. In this example, the long-term fix not only solves the issue but also enforces proper code structure and industry standards.
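Here is a rough sketch of what that long-term fix could look like. Everything in it is assumed for illustration: the article does not say what the mutation is, so email normalization stands in for it, and `UserModel`, `registerFromSignupForm`, `importFromAdminPanel` and `cleanseRows` are hypothetical names, with an in-memory array standing in for the DB.

```typescript
interface UserRecord {
  email: string;
}

// The "mandatory data mutation" (assumed here to be email normalization).
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// In-memory stand-in for the model and its DB table.
class UserModel {
  private rows: UserRecord[] = [];

  // Long-term fix: the mutation lives in the model, so no caller can skip it.
  insert(record: UserRecord): UserRecord {
    const normalized: UserRecord = { email: normalizeEmail(record.email) };
    this.rows.push(normalized);
    return normalized;
  }

  all(): UserRecord[] {
    return [...this.rows];
  }
}

// Two helpers that share the same model method and no longer need their own mutation.
function registerFromSignupForm(model: UserModel, email: string): UserRecord {
  return model.insert({ email });
}

function importFromAdminPanel(model: UserModel, email: string): UserRecord {
  return model.insert({ email });
}

// One-off cleanup for rows that were stored before the fix.
function cleanseRows(rows: UserRecord[]): UserRecord[] {
  return rows.map((row) => ({ email: normalizeEmail(row.email) }));
}

const model = new UserModel();
registerFromSignupForm(model, "  Ada@Example.COM ");
importFromAdminPanel(model, "BOB@example.com");
console.log(model.all()); // both emails are stored trimmed and lowercased
```

The short-term fix would have been to copy `normalizeEmail` into the second helper, which works until a third helper shows up and forgets it.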
These steps help to put most errors to bed and make the whole development process more manageable. Share your thoughts in the comments if you have any! Catch you in the next one 😁😁