To check if a program is doing what it should, you can inspect the output from a given input. But as the system grows, you also need logging to help you understand what is happening. Good log messages are crucial when troubleshooting problems. However, many developers don’t log enough information in the right places.
I look at logs when I troubleshoot a problem. Sometimes there is a bug causing the wrong behavior. Other times there is no problem – the system behaves correctly, but we had the wrong expectation of what it should do. In both cases, the logs help me understand what is going on in the system. I also check the logs when I am doing exploratory testing after I have finished a new feature. As I play around with the feature, I check that I get the expected result. But I also look at the logs to see that they agree with what I expect.
As a rule of thumb, I like to log when a task starts, transitions, and ends. If it is short, it may be enough to only log at the end. I also like to log in all the places where a task can fail. Here are some further tips for getting the most value from log messages:
Even though modern logging systems can handle large volumes of log messages, it is good to limit what is logged. Some of the worst cases of excessive logging also provide little value. For example, if the skipping condition is simple, “Skipped message of type margin_data_available” is just noise and should be removed. Also, if point A in the code is always followed by point B, then it can be enough to log at point B (possible including information from point A), instead of two log statements.
Even though logging too much is sometimes a problem, it is far more common to not log enough. This especially applies to failures of all kinds (both expected and unexpected). Often when debugging a problem, you want to know how and why a task failed. If there isn’t logging for all the failure modes, then you will sometimes be left guessing.
Frequently, the log message isn’t as helpful as it could be. For example “Connected to invoicing server” can be improved by adding the server IP address and port. Another example: at work, we initialize an in-memory database by reading calendar data from a file. The log message when creating the database includes the name of the file, and the number of lines of data the file contained. This log message was key when troubleshooting a problem due to a truncated data file. In general, try to add as much dynamic information as possible
Even if the log message automatically includes the file name and line number, it is a good idea to make the text in the log message unique and easy to grep for. This makes it easy and fast to find the log statement in the code. If the log message is built from strings, make sure it is still easy to grep for. “Failed: %s %s” in a lot of places makes that hard. If it is difficult to describe the different cases in text, you can even use “Failed (1): %s %s” etc to tell cases apart.
If the logging is done in a function that is called from many places, consider adding an extra argument to the function for a string describing which case it is. That way, it is clear in the log message which case it is from.
When there is some logic on what to do, I like to put that logic in a function that can be unit tested by itself. If the logic is complicated, I also want to generate some log message explaining the decisions taken. There are many ways of doing this, but one way I like is to return a tuple. The first element of the tuple is the result, and the second element is a list of log messages to output (which may be empty). After the function is called, if the message list is not empty, the entries are concatenated and output as a log message.
This way, it is easy to test both that the result is correct, and that the log messages are what I expect. The reason it is a list and not a string is that often there can be more than one important aspect on how the decision was taken. For example, the configuration is forcing the margin type, and the start date of the agreement is in the future. Both of these may be relevant to the decision that the agreement should not be invoiced this period. Also, even though there are multiple items in the list, it will still only be one log statement, not several, which cuts down on the logging.
Occasionally I see log messages like this left in the code: “#### Could not fetch agreement data for ABC-DEF-CSA”. Inevitably, there will then be similar messages that start with, for example, “####—-“ etc. Eventually it all devolves into a contest of whose log message can scream the loudest to be noticed. The log message should not contain any special characters to grab your attention. Let whoever is looking at the logs decide what they want to look at. The only acceptable differentiator is to use different log levels, like WARNING and INFO.
It is hard to get the log messages right on the first attempt. It often takes me several tries before I am happy with what I see. Look at the output, and tweak the messages as necessary. For example, I was logging the agreement UUID when it would not be invoiced. But it turns out that it is much more helpful to log the agreement name and party name instead of the UUID, so I changed it.
Furthermore, if you are troubleshooting a problem, and the logs don’t have all the information you need, remember to add it.
Sometimes, logging can seem redundant, but still be useful. For example, at work we have a heartbeat API-call every minute. There is a metric for this, and an alert is sent if it is not received. However, I also added logging when it is received. Since it is only every minute, it does not generate too much output. It helps understanding what is happening in the system. Also, once when there was a problem with the metrics collection, it was good to be able to check the logs and see that the heartbeats were still coming in.
Troubleshooting is close to my heart, and I have written about it and its relation to logging before: Great Programmers Write Debuggable Code, Finding Bugs: Debugger versus Logging and Session-based Logging.
Good logging is really helpful when troubleshooting. The more you troubleshoot, the more you appreciate logging in the right places, with the right information. When developing a new feature, ask yourself what information you would need if it is not working, then add the appropriate log messages. What is your experience of logging? Let me know in the comments.