Javi

Posted on Aug 27, 2021

A whole post about logging? Hell Yeah

#logs #docker #architecture #elasticsearch

Introduction

When I started programming, logging was obviously not in my top 10 priorities. It wasn’t even on the list. In fact, I didn’t know what logging was. All I knew was to cout or Console.WriteLine, and it was enough.

Then I began to work. I understood very rapidly that logs are important. What you write, when you write it and how easy it is to understand from them is crucial for the success of a software.

One of the first tasks I had was to take care of a legacy system. This system wrote logs to the Windows Event Viewer. Never heard of that, but I quickly managed to get familiar with it. Windows Event Viewer (eventvwr.msc) is a nice GUI that comes with any Windows distribution that lets you see all your system’s logs. It is very easy to write to it and to read the logs from it.

The logs were (and, sadly, still are) awful. Warning level used to debug (I assume the reason was because the yellow color is easier to recognize and filter). Names of the developers in the logs. No real information on steps, file names, data and such. Exception stack traces from the developer’s computer (with the developer’s path), and more. You get the deal.

But the worst wasn’t any of that, the worst was the typos.

It came to a point where I started a table with the typos in the logs. You may think I’m a snob, but there were typos everywhere. So what? Big deal! No, it is bad, it made my work a lot more difficult. In order to debug an error, I would go to the source code and lookup for the error. But because of the typos, I never found the text on the first search. I had to make the same typos to find it. For example, small typos were of the type “exeption” instead of “exception”, but there were bigger ones when you really had to read it twice or more to understand the word hiding there.

Lately, I started thinking about things that surround the code. Meaning, not the code and logic of the system, but how it will run, what monitors it, how we’ll deploy it and more. For example, we’ve always written systems that run on Windows Services, now we are writing Containers (more about that here and here). Soon enough I got to think about the system’s logs. Remembering my difficult times I swore to never make something like that and to bequeath that kind of thinking to my peers. I want to take logging to the place they deserve. I want to be proud of my system’s logs.

Let’s start with a simple question.

Why do you need logs anyway?

Logs are good for investigating. If you write a piece of software for homework, then you don’t need to investigate it. Bugs or such are not important nor interesting once the homework was handed in. You can still have logs and it is always a good idea, but they aren’t required.

Logs start to become important, ney, necessary, once the software you write is for more than homework. This situation enforces you to have the ability to correctly investigate what is happening in your system. If an error occurred, you must have the possibility to understand why it happened, be able to restore it and, eventually, fix it.

Logs describe what is happening behind the scenes. It should be enough for a technician to solve problems only by looking at the logs without looking at the code. Logs can alert on issues happening in real-time, they can trigger automations and more.

We now understand that logs are important, we can move on to how to write them.

Let’s start by the log’s levels. Log’s levels are important to understand and correctly use them. Incorrectly used log levels will lead to two kinds of logs: An empty log or a messy log.

If you don’t write anything on the lower levels, you’ll need to be on the higher levels all the time just to see something, this will make your log an empty log for everyday usage.

If you write lots of logs on the lower levels, you’ll have too much information on the log. This will require you to use a lot of memory and disk for processing and storage, leading to the need to delete logs or pay a lot.

You want to find that sweet spot in the middle. It’s not a thin line, but it can be easily missed.

The levels

Different logs’ framework might call their levels different, but these are the common ones:

Trace/Verbose – “So much information” – The lowest level. This level should be able to describe almost line by line what is happening behind the scenes. Parameters values, function calls and such. Exceptions show a full stack trace. This is the technician’s level.

Debug – “Many informations” – This is sometimes mistakenly seen as the lowest level, although there are frameworks that this is actually the lowest level. This level should contain almost the same information as Verbose level with a bit less information and description on inner parameters and calls. If this is the lowest level on your framework, use it as Trace/Verbose above. In the end, it should help you Debug.
Information – “This is what’s happening” – I call this: The Default level. This should contain contour-like information of the process. What is happening, who is interacting with whom and such. This level should be enough to understand what is happening behind the scenes without giving too much information.
Warning – “This flow isn’t entirely ok, but I know how to handle this”. These logs should warn that something bad is happening but it is a handled flow. This can be an expectation failure, input validation failed, a directory missing that we’ll be created and such.
Error – “This flow isn’t ok. I still can handle this, but it will be defaulted and you may not continue with your desired flow”. The best way to think about these logs is in catch blocks, but not only. An error has occurred, the system will not function as it should, but it will continue working. Maybe the current request will fail but upcoming requests might work correctly.

Fatal – “We are doomed”. This means the system will most probably crush. An unhandled exception happened. We don’t know how to recover from this. Manual interference is mostly required.

Frameworks

In the past (up until a few months ago) I thought that libraries that let you distribute your logs to different destinations are the best and a must-have.

For example, the log4net library (log4j for dotnet) lets you write in an xml configuration file, where do you want your logs destined to. It can be a rolling file (a file that gets rolled by date or size), it can be to the console (stdout, stderr) and more.

Log4Net is the first framework I used and it has a special place in my heart. A few years ago, I had to add a feature to an existing system. This system was written many years ago and the logging was custom-made. After adding the feature I was asked for, I asked permission to change all the custom-made logging framework to log4net. This small change allowed later to better the loggings by a lot.

Every language has at least one framework (not to say tens of frameworks) for logging and they all do their thing pretty well.

In .Net Core, logging has even a default dependency injection implemented that you can use (ILogger).

Three months ago (-ish) we deployed our first system entirely based on Docker to a staging environment. We’ve been experimenting a lot with Docker but have never deployed such a system.

I was familiar with the docker logs command, that showed the stdout/stderr of the container, but I haven’t stopped to think about what was happening there. One day I saw the 12Factor article about logging, it said to only write logs to stdout/stderr and let your environment collect those logs for you. This is when I connected the dots. If Docker was able to show me those logs, it probably could write them somewhere. And I was right. Docker collects your logs (stdout/stderr) and writes them into inner log files. Suddenly, all the power of the frameworks to let you decide where you want your logs is dismissed by just writing the logs to the console and letting another system deal with them. Moreover, having to set up the logging destination of each microservice in their configuration instead of just using docker’s logging, was a real drawback.

Why is this important? First, this allows you to use the same logging mechanism on any environment (Dev, Test, Staging, Production and more). You just need your environment prepared to handle them. We move the logging responsibility from the developer of the microservice, to the environment handler.

Secondly, no more configuration files telling the logger where to write the logs to. No more empty commits that only change the configuration file. No more deciding on different formats and then changing them.

What do we do now?

I’ve set up a BELK environment for the logs.

I have a FileBeat container listening for Docker’s log files (it even has a “Container” input type!). FileBeat knows how to talk to the docker daemon to receive information on that container (image name, version, host name and such).

This FileBeat talks to a Logstash container that distributes the logs to the many systems we need to talk to. One of them is a Kafka system from which another Logstash receives the streams and inserts them into an ElasticSearch database and this is all visually from Kibana.

Wow. So much buzz words!

This same method can be cloned almost without any change to any environment or even system.

What do the developers need to do?

Just write their logs. Write away boys! Write logs without caring for anything!

We write the logs in a JSON format to be able to manipulate them easily in Logstash. But other than that, the developer just has to write whatever he wants, he will be able to see it in the console while developing and nothing has to be changed in order to make it production-ready.

Writing logs in JSON format? Why not just plain text?
Let’s say you have a method that needs to read from a file using a session Id.

You could write a log that looks like this:

“Successfully read file [{fileName}] for session [{sessionId}]”
This would look good to the naked eye, but it is not very useful for automations, statistics and graphs. Let’s say I wanted to get a list of all the files I read. I would have to make a regex that extracts that info from the text. It is possible, but it will only get harder and harder.

Instead, you could write a log that looks like this:

“{ “message”: “Success reading file”, “file”: fileName, “session”: sessionId } “

This doesn’t look so bad to the naked eye, and it is machine ready for indexing, creating statistics and such.

I’ve even seen examples where each log type has a unique ID. For our system this is currently an overhead.

We’ve made a nice infrastructure to ease the writing of the logs (of course, how could we live without an infrastructure). We made a list of the most common values we wrote to the log and made them ‘keys’ for our Json logs. This infrastructure is consumed via a small nuget package containing a lot of helper functions.

The best part? They still don’t care where the logs go. They just need to see them on the console and everything else will be taken care of.

What have I learned?

Logging is not as easy as it seems, it can be easier than you think and it is crucially important. You need to think about how it will work from the beginning of the system design. It is crucial to it to be as simple as possibly for the developers and for the sysops. It must be flexible for changes and destinations. For example, using Logstash really helped us when another client suddenly appeared and asked to receive some logs from our system too.

It’s very important to review the logs written; is the level correctly used? For the love of G’d, use a spellchecker, they are free! Have a PR made only for the logs. All of these will help you down the road.

Remember, logs live forever.