Edouard CATTEZ

Posted on Feb 6, 2020

The Art of Talkative Application

#logging #logs #devops

There are a lot of articles about this already. Yes, logs and logging system in general is not new. However, our way to develop and to deploy applications in production have changed.

Tools and IDE increase development performances. The phenomenal amount of libraries prevents reinventing the wheel. It seems that today, development is more an assembly of blocks than pure code skills. Does less code leads to fewer logs ?

On the other side, Machines were replaced by Virtual Machines. Virtual Machines were replaced by Containers. Does everything you know about log rotation makes sense in a container world?

Maybe we should go back from the essence of logging. Basically, why our application should “talk” and how? What is the purpose?

I’m pretty sure that there is something essential in logging, regardless of technological choices and technological advances.

Why is it talking?

To achieve your Goal. That’s it.

In fact, logs are messages sent by the application to tell us what is going on. But if we do nothing with such information, it is useless to send them at the first place.

A good starting point is to understand our goal. What do we want to achieve? Do we need functional stats? Do we have an economic strategy based on those logs?

Our goal will directly affect:

the format: field names are one example, log levels are another;
the data: you do not have to log exceptions if you want application metrics.

I found two common mistakes while developing a project:

Not logging at all;
Logging too much.

Not logging at all is awful because you don’t know what is going on. However logging too much is worse. You get noise composed of unimportant information, and you increase the costs for storage and retention.

Once you understand the goal you want to achieve with your logging, you can start creating a strategy that defines what you really want to log. What you really want to log are usable data.

How does it talk?

With the use of a Logging framework.

Logging frameworks provide mechanism to implement logging in a standard and easy-to-configure way. They allow developers to control comments, set log levels, configure multiple formats and destinations, set a log rotation strategy, and more.

As said above, in advanced logging systems, we have a log level, indicating a degree of information, more or less useful for a part of us.

TRACE

Useful if I want to identify the code of a specific feature. Not to be activated in a production environment! Indeed, this level speaks a lot.

DEBUG

Any diagnostic information useful to more people than just developers (sysadmins, ops, …). If the DEBUG level is used in production, it must be limited to the application code. The framework layer (spring for example), tends to talk a lot also in this level.

INFO

In general, information about the normal state of the application (start / stop service, configuration assumptions, etc.). These are information that we always want available, but they keep in mind that they are useless in a normal state.

WARN

Anything that can potentially cause an application problem, but replaced immediately by an alternative solution. By this, we mean to log the error of a try / catch which does not prevent the method from working; to log an attempt to reconnect; to log the absence of non-mandatory data; to log a slave server becoming master of a cluster, indicating the stop of the previous master, …

ERROR

Everything that causes an error and a sudden stop of an operation, but not of the application (impossible to open a file, impossible to recover primary data, fail to connect to a database, …). These errors force the intervention, at least the investigation of the team.

FATAL

Any error causing the service or application to stop. This kind of error is logged to avoid data loss, or future data loss through the service now off. This is a record reserved for the most heinous mistakes and situations where corruption or loss of data is guaranteed.

Remember that the more standardized the logs are, the easier it will be to use them. Setting standards is obviously not a snap of fingers (do not erase half of the population by trying this). It will take time for your log messages to be built the same way (even more when achieved through all your applications).

Format helps get these standards:

Create a standard for formatting: camelCase, snake_case, kebab-case?
Create a naming convention: currency or currency_symbol?
Use a human-readable structuring: JSON? KVP (key = value pairs)? Something else?

What does it say?

It tells you about context.

Being concise and recording short messages is, in general, a good law to respect. However there is a huge difference between writing concise logs and writing useless logs.

Consider the following message:

20190331 10:01:00 unauthorized.

It is short, but useless.

20190331 10:01:00 user_id=781227 action=login result=failure http_status=401

Adding contextual information to your messages creates a story and allows you to understand and analyze the data more easily.

Part of the context that can be added to the logs consists of a bunch of meta-data. Common examples are the name of the application, the name of the function, the name of the class, and so on.

When investigating a specific event through logs, especially in micro-service architectures, it is very easy to get lost in the data. Without having a kind of map to use as a reference, it is almost impossible to follow specific actions on all the different services involved in the same transaction.

Adding unique tags or identifiers to logs, where possible, will help you get back the story, in the correct order. These tags can be whatever makes sense to you: a user identifier, a UUID, something-else-id.

In my experience, we used a correlation identifier also called a transit identifier. It is a unique identifier that is attached to requests and messages that reference a particular transaction or event chain.

Listen to it

Now that our application is speaking, we should open our ears and listen what is going on. In the introduction, I spoke about containerization.

In the good-old-way, logs are stored in standardized files. In fact, path and name respects patterns and files are rotated each hour, each day, each month.

The Rise of the Cloud (a new Star Wars movie?) helped get containerized environments in business information systems. Container used to be small, easily disposable, easily replaceable. In such a philosophy, it is incoherent to store the logs within the container. So basically, in a world of containers, our previous rules could only be available using shared volumes. But should it be? In fact, a good practice in containers is to log onto the standard output.

DLC stands for Dedicated Logging Container

Dedicated logging containers let you manage logging from within the container environment. They can retrieve log events from other containers, aggregate them, then store or forward the events to a third-party-service such as ELK Stack.

Aggregation makes a lot of sense when you deal with multiple applications. Because multiple applications involve a lot of noises, DLC became an effective choice of architecture and helps you focus on data-analysis. In this model, DLC acts as a “log vacuum”. It can also manage logs for specific containers if it has specific constraints. In such a case, we discuss about Sidecar (1 listening-container per application).

What I learnt

To conclude, I would like to talk about my own experience. I worked as a consultant for a French group specialized in Social Protection. We were developing a web application with personal customer services.

Technically, we were in a clustered environment with no log aggregating system. However, for each error, we received a message with the host name from where the application was deployed.

With the right access, we logged onto the host, then we listed the rotated /var/logs/app-name.log. Our goal was to find out which actions were made from our customer and why.

[20170403 08:50:43 - DEBUG - <domain> <hostname>]
  customer_id=IA12345678
  service=retirement_beneficiary
  action=get_payments
  result=success
  message="Partial list of payments sent to customer"
  page=1
  size=20
  page_max=3
[20170403 09:03:51 - ERROR - <domain> <hostname>]
  customer_id=IA12345678
  service=retirement_participation
  action=get_earned_points
  result=failure
  reason="Customer is already retired"
  role_required=RETIREMENT_PARTICIPANT

Even if we had no better tools than the log itself, clarity was enough to diagnose the problem and resolve it when it was necessary. There is always a way.

What I want to prove here is that whatever your environment and your technical constraints, logs can give you everything you need to work well. Keep that in mind and don’t choose the tools before the goal.

Thank you for reading.