A normal workday. The system suffers a critical failure. The whole team rushes to check the logs, only to find they are full of unnecessary information, which makes it hard to pinpoint the real point of failure. In other cases, the logs carry too little information, forcing the Dev and/or QA to debug the system, recreating the exact conditions under which the error happened just to find the cause. In the worst situations, there is no log at that point in the system at all. Who has never experienced something like this?
Logs are rarely treated as critical points during architecture, code review, and validation phases. They should be. Logs are like home insurance: you only care when something bad happens. In my more than 20 years of experience, I have rarely seen teams treat logs the way they should.
From a QA perspective, analyzing logs often feels like being Indiana Jones — doing archaeology work. Most of the time, logs are unstructured and unfocused. It becomes a tiring task: unclear messages, confusing stack traces, unnecessary or missing information. A lot of time is lost on something that should be fast. This directly impacts response time to customers, planning, and delivery of the fix. That is why QA often feels discouraged from doing this work.
There is usually no standard for logs in companies. Each product uses its own structure, naming, content, and data masking. On top of that, many Devs lack the experience — or even the technical knowledge — to apply these patterns correctly, and, above all, to know where in the code logs should be added.
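A team standard can start small. As a sketch, a shared masking helper ensures sensitive values never reach the logs in clear text — the field list and function name below are illustrative assumptions, not an established convention:

```python
# Assumed per-team policy: which fields must never be logged in clear text.
SENSITIVE_FIELDS = {"card_number", "cvv", "password"}

def mask_fields(event: dict) -> dict:
    """Return a copy of a log event with sensitive values redacted."""
    return {
        key: "***" if key in SENSITIVE_FIELDS else value
        for key, value in event.items()
    }

masked = mask_fields({"card_number": "4111111111111111", "amount": 99.90})
# card_number is redacted; non-sensitive fields pass through unchanged
```

Centralizing this in one helper, instead of masking ad hoc at each call site, is exactly the kind of pattern a company-wide log standard would define.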
Now imagine this scenario in a critical system for a company. For example, a marketplace microservice responsible for charging credit cards goes down during Black Friday. What would be the team’s response time? How confident would they be that they are looking at the correct point that caused the failure?
Logs should be planned during architecture, refined during development, and also used as a quality criterion in code review and tests.
A good log should answer, at least, these questions:
1- What happened?
2- Where did it happen?
3- In which context (which entity was affected)?
4- What was the result (success, failure, fallback)?
5- What was the impact on the system or the business?
6- Why did it happen (when possible)?
7- What did the system do after that (retry, fallback, abort)?
8- How can I trace this event (traceId, correlationId)?
9- Is this expected or abnormal behavior?
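A minimal sketch of a structured log event that answers most of these questions might look like the following — all field names and values (`PaymentService`, `trace_id`, the order IDs) are illustrative assumptions, not a standard:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("payments")

def log_event(level, **fields):
    """Emit one JSON log line so every field stays machine-parseable."""
    line = json.dumps(fields, sort_keys=True)
    logger.log(level, line)
    return line

# Illustrative event: a failed credit-card charge followed by a retry.
log_event(
    logging.ERROR,
    what="charge_failed",                               # 1. what happened
    where="PaymentService.charge_card",                 # 2. where it happened
    context={"order_id": "ord-123", "customer_id": "cus-456"},  # 3. affected entity
    result="failure",                                   # 4. result
    impact="order not charged; checkout blocked",       # 5. business impact
    cause="acquirer timeout after 30s",                 # 6. why (when known)
    action="retry scheduled (attempt 2 of 3)",          # 7. what the system did next
    trace_id="9f1c2ab0",                                # 8. traceability
    expected=False,                                     # 9. expected vs abnormal
)
```

One JSON object per event means the team can grep, filter, and aggregate by field during an incident instead of parsing free-form prose under pressure.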
Log is not debug.
Log is operational evidence.
And bad evidence leads to wrong decisions.