In the previous article, I brought up a point that is rarely discussed: a bad log can be as dangerous as a bug in production.
That leads to a natural question:
If logs are so critical, what should we actually analyze?
In practice, when something fails, the first reaction of the team is to check the logs. That is where everyone expects to find answers.
The problem is that, most of the time, logs don’t deliver.
They are verbose: long framework stack traces, generic messages, and very little useful context. The information is there, but it is not actionable.
The result is predictable: time wasted filtering noise, difficulty finding the real point of failure, and often the need to reproduce the issue just to understand what happened.
This is not a tooling problem. It is a quality problem.
From a QA perspective, logs are not technical output. Logs are operational evidence, and evidence must be reliable.
Logs are not for developers. They are for the system to explain itself
When an incident happens, no one cares how the code was written.
The question is simple:
What happened?
If the log doesn’t answer that, someone will have to investigate manually. And that costs time, trust, and money.
Standards already exist. The problem is not using them
There is no lack of reference.
The most common and widely adopted approaches are 5W + H (what, where, when, who, why, how) and Event + Context + Outcome. Standards like OpenTelemetry and Elastic Common Schema reinforce the same idea: logs must be structured, contextualized, and traceable.
There is no complexity here. A good log describes an event with enough context, a clear outcome, and the ability to trace it.
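As a sketch of what "event + context + outcome" looks like in practice, here is a minimal structured logger using only Python's standard library. The field names (order_id, outcome, and so on) are illustrative assumptions, not part of any standard; OpenTelemetry and ECS define their own field names.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line: event + context + outcome."""
    # Context keys we promote from `extra=` into the payload (illustrative).
    CONTEXT_KEYS = ("order_id", "user_id", "outcome")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),  # when
            "level": record.levelname,             # severity
            "logger": record.name,                 # where
            "event": record.getMessage(),          # what
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorization declined",
            extra={"order_id": "ord-123", "user_id": "u-42",
                   "outcome": "order held for manual review"})
```

One line, machine-parseable, and it already answers what happened, to whom, and what the system did about it.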
What should be analyzed in logs
Flow reconstruction
A good log should allow you to understand the beginning, middle, and end of a flow. If that is not possible, there is an observability problem. Missing logs are blind spots.
Context
Logs must clearly show which entity was affected. Without identifiers like orderId, paymentId, or userId, there is no investigation.
Clarity
Messages must be direct and unambiguous. “Error processing” explains nothing. If you need to read the code to understand the log, it failed.
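A side-by-side sketch of the difference (the identifiers and the cause shown here are hypothetical):

```python
import logging

logger = logging.getLogger("billing")

# Ambiguous: forces the reader into the code to find out what "processing" was
logger.error("Error processing")

# Clear: names the operation, the affected entity, and the cause
logger.error(
    "invoice generation failed: tax service returned HTTP 503",
    extra={"invoice_id": "inv-991", "customer_id": "c-17"},
)
```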
Severity
Severity must reflect impact. When everything is INFO or everything is ERROR, the signal is lost. Logs should distinguish normal behavior, controlled issues, and real failures.
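One way to keep severity honest is to map the three categories above to levels explicitly, instead of leaving the choice to each call site. The event names here are assumptions for illustration:

```python
import logging

def severity_for(event_kind: str) -> int:
    """Pick a level that reflects impact, not the author's mood."""
    if event_kind == "expected_behavior":    # e.g. request handled normally
        return logging.INFO
    if event_kind == "handled_degradation":  # e.g. retry or fallback engaged
        return logging.WARNING
    return logging.ERROR                     # real failure with impact

logging.getLogger("inventory").log(
    severity_for("handled_degradation"),
    "cache miss, falling back to database",
)
```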
Traceability
In distributed systems, logs must be connected. Without traceId or correlationId, each log becomes an isolated piece, and isolated pieces don’t explain complex flows.
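Inside a single service, a correlation id can be propagated with stdlib tools alone. This sketch uses contextvars and a logging filter; in real distributed systems you would extract the id from an incoming header (e.g. W3C traceparent) instead of generating it:

```python
import contextvars
import logging
import uuid

# Holds the id for the current request flow; "-" when outside any flow.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation id to every record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # Set once at the entry point; every log in this flow carries the same id.
    correlation_id.set(str(uuid.uuid4()))
    logger.info("order received")
    logger.info("stock reserved")

handle_request()
```

Grep for one id and the whole flow comes back in order.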
Critical points
Logs must exist where risk exists: external integrations, state changes, key decisions, retries, and fallbacks. If logs appear only at the final error, it is already too late.
System behavior
Logs should explain what the system did after an event. Did it retry? Fallback? Abort? Without this, the diagnosis is incomplete.
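A retry-with-fallback sketch that logs the behavior, not just the failure. The function names and the flat-rate fallback are hypothetical:

```python
import logging
import time

logger = logging.getLogger("shipping")

def quote_with_retry(fetch_quote, attempts=3):
    """Log what the system *did*: each retry, then the fallback decision."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_quote()
        except ConnectionError as exc:
            logger.warning("quote provider unreachable, retry %d/%d: %s",
                           attempt, attempts, exc)
            time.sleep(0)  # real backoff elided in this sketch
    logger.error("quote provider failed after %d attempts, "
                 "using flat-rate fallback", attempts)
    return {"carrier": "fallback", "price": 9.99}
```

Reading these lines alone, an on-call engineer knows the system degraded gracefully, and exactly when.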
Impact
Knowing that something failed is not enough. Logs should show the impact: was the operation interrupted, was data affected, was the user impacted?
Noise
More logs do not mean better logs. Too much information can be as harmful as too little.
Sensitive data
Logs must not expose sensitive information such as passwords, tokens, or personal data. This is also a quality concern.
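One enforcement point is a redaction filter applied before any handler sees the record. The key=value pattern below is an illustrative assumption; adapt it to the shapes your payloads actually take, and treat it as defense in depth, not a substitute for never logging secrets in the first place:

```python
import logging
import re

class RedactFilter(logging.Filter):
    """Redact token-like values before records reach any handler."""
    TOKEN = re.compile(r"(password|token|secret)=\S+")

    def filter(self, record):
        record.msg = self.TOKEN.sub(r"\1=[REDACTED]", str(record.msg))
        return True
```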
Where QA should evaluate this
Logs should not be evaluated only in production.
Code review
Code reviews are usually done by developers, but log quality criteria must be present. Critical points should be logged, context must be sufficient, and messages must be clear. The role of QA is not to perform the review, but to ensure that these criteria exist and are applied.
Tests
Logs should be validated mainly in development-level tests, such as integration tests and, when necessary, unit tests. It is important to verify whether logs are generated in relevant scenarios, whether their content is correct, and whether unnecessary logs are produced.
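In Python, log assertions need no extra tooling: unittest ships assertLogs for exactly this. The payment function below is a hypothetical example of a scenario worth covering:

```python
import logging
import unittest

logger = logging.getLogger("payments")

def capture_payment(amount):
    """Hypothetical operation whose logs we want under test."""
    if amount <= 0:
        logger.error("payment rejected: non-positive amount (%s)", amount)
        return False
    logger.info("payment captured: %s", amount)
    return True

class PaymentLogTest(unittest.TestCase):
    def test_rejection_is_logged_with_reason(self):
        # The log is part of the contract: assert it exists and explains itself.
        with self.assertLogs("payments", level="ERROR") as captured:
            self.assertFalse(capture_payment(-5))
        self.assertIn("non-positive amount", captured.output[0])
```

pytest users get the same check via the caplog fixture; the principle is identical: the log message is an assertable output, not a side effect.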
At higher levels, such as E2E and API tests, logs act as support for diagnosis. They should help explain system behavior, allow flow correlation, and reduce the need to reproduce issues.
Incidents
Logs must also be evaluated during incidents. Did they help or slow things down? Were there blind spots?
The real problem
The problem is not the absence of logs. The problem is the absence of criteria.
Without criteria, each developer logs in a different way, each service tells a different story, and each incident becomes a manual investigation.
A simple question
Can you understand what happened without running the system again?
If the answer is no, there is a quality problem.
Final thoughts
Logs are not a technical detail. They are not debug. They are not optional.
Logs are part of the system, and they must be treated that way.
Otherwise, when they are most needed, they will fail.