In the previous article, I brought up a point that is rarely discussed: a bad log can be as dangerous as a bug in production.
That leads to a natural question:
If logs are so critical, what should we actually analyze?
In practice, when something fails, the first reaction of the team is to check the logs. That is where everyone expects to find answers.
The problem is that, most of the time, logs don’t deliver.
They are verbose: long framework stack traces, generic messages, and very little useful context. The information is there, but it is not actionable.
The result is predictable: time wasted filtering noise, difficulty finding the real point of failure, and often the need to reproduce the issue just to understand what happened.
This is not a tooling problem. It is a quality problem.
From a QA perspective, logs are not technical output. Logs are operational evidence, and evidence must be reliable.
Logs are not for developers. They are for the system to explain itself
When an incident happens, no one cares how the code was written.
The question is simple:
What happened?
If the log doesn’t answer that, someone will have to investigate manually. And that costs time, trust, and money.
Standards already exist. The problem is not using them
There is no lack of reference.
The most common and widely adopted approaches are 5W + H (what, where, when, who, why, how) and Event + Context + Outcome. Standards like OpenTelemetry and Elastic Common Schema reinforce the same idea: logs must be structured, contextualized, and traceable.
There is no complexity here. A good log describes an event with enough context, a clear outcome, and the ability to trace it.
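As a sketch of what "event + context + outcome" looks like in practice, here is a minimal structured logger using only Python's standard library. The field names (order_id, outcome, and so on) are illustrative assumptions, not part of any standard; OpenTelemetry and ECS define their own field names.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line: event + context + outcome."""
    # Context keys we promote from `extra=` into the payload (illustrative).
    CONTEXT_KEYS = ("order_id", "user_id", "outcome")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),  # when
            "level": record.levelname,             # severity
            "logger": record.name,                 # where
            "event": record.getMessage(),          # what
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorization declined",
            extra={"order_id": "ord-123", "user_id": "u-42",
                   "outcome": "order held for manual review"})
```

One line, machine-parseable, and it already answers what happened, to whom, and what the system did about it.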
What should be analyzed in logs
Flow reconstruction
A good log should allow you to understand the beginning, middle, and end of a flow. If that is not possible, there is an observability problem. Missing logs are blind spots.
Context
Logs must clearly show which entity was affected. Without identifiers like orderId, paymentId, or userId, there is no investigation.
Clarity
Messages must be direct and unambiguous. “Error processing” explains nothing. If you need to read the code to understand the log, it failed.
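A side-by-side sketch of the difference (the identifiers and the cause shown here are hypothetical):

```python
import logging

logger = logging.getLogger("billing")

# Ambiguous: forces the reader into the code to find out what "processing" was
logger.error("Error processing")

# Clear: names the operation, the affected entity, and the cause
logger.error(
    "invoice generation failed: tax service returned HTTP 503",
    extra={"invoice_id": "inv-991", "customer_id": "c-17"},
)
```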
Severity
Severity must reflect impact. When everything is INFO or everything is ERROR, the signal is lost. Logs should distinguish normal behavior, controlled issues, and real failures.
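One way to keep severity honest is to map the three categories above to levels explicitly, instead of leaving the choice to each call site. The event names here are assumptions for illustration:

```python
import logging

def severity_for(event_kind: str) -> int:
    """Pick a level that reflects impact, not the author's mood."""
    if event_kind == "expected_behavior":    # e.g. request handled normally
        return logging.INFO
    if event_kind == "handled_degradation":  # e.g. retry or fallback engaged
        return logging.WARNING
    return logging.ERROR                     # real failure with impact

logging.getLogger("inventory").log(
    severity_for("handled_degradation"),
    "cache miss, falling back to database",
)
```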
Traceability
In distributed systems, logs must be connected. Without traceId or correlationId, each log becomes an isolated piece, and isolated pieces don’t explain complex flows.
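Inside a single service, a correlation id can be propagated with stdlib tools alone. This sketch uses contextvars and a logging filter; in real distributed systems you would extract the id from an incoming header (e.g. W3C traceparent) instead of generating it:

```python
import contextvars
import logging
import uuid

# Holds the id for the current request flow; "-" when outside any flow.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation id to every record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # Set once at the entry point; every log in this flow carries the same id.
    correlation_id.set(str(uuid.uuid4()))
    logger.info("order received")
    logger.info("stock reserved")

handle_request()
```

Grep for one id and the whole flow comes back in order.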
Critical points
Logs must exist where risk exists: external integrations, state changes, key decisions, retries, and fallbacks. If logs appear only at the final error, it is already too late.
System behavior
Logs should explain what the system did after an event. Did it retry? Fallback? Abort? Without this, the diagnosis is incomplete.
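A retry-with-fallback sketch that logs the behavior, not just the failure. The function names and the flat-rate fallback are hypothetical:

```python
import logging
import time

logger = logging.getLogger("shipping")

def quote_with_retry(fetch_quote, attempts=3):
    """Log what the system *did*: each retry, then the fallback decision."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_quote()
        except ConnectionError as exc:
            logger.warning("quote provider unreachable, retry %d/%d: %s",
                           attempt, attempts, exc)
            time.sleep(0)  # real backoff elided in this sketch
    logger.error("quote provider failed after %d attempts, "
                 "using flat-rate fallback", attempts)
    return {"carrier": "fallback", "price": 9.99}
```

Reading these lines alone, an on-call engineer knows the system degraded gracefully, and exactly when.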
Impact
Knowing that something failed is not enough. Logs should show the impact: was the operation interrupted, was data affected, was the user impacted?
Noise
More logs do not mean better logs. Too much information can be as harmful as too little.
Sensitive data
Logs must not expose sensitive information such as passwords, tokens, or personal data. This is also a quality concern.
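One enforcement point is a redaction filter applied before any handler sees the record. The key=value pattern below is an illustrative assumption; adapt it to the shapes your payloads actually take, and treat it as defense in depth, not a substitute for never logging secrets in the first place:

```python
import logging
import re

class RedactFilter(logging.Filter):
    """Redact token-like values before records reach any handler."""
    TOKEN = re.compile(r"(password|token|secret)=\S+")

    def filter(self, record):
        record.msg = self.TOKEN.sub(r"\1=[REDACTED]", str(record.msg))
        return True
```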
Where QA should evaluate this
Logs should not be evaluated only in production.
Code review
Code reviews are usually done by developers, but log quality criteria must be present. Critical points should be logged, context must be sufficient, and messages must be clear. The role of QA is not to perform the review, but to ensure that these criteria exist and are applied.
Tests
Logs should be validated mainly in development-level tests, such as integration tests and, when necessary, unit tests. It is important to verify whether logs are generated in relevant scenarios, whether their content is correct, and whether unnecessary logs are produced.
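In Python, log assertions need no extra tooling: unittest ships assertLogs for exactly this. The payment function below is a hypothetical example of a scenario worth covering:

```python
import logging
import unittest

logger = logging.getLogger("payments")

def capture_payment(amount):
    """Hypothetical operation whose logs we want under test."""
    if amount <= 0:
        logger.error("payment rejected: non-positive amount (%s)", amount)
        return False
    logger.info("payment captured: %s", amount)
    return True

class PaymentLogTest(unittest.TestCase):
    def test_rejection_is_logged_with_reason(self):
        # The log is part of the contract: assert it exists and explains itself.
        with self.assertLogs("payments", level="ERROR") as captured:
            self.assertFalse(capture_payment(-5))
        self.assertIn("non-positive amount", captured.output[0])
```

pytest users get the same check via the caplog fixture; the principle is identical: the log message is an assertable output, not a side effect.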
At higher levels, such as E2E and API tests, logs act as support for diagnosis. They should help explain system behavior, allow flow correlation, and reduce the need to reproduce issues.
Incidents
Logs must also be evaluated during incidents. Did they help or slow things down? Were there blind spots?
The real problem
The problem is not the absence of logs. The problem is the absence of criteria.
Without criteria, each developer logs in a different way, each service tells a different story, and each incident becomes a manual investigation.
A simple question
Can you understand what happened without running the system again?
If the answer is no, there is a quality problem.
Final thoughts
Logs are not a technical detail. They are not debug. They are not optional.
Logs are part of the system, and they must be treated that way.
Otherwise, when they are most needed, they will fail.