Why, What, and How We Generate Logs (with OpenTelemetry)
But first
Before we talk about logging, it’s worth reading the Observability Primer by OpenTelemetry (OTEL). Their docs put it better than I ever could, and they’ve been working on this for many, many years. The purpose of this article is to ask some interesting questions about logging, collect answers from different corners of the web, and put them together. I’ll do my best to stay on track.
As OTEL puts it, telemetry is data emitted from a system about its behavior. This data usually comes in the form of logs, traces, and metrics.
If you’ve skipped the primer, this should get you going.
Why do we log?
Software systems inevitably follow Murphy’s Law: it’s not a question of “if” but “when”. There are two things you can do. One is to prepare for the inevitable point of failure. The other is to get back up after said failure. To get back up, you must know when and where you’ve fallen; only then can you address the why. Telemetry helps you determine exactly this.
Logs are one such form of telemetry data: a timestamped text record of an event with some metadata. If you’ve seen Cricinfo live commentary:
18:45, 0.1
1w
Jess Kerr to Wolvaardt, 1 wide
back of a length, swinging well down leg. Wolvaardt misses it and the umpire has no qualms in calling it a wide.
You can clearly see the timestamp, the event, and the metadata.
If each log entry follows a well-defined format, with values keyed by name, we have structured logs. If the format isn’t well-defined, we have unstructured logs.
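To make that concrete, here’s a minimal sketch using Python’s standard logging module, recording the same event both ways (the event and field names are made up for illustration):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

# Unstructured: easy to write, but a machine has to regex its way through it.
log.error("payment failed for user 4821 after 3 retries")

# Structured: the same event, with every piece of metadata under a named key.
log.error(json.dumps({
    "timestamp": "2025-01-15T18:45:00Z",
    "level": "error",
    "message": "payment failed",
    "user_id": 4821,
    "retries": 3,
}))
```

The second line is trivially queryable: “every payment failure for user 4821” becomes a key filter instead of a regex.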
Okay awesome, so logs help us get back up. What else?
Logs can help us audit, ensure compliance, and detect security threats.
If you’ve ever tried to debug a multi-component system (as most systems are these days), then I’m sure you don’t need much further explanation on why these things are a necessity. When a software system is sick, you can observe the symptoms via logs. You’ll still need to sit down and find the root cause, but it helps a great deal when the patient reporting the discomfort can elaborate a little on that discomfort.
What to log?
Footing that gory storage bill is not fun, but neither is skimping on logs only for it to come back and bite you later. It is therefore important to consider what to log.
This really varies application to application, so please take this with a grain of salt.
Also, we can only log what we are legally allowed to log: logging sensitive customer data is not a great idea. Logging too much isn’t great either, since the application has to spend effort producing every log line. So we have to evaluate this from first principles: say only as much as required. A log message should be the minimum description that uniquely identifies the moment while conveying as much information as possible. A good starting point would be the AWS guide.
In a structured format, the following fields are useful:
- Timestamp
- Type of log (e.g., info, debug, error)
- Message
- Context
The context is what ties a log into the wider system. For example, with a unique user ID we can stitch together a string of logs, traces, and metrics and see exactly where things went wrong. And since structured logs all share the same shape, we can store them without any surprise costs.
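As a sketch of how those four fields might be emitted consistently, here’s a small formatter built on Python’s standard logging module (the JsonFormatter name and the context field are my own choices, not a standard):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render every record with the four fields discussed above."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Context: any extras attached to the record at the call site.
            "context": getattr(record, "context", {}),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"context": {"user_id": "u-4821", "order_id": "o-1009"}})
```

Because every line now has the same shape, a log backend can index and query it without per-message heuristics.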
As with anything, more log messages don’t mean more usefulness. In fact, it can mean the opposite, since we have to wade through all the clutter.
Standards
Logs should be meaningful and help us get to the why, rather than just point at the what and when. It therefore becomes important for telemetry to paint a full picture: we don’t treat logs in isolation but as part of a whole, along with traces and metrics.
With structured logs, we can assign keys that are common across flows, so the telemetry can be stitched together; it also makes the logs machine-readable and queryable. OTEL solves the problem of inconsistent formats across logs, traces, and metrics. When every request carries a trace ID and every log line records it, you can jump from a failing endpoint in Grafana straight into the exact log event that caused it. In other words, good logs don’t just describe problems, they help connect the dots.
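Here’s a rough sketch of that correlation, assuming the opentelemetry-api and opentelemetry-sdk Python packages are installed; a real setup would more likely use OTEL’s logging instrumentation to inject trace IDs automatically rather than stamping them by hand:

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# A minimal tracer setup; production would also configure exporters.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout")

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

with tracer.start_as_current_span("place-order") as span:
    ctx = span.get_span_context()
    # Stamp the active trace ID onto the log line, so this event can be
    # joined back to its trace in a backend such as Grafana.
    log.info("inventory check failed trace_id=%032x", ctx.trace_id)
```

A backend that indexes trace_id can then join this log line to the span that produced it.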
One final note: preserve and archive logs intelligently. We often discard logs prematurely, even though there’s a lot of intel to be had from them, and they may be useful for retrospective inspections down the line. Eventually, your logs might even weave a wonderful story about dreadful incidents. Lest we forget.