Discussion on: Logging done right, makes your life bright!

View post

Having also worked on embedded systems for a good number of years including those that end up in space (GDB definitely not an option!), I now encourage a slightly different approach:

Write debugging / logging user stories, this helps the team keep in mind why they are adding log lines..
Never have a run-time module/level selection mechanism, because that elusive bug will only show up once (en.wikipedia.org/wiki/Heisenbug), instead use a log filter to achieve the same results, while retaining the detail when it's needed.
Maybe think of logging / debugging working like a hardware logic analyser, keep as much recent info in a circular buffer as you can afford storage, take snapshots when 'interesting' events occur (like crashes!), that way you have the detail leading up to the event, but are not overwhelmed by volume, nor exhaust storage when everything is working. We caught two live bugs from space using this technique :)

Shashank Sabniveesu • Nov 8 '17

Good suggestions Phil! I was initially interested to find out what Gajanan meant by

So we used a file watcher to watch the config file for changes and whenever a change was detected, reloaded the logger module config in memory.

But you seem to be advising something different. Can you give an example of how you used

a log filter to achieve the same results
?

Phil Ashby • Nov 8 '17

Hi Shashank, Gajanan makes a good point about being able to adjust logging levels at runtime /while an application is exhibiting a fault/, he says: "once the application is in some problematic state, it would remain in that state".

This is an assumption that is not always true (sadly!), hence my approach of creating a logging system as a detailed (debug level) history recorder albeit with limited capacity. Provided faults are detectable (eg: system crash, exceptions), then the debug level history buffer can be saved at that point, hopefully with enough duration & detail to find the root cause. If you are familiar with the concept of core dump files, this is similar, but more manageable than a dirty great memory blob, and with event history, not just a memory snapshot at the point of failure.

If, in addition to crash traces, we also want to create a long term activity log, we would apply a filter (eg: log levels, log sources or both) to the history buffer, and emit these filtered events to suitable storage (eg: log file).

Such a filter can respond to configuration changes in much the same way that Gajanan describes for setting a 'log config' at source, that is: detecting file changes and reloading itself, thereby changing the activity log content. The important difference is that the debug level history buffer and crash dumps are not affected by changes to an activity filter.

FYI: In one system, we had a separate management card that held the event history in RAM for a number of daughter boards, and filtered this information to disk storage. The management card had a CLI, so activity logging could be managed interactively as well as via configuration files (Cisco IOS has similar features).

Gajanan • Dec 29 '17

I mostly agree with Phill. However unfortunate it is, the following is true! 1. It may be very hard to reproduce an issue with proper log settings. 2. System may not remain in the problematic state forever.

But like I mentioned the size of the log file on some systems are limited. We have a rotating log file of 50MB size. You may not believe me when I say 50MB is not sufficient to capture the logs for 30 minutes when we set log level to debug/trace. We don't have a way of backing up these logs to some server to analyse later. What we have is just 50MB! So we have to apply filters at the generation point!

We have to smartly design our logging system after considering all the limitations of the target system!

Phil Ashby • Dec 30 '17

Sucks to be limited to 50MB and no way to siphon off to longer term storage! How about allocating some of that space to a short term circular log at debug level, enough to capture a few seconds of history or a couple of cycles through the main application loop (probably <1MB?), then as/when bad things happen, copy that buffer to a new file and raise the alarm... you'll likely get 50+ chances to see what's going wrong before running out of space and having to remove the oldest logs.