Observability is not just about the tool

#architecture #design #monitoring

Logs have been with us since the dawn of coding, and with good reason.
Logging helps us debug during development, understand failures in pre-production and production environments, and observe our systems' proper functionality in general.

With the recent microservices and cloud computing trend, there was a need to create centralized logging solutions that would process logs from hundreds of microservices in real-time.
There are many solutions out there. Some are open-source like ELK; others are paid solutions like Sumo Logic, Coralogix, Logz.io, Loggly, and more.

These services indeed solve many of the issues related to log management and the monitoring of distributed systems. However, how we formulate logs is up to us, the developers.

Logs are, in fact, streams of events.
Reading events from the beginning to a certain point helps us understand the state of a specific snapshot in time.
Log events have metadata derived from the log technical context.
Some examples include: host, application name (ex. microservice name), and receive time.
Another metadata point is the log level, which is set by us, the developers.
Finally, there's the "payload" – a schemaless blob of data – a conversation between the developer to their future self.
But should it be schemaless?

Here’s an example of a log (example 1):

log.debug("Fetched {} email records from DB. Query={}, Params={}",
     new Object[] { emails.size(), query, query.getParams() });

The log event that is the output of this code repeats itself every time; however, the values will most likely change.
While it looks like a schemaless string, it has a repeating pattern.
Using Sumo Logic, ELK, Coralogix, or even good old Grep, we can consider "Fetched {} email records from DB." as a Number field - emailRecordsCount, "Query={}," as a String field called query and Params can be parsed to a list of String values.

Parsing logs into fields in any logging system is the key feature that lets us achieve observability.
After parsing logs into fields, the stream of events transforms into a table. Using this table, we can perform aggregations and analyze things like error count, event streaming latency, login statistics, and many more.
With aggregations, we can build dashboards and set alerts, which are essential to our business continuity.

While we established that every log event is a repeating pattern, there are serious issues with the simple logging approach:

Hard to parse While there is a pattern, it is hard to parse it because the developers don’t declare their intentions. Sometimes you'll need complex regex to parse.
Language & Context Sometimes, it's hard to understand the log meaning and context of the log. Day-to-day logging is like a conversation between a developer and their future self. In many cases, only the developer understands and knows the context of the log event. As time goes by, a developer needs to constantly check their code to understand the original context, the format, and their original intentions when writing the event log.
A tendency for change Most developers don’t cover their logging with unit tests. Changing a log event never changes a feature behavior, so end-to-end tests don’t cover it as well. We can build alerts and meaningful dashboards in services like Sumo Logic, but there's an inherent possibility that it will break, and we won't know it.

Here's how we at Behalf solved these issues:

Using formats like key=value or JSON makes the job of parsing into fields much more straightforward, and with many services (SumoLogic, Loggly... ), you gain auto-parsing abilities.

Just choosing a format is not enough, though.
Take the log event from the example above.
Writing it in JSON format would look something like this:

{
    "emailRecordsCount":"123",
    "query":"SELECT email from people",
    "params": {"firstName":"Jhon"}
 }

But while one developer will write it this way on one microservice, another developer will probably use different names for the fields, like emailCount or query_params on their microservice.
The log event should not be intimate.
For this purpose at Behalf, we distinguish between "Debug logs" and "Audit logs.”

Debug logs
These are the logs similar to the log in example 1.
In our methodology, these kinds of logs have Debug level and are used during development.
There's no need for a particular structure; they can be very verbose, they tell a story that the developer who wrote them knows.
They will be seen in development environments, and CI (Continuous integration) runs, but they won't appear in Staging and Production.
They can change without any worry since they won't appear in production in the first place, and no one will establish any monitoring aspects on top of them.
Pros:

It's straightforward and fast to code these lines.

Cons:

Everything I wrote above (hard to parse, language and context, a tendency to change)

Audit Logs
We created a core library service called AuditService that can be used by all microservices.
The AuditService is a simple abstract class that expects any extension of a BaseAuditRecord.

public abstract class AuditService<R extends BaseAuditRecord> { 
   public void audit(R auditRecord) {
   }
}

The BaseAuditRecord and its extensions are basic POJOs that represent the schema of the event that we want to audit.
The AuditService.audit() method receives the record object and logs it using JSON format (serialize the record into JSON).
While the BaseAuditRecord provides some basic properties like status, duration, etc., we still want to create different audit layers for various purposes.
First and foremost, we use AuditService extensions in our different Core libraries and platforms that run all our microservices.
For example, to audit our persistence layer, we extend BaseAuditRecord with AuditDbRecord.
AuditDbRecord has the following properties:

String operation (e.g., SELECT, UPDATE...)
String query
int numRows

We extend AuditService with AuditDbService
public class AuditDb extends AuditService<AuditDbRecord> {}

The output appears like this:
2021-01-21 21:30:23.981 INFO 20 [sessionId=, principalId=, flowId=, traceId=5a139a230e15de3c, spanId=5a139a230e15de3c] --- [io-8092-exec-15] c.behalf.core.persistence.audit.AuditDb : {"status":"OK","duration":2,"operation":"SELECT","type":"READ","numRows":0,"table":"agreement","queryName":"findByCustomerId"}

Pros:

Testability - it can be easily unit tested. Schema changes are significant and apparent. Suppose we need to change AuditService.audit() lines, we check first if an alert or a dashboard uses it.
The need to create the schema forces the developer to consider observability and monitoring in post-production while in the early development phases. The developer has to determine what tech support teams and DevOps teams will get in terms of logs when the feature goes live.
Platform thinking - Creates a uniform understanding of the auditing layers, mostly if it belongs to the platform. If I'm a developer in team A, I would easily understand team B's services logs because the format, language, and context are the same. The same goes for quality engineers, tech support, and even Product teams - everyone can understand the platform audit language.
Searching, parsing, and aggregating becomes easy. For example, at Behalf, we use Sumo Logic. If I want to check the average latency on event consumption on all microservices in production, I will do this search in Sumo:

_sourceCategory=prod* AuditBusinessEvents* 
| json auto
| avg (latency)

It searches production logs for all log lines created by the AuditBusinessEvents class, using Sumo's 'json auto' feature to auto parse the JSON fields, and then it aggregates to get the average latency within the time frame.
Again, no need to be a coder to search and process these logs!!

Cons:

It takes a little bit more time to implement than regular logs.
People might find the output not verbose enough (well, that’s the purpose).

We use the AuditService to log any fundamental process activity, whether it's a core/platform process or a microservice business logic layer process.
With our structured, testable logs, we gain complete observability, we recognize weak spots, understand our business flows, and easily create alerts and dashboards that don't break.

We established an observability practice that doesn't break and always extends as more features are introduced into our system.

log.error ("Oh no, we reached the end of the post");