<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom Otvos</title>
    <description>The latest articles on DEV Community by Tom Otvos (@tomotvos).</description>
    <link>https://dev.to/tomotvos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F578253%2F766e9b36-4afc-4900-83d7-9100206734a6.jpeg</url>
      <title>DEV Community: Tom Otvos</title>
      <link>https://dev.to/tomotvos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tomotvos"/>
    <language>en</language>
    <item>
      <title>Become a Toolmaker</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 25 Jul 2021 23:19:03 +0000</pubDate>
      <link>https://dev.to/tomotvos/become-a-toolmaker-25e3</link>
      <guid>https://dev.to/tomotvos/become-a-toolmaker-25e3</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RQEQuwjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1505582866941-6788e0205dd2%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDN8fHRvb2wlMjBtYWtlcnxlbnwwfHx8fDE2MjcyNTQ3NjI%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RQEQuwjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1505582866941-6788e0205dd2%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDN8fHRvb2wlMjBtYWtlcnxlbnwwfHx8fDE2MjcyNTQ3NjI%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Become a Toolmaker" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In previous posts, I have discussed &lt;a href="https://debugmsg.io/reducio"&gt;reducing a problem&lt;/a&gt; into its simplest form, and possibly even writing a representative executable to (a) demonstrate the problem, and (b) validate a fix. Today I would like to discuss a variation on that: writing code to simplify the &lt;strong&gt;diagnosis&lt;/strong&gt; of recurring problems.&lt;/p&gt;

&lt;h3&gt;The Known Issue&lt;/h3&gt;

&lt;p&gt;In larger organizations, where the code pipeline is strictly controlled – the antithesis of CI/CD – there will invariably be cases of "the known issue": a problem in shipping code that has been identified, possibly even with a fix in the works. When a customer reports a problem, triage will require a fair bit of diagnosis to clearly categorize it as the known issue, and to make sure it is not something else entirely. How it is categorized then defines the steps that follow.&lt;/p&gt;

&lt;p&gt;As discussed &lt;a href="https://debugmsg.io/debugger-toolbelt-bbedit/"&gt;several weeks ago&lt;/a&gt;, making use of tools to chew on logs is standard fare for a master debugger. But when you are repeatedly triaging bug reports and doing the same steps over and over, then no matter how powerful your toolset is, the process becomes very tiring.&lt;/p&gt;

&lt;h3&gt;Make your tools&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;tool (&lt;em&gt;noun&lt;/em&gt;): a device or implement, especially one held in the hand, used to carry out a particular function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The whole point of tools is to make your life simpler. Sure, you could try to push a nail in with your hand, but a hammer makes it considerably easier. Likewise, when you are doing the same triaging over and over, and your workflow involves multiple steps and multiple tools, it makes sense to consider writing a &lt;strong&gt;new&lt;/strong&gt; tool that does the exact analysis you want.&lt;/p&gt;

&lt;p&gt;Consider this scenario. The normal triaging for "customer missing data" is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Tool 1 to grab logs from an archive server for the date in question. Note that because the servers are load-balanced, there are multiple logs. Also, different customers are on different sets of servers, so we need to identify which specific logs to pull.&lt;/li&gt;
&lt;li&gt;Export logs from Tool 1, and feed into Tool 2. This involves a network copy to move files from one domain (production) to another.&lt;/li&gt;
&lt;li&gt;Use Tool 2 to filter logs for a specific customer. The customer is not directly identified in the logs, for security reasons, so you need to make a trip to Database 1 to get the identification "key" for the customer.&lt;/li&gt;
&lt;li&gt;From filtered results, examine timestamps to look for gaps around time of missing data. Also look for logged exceptions around the time in question.&lt;/li&gt;
&lt;li&gt;If this is "the known issue", then also look at execution times of successfully logged items, since it is known that if something takes too long to process, the customer application will time out.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After doing this workflow dozens of times, you will get pretty good at it. You'll have saved queries to do the database lookups, and possibly a bit of automation in your log processing tools. Nonetheless this is tedious, and a poor use of your skills.&lt;/p&gt;

&lt;p&gt;But now consider that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A simple command-line tool can do database lookups very easily.&lt;/li&gt;
&lt;li&gt;A simple command-line tool can grab files from an archive server very easily.&lt;/li&gt;
&lt;li&gt;A simple command-line tool can be tuned to filter for a specific customer (passed on the command line), do "gap analysis", parse log entries to watch execution times, and spit out "pass/fail" results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A simple command-line tool can be handed off to anyone to execute.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
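&lt;p&gt;To make this concrete, here is a minimal Python sketch of what the "gap analysis" part of such a tool might look like; the log format and threshold here are purely hypothetical:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Minimal sketch of the "gap analysis" pass such a tool might perform.
# Hypothetical assumption: each log line starts with an ISO 8601 timestamp.
THRESHOLD = timedelta(minutes=5)

def find_gaps(lines, threshold=THRESHOLD):
    """Return (before, after) timestamp pairs where consecutive
    log entries are suspiciously far apart."""
    stamps = [datetime.fromisoformat(line.split(" ", 1)[0]) for line in lines]
    return [
        (earlier, later)
        for earlier, later in zip(stamps, stamps[1:])
        if later - earlier > threshold
    ]

sample = [
    "2021-07-17T22:00:00 request processed",
    "2021-07-17T22:01:00 request processed",
    "2021-07-17T22:10:00 request processed",  # 9-minute gap before this one
]
for earlier, later in find_gaps(sample):
    print(f"FAIL: no entries between {earlier} and {later}")
```

&lt;p&gt;Wrap something like that with the database lookup and the archive fetch, add a customer argument, and steps 1 through 5 become a single command anyone can run.&lt;/p&gt;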

&lt;p&gt;Now there is always a balance of effort vs. reward, and clearly writing a tool from scratch will take some time. But in my experience, that time is usually far less than you think, because it is &lt;strong&gt;just a tool&lt;/strong&gt; and so you can cut whatever corners you need to get the job done. And if your manual workflow takes "x" minutes, how many x's do you need to accumulate before the tool pays for itself?&lt;/p&gt;

&lt;p&gt;My guess is not many.&lt;/p&gt;




&lt;h3&gt;println&lt;/h3&gt;

&lt;p&gt;Here are some debugging-related posts I came across this week that caught my eye. I don't do a lot of JS and Node.js work these days, but I have in the past. Maybe these will be useful to you.&lt;/p&gt;

&lt;p&gt;(Yes, they are both from &lt;a href="https://dev.to"&gt;dev.to&lt;/a&gt;, but that is just coincidence.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/siddharthshyniben/debug-nodejs-natively-35ba"&gt;How to debug Node.js using the builtin debugger&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It’s about time you stop console.loging to debug code. Here, I’ll show you how to use the builtin...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev.to/codepo8/debugging-javascript-dom-css-and-accessing-the-browser-console-without-leaving-visual-studio-code-250i"&gt;Debugging JavaScript, DOM, CSS and accessing the browser console without leaving Visual Studio Code&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With the new in-built JavaScript debugger, you can easily do all the tasks needed to debug in the browser without leaving VS Code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;As a final word, I would welcome any feedback on this or any of my other posts. If you like what you are reading, or not, please let me know. I would also like to put out a call for things that you would like to read here. One thing I am considering is a "The Debugger Is In" kind of feature, where you submit debugging problems and we work through them and summarize them here. Half-baked at present, but it might work.&lt;/p&gt;

</description>
      <category>debugger</category>
    </item>
    <item>
      <title>Time Warp</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 18 Jul 2021 13:39:10 +0000</pubDate>
      <link>https://dev.to/tomotvos/time-warp-4340</link>
      <guid>https://dev.to/tomotvos/time-warp-4340</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5MTNAseO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1508962914676-134849a727f0%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDEyfHx0aW1lfGVufDB8fHx8MTYyNjU3MTg5NA%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5MTNAseO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1508962914676-134849a727f0%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDEyfHx0aW1lfGVufDB8fHx8MTYyNjU3MTg5NA%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Time Warp" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's talk about time.&lt;/p&gt;

&lt;p&gt;Time has a way of going by, really fast. Which is why I am dismayed that it has been so long since my last post. That is definitely not my intent, but when the grind gets grinding, it is tough to carve out time to share my thoughts.&lt;/p&gt;

&lt;p&gt;But enough navel gazing, let's get down to time and how it impacts debugging.&lt;/p&gt;

&lt;h3&gt;The timeline&lt;/h3&gt;

&lt;p&gt;I have mentioned before that time is a critical piece of data when debugging a problem, especially in a distributed system where you cannot easily just "step through code". The &lt;strong&gt;timeline of events&lt;/strong&gt;, captured in logs and database records, is often the only way you can see the &lt;a href="https://debugmsg.io/if-a-tree-falls/"&gt;big picture&lt;/a&gt; of what is going on, in order to &lt;a href="https://debugmsg.io/reducio"&gt;narrow down&lt;/a&gt; the problem and get to the root cause.&lt;/p&gt;

&lt;p&gt;So obviously it is important to capture time as a part of debug output. But how you do that is just as important as doing the thing itself, so let's double-click on that.&lt;/p&gt;

&lt;h3&gt;3...2...1...0&lt;/h3&gt;

&lt;p&gt;You know how in those heist movies, the team synchronizes their watches to make sure the laser beams turn off just as the acrobat is shimmying in through the air vent? Synchronization ensures each team member does their thing exactly when they are supposed to. Your watch is 5 seconds too fast? Alarms go off!&lt;/p&gt;

&lt;p&gt;When you are debugging across multiple systems, it is vitally important that the clocks on those systems are in sync, so that logged timestamps can be easily correlated. If they are not, it is impossible to know for sure whether "this" happened before, or after, "that". To successfully debug in that case, you would need to know how fast or slow each system's clock runs, and then do the math on every time value. Painful.&lt;/p&gt;

&lt;p&gt;These days, this is far less of an issue than even a few years ago, since so much runs in the cloud, and on-premises servers generally reach out to network time servers to keep their clocks correct. But older systems still exist, so it is important to check this off your list.&lt;/p&gt;

&lt;h3&gt;UTC&lt;/h3&gt;

&lt;p&gt;Even if clocks are in sync, there is the pesky issue of time zones. Customers (if you are lucky) are spread across the world. Even a single larger customer may be operating in multiple time zones. So debugging issues can become complicated if you have to do the time zone math in your head as you analyze logs.&lt;/p&gt;

&lt;p&gt;Oh, and don't forget Daylight Saving Time, if you happen to be looking at logs from March or November!&lt;/p&gt;

&lt;p&gt;For these reasons, you should really do yourself a favour and track time in UTC.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unless you are absolutely positive that your software will never run in a different time zone, UTC makes all time issues go away.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sure, you need to know what time zone a customer is in so you can translate their bug reports ("this thing happened at 9 am"), but after that you should have no problem looking at logs across systems and geographies to build your timeline.&lt;/p&gt;
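&lt;p&gt;In Python, for instance, switching all log timestamps to UTC is a one-line change; here is a minimal sketch (the format string is just an illustration):&lt;/p&gt;

```python
import logging
import time

# Emit log timestamps in UTC so records from different machines
# and time zones can be correlated directly.
logging.Formatter.converter = time.gmtime

logging.basicConfig(
    format="%(asctime)s.%(msecs)03dZ %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.INFO,
)
logging.info("request received")  # timestamp is now UTC, marked with a Z
```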

&lt;h3&gt;Unix time&lt;/h3&gt;

&lt;p&gt;There was a time (heh, heh) when I hated Unix time. For those who don't know, Unix time is the number of seconds since midnight UTC on January 1, 1970, although some systems emit those values in milliseconds, microseconds, or even nanoseconds (!!!).&lt;/p&gt;

&lt;p&gt;But seconds are most common and, because a huge amount of code written with Unix time still uses 32-bit integers, those applications will have a bit of a problem on January 19, 2038, when those integers overflow.&lt;/p&gt;
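&lt;p&gt;A few lines of Python show exactly when that happens: the largest value a signed 32-bit integer can hold maps to one specific UTC instant.&lt;/p&gt;

```python
from datetime import datetime, timezone

# The largest value a signed 32-bit time_t can hold.
max_32bit_seconds = 2**31 - 1  # 2,147,483,647 seconds after the epoch

rollover = datetime.fromtimestamp(max_32bit_seconds, tz=timezone.utc)
print(rollover)  # 2038-01-19 03:14:07+00:00 -- one second later, overflow
```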

&lt;p&gt;So yeah, Unix time is a bit archaic, and it always surprises me when I see it in logs or databases.&lt;/p&gt;

&lt;p&gt;But that said, it has a really nice property: values are emitted as simple integers, making it easy to see when one thing happened before another. Sorting logs with your &lt;a href="https://debugmsg.io/debugger-toolbelt-bbedit"&gt;text editor of choice&lt;/a&gt; is also very easy. And since I had to do &lt;strong&gt;exactly that&lt;/strong&gt; recently, I am now warming to Unix time. Luckily I have 17 more years to take advantage of it.&lt;/p&gt;

&lt;p&gt;Of course, looking at a very large number is pretty opaque when you are trying to relate Unix time to something humans are more used to. There are probably a gazillion tools to do that kind of conversion, but I just have &lt;a href="https://www.unixtimestamp.com"&gt;https://www.unixtimestamp.com&lt;/a&gt; always open to quickly do the math. Thanks, Dan's Tools!&lt;/p&gt;

&lt;h3&gt;Beware the filter&lt;/h3&gt;

&lt;p&gt;As you know, being a master debugger and all, rich log files are big log files. So invariably you are going to be filtering log data to isolate certain flows. But the moment you filter the logs, you are hiding data that could be very material to the issue you are debugging.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Filtering, while useful and necessary, is making assumptions about what is, and is not, important.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was looking at some logs the other day, filtering on the specific flow I was looking for. Yup, there were the expected log entries. Counting them, there was one missing. Hmm. Looking closely at the timestamps of the expected entries (which were periodic), there was obviously a gap. Since this was a server log, the obvious (and naive) conclusion is that the request for that time period was not made by the client system.&lt;/p&gt;

&lt;p&gt;But because the logs were filtered, it was obscuring an important fact. The server was restarting. A lot. And looking at an unfiltered view of the time span where the expected "missing" log entry should be, it became clear that the log entry was not there because the server was restarting at the time.&lt;/p&gt;

&lt;p&gt;In another case, for another problem, the log for a particular flow looked correct. Because we were looking at a particular customer's flow, the logs were being filtered by "client ID". Request received, request processed, data pushed out. Yup, looks good.&lt;/p&gt;

&lt;p&gt;But again, looking at the timestamps revealed a problem: the time between the receipt of the initial request and the processing was 2 minutes. (Interestingly, &lt;strong&gt;exactly&lt;/strong&gt; 2 minutes.) It was easy to miss because when you are looking at a lot of log data, timestamps tend to blur, especially when they look like "2021-07-17T22:04:03.933-04:00". Which is the minute and which is the second? But there it was, if you looked closely.&lt;/p&gt;
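&lt;p&gt;When timestamps start to blur like that, it can help to let code do the subtraction instead of your eyes; here is a small Python sketch over hypothetical log lines in the format quoted above:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical filtered log lines, ISO 8601 timestamps with a UTC offset.
lines = [
    "2021-07-17T22:02:03.933-04:00 request received",
    "2021-07-17T22:04:03.933-04:00 request processed",
]

# Parse the leading timestamp of each line and diff consecutive entries.
stamps = [datetime.fromisoformat(line.split(" ", 1)[0]) for line in lines]
gap = stamps[1] - stamps[0]
print(gap.total_seconds())  # 120.0 -- exactly 2 minutes
```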




&lt;p&gt;The techniques and tips here are pretty basic, and yet absolutely &lt;strong&gt;critical&lt;/strong&gt; to successfully solving issues, especially in distributed systems. Just remember: &lt;a href="https://www.youtube.com/watch?v=_oSRvcdlgSI"&gt;time is on your side&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;println&lt;/h3&gt;

&lt;p&gt;Here are a couple of debugging-related posts from around the web, for your reading pleasure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sofiajonsson/5-debugging-tips-for-beginners-7kk"&gt;5 Debugging Tips for Beginners&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the process of learning how to code, we inevitably run into problems. Some are easier to solve than others...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application-introspection/"&gt;Application Introspection and Debugging&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once your application is running, you’ll inevitably need to debug problems with it. Earlier we described how you can use kubectl get pods to retrieve simple status information about your pods. But there are a number of ways to get even more information about your application. Using kubectl describe …&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/containers/capturing-logs-at-scale-with-fluent-bit-and-amazon-eks/"&gt;Capturing logs at scale with Fluent Bit and Amazon EKS | Amazon Web Services&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Earlier this year, AWS support engineers noticed an uptick in customers experiencing Kubernetes API server slowness with their Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Seasoned Kubernetes users know that a slow Kubernetes API server is often indicative of a large, overloaded cluster …&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Debugger Tool Belt: BBEdit (...or whatever your favourite text editor is)</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Wed, 30 Jun 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/debugger-tool-belt-bbedit-or-whatever-your-favourite-text-editor-is-2j60</link>
      <guid>https://dev.to/tomotvos/debugger-tool-belt-bbedit-or-whatever-your-favourite-text-editor-is-2j60</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6K9xt8aZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1508873535684-277a3cbcc4e8%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDd8fHRvb2xzfGVufDB8fHx8MTYyNDg5MjcwNg%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6K9xt8aZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1508873535684-277a3cbcc4e8%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDd8fHRvb2xzfGVufDB8fHx8MTYyNDg5MjcwNg%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Debugger Tool Belt: BBEdit (...or whatever your favourite text editor is)" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Understanding the debugging process deeply is the key to becoming a master debugger. However, any master debugger will have a set of go-to tools to elevate their game. While it is common to write simple tools during debugging, using COTS products that are focused on doing one thing well can greatly accelerate resolution of problems.&lt;/p&gt;

&lt;p&gt;In this post, I will focus on a tool I use &lt;strong&gt;all the time&lt;/strong&gt; : &lt;a href="https://www.barebones.com/?ref=AboutBBEdit"&gt;BBEdit&lt;/a&gt;, by Bare Bones Software.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Because I am a Mac guy, I am focusing on BBEdit, but you can insert your favourite text editor into this discussion and hopefully map its equivalent features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;The power of text&lt;/h3&gt;

&lt;p&gt;Text files are the lingua franca of software. Your source code is in text files, your log files are (generally) in text files, and just about any high-level diagnostic tool will allow its data to be exported into text files of some form.&lt;/p&gt;

&lt;p&gt;For this reason, being able to manipulate text quickly, easily, and flexibly is a debugging &lt;strong&gt;power move&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To be able to successfully process text data, you need an editor that will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Be able to consume and efficiently work with large files.&lt;/li&gt;
&lt;li&gt;Be able to do advanced searching using grep-style (regular expression) patterns.&lt;/li&gt;
&lt;li&gt;Be able to do advanced sorting using grep-style (regular expression) patterns.&lt;/li&gt;
&lt;li&gt;Be able to do (2) and possibly (3) on multiple files at once.&lt;/li&gt;
&lt;li&gt;Be able to do smart comparisons between files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll drill into these a bit more, showing how BBEdit helps me solve problems.&lt;/p&gt;

&lt;h3&gt;File handling&lt;/h3&gt;

&lt;p&gt;A good text editor for debugging needs to handle large files. Log files collected over an extended period of time, especially with elevated debug levels, can be very large, and your text editor cannot choke on them.&lt;/p&gt;

&lt;p&gt;BBEdit can easily chew through files of hundreds of megabytes, and even over 1 GB. If your files are much bigger than that, you might consider pre-filtering your data using grep or, at least, opening them in BBEdit and then grepping the content into a smaller file for deeper analysis from there.&lt;/p&gt;

&lt;p&gt;Related to file handling is file differencing. This can be important when you are running side-by-side tests and looking at log output to see what has changed. BBEdit makes this easy to do, while filtering out whitespace and such. This is a very common thing to do, so there is nothing magic about BBEdit here. But if your editor of choice does &lt;strong&gt;not&lt;/strong&gt; do that in a simple and intuitive way, find another editor.&lt;/p&gt;

&lt;h3&gt;Search and sort&lt;/h3&gt;

&lt;p&gt;I put searching and sorting together because both should rely on a common matching algorithm that is, ideally, based on grep-style regular expressions. Yes, &lt;em&gt;that&lt;/em&gt; grep. It has taken me a long time to warm to regular expressions but because they are used &lt;em&gt;everywhere&lt;/em&gt; it was a sink-or-swim kind of thing. Now that I am fairly comfortable with them, I strongly prefer them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To be totally up front about it, I can't say that I am an expert at regular expressions (who is?) and so I rely heavily on sites like &lt;a href="https://rubular.com"&gt;https://rubular.com&lt;/a&gt; to help me out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;BBEdit has really nice feedback as you construct your patterns, and highlights text as you modify the pattern. This lets you immediately know you are selecting the right thing. One benefit of something like rubular.com, though, is the little cheat sheet at the bottom of the page if your grep fu is not as strong as it could be.&lt;/p&gt;

&lt;p&gt;But search goes beyond just finding words in the file. Two huge features in BBEdit that build off the regular expression patterns are line matching and sorting, both of which I use heavily.&lt;/p&gt;

&lt;p&gt;For line matching, there is the "Process Lines Containing..." menu item, which allows you to pull lines out of the file and copy them to the clipboard or a new document in one fell swoop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F-1zuGIt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/image.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F-1zuGIt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/image.jpeg" alt="Debugger Tool Belt: BBEdit (...or whatever your favourite text editor is)" width="800" height="331"&gt;&lt;/a&gt;Process Lines Containing...&lt;/p&gt;

&lt;p&gt;This command is awesome for high-level filtering of your data, especially if you have a very large initial file. It is especially useful when you have multiple files and are pulling line matches from each of them; use "Copy to clipboard" on each file and simply paste into a new document that has the consolidated lines.&lt;/p&gt;

&lt;p&gt;The other killer tool is "Sort Lines...":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SOAuCST---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/image-1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SOAuCST---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/image-1.jpeg" alt="Debugger Tool Belt: BBEdit (...or whatever your favourite text editor is)" width="800" height="376"&gt;&lt;/a&gt;Sort Lines...&lt;/p&gt;

&lt;p&gt;The power move here is that you can match text from the middle of each line and use it as the sorting key. In the sample above, StartDateTime was a Unix-based timestamp that, being numerical, allowed me to easily sort by that value as opposed to the timestamp on the log row itself.&lt;/p&gt;

&lt;p&gt;I cannot overstate how powerful this is. In a recent example, because the log file in question had data from various StartDateTime values interleaved but sorted by "log time", I could not understand the root of the issue I was debugging. But by sorting using a value embedded in the log lines, I was able to immediately see the problem because the human eye is &lt;strong&gt;really good&lt;/strong&gt; at seeing patterns and, in this case, huge runs of replicated values. Yes, I could have pulled this into Excel, written some formulae to parse out values, and then sorted, but...well...Excel.&lt;/p&gt;
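&lt;p&gt;If your editor of choice lacks this feature, the same trick is only a few lines of code; here is a minimal Python sketch with hypothetical log lines:&lt;/p&gt;

```python
import re

# Sort lines by a value embedded mid-line (a hypothetical StartDateTime
# Unix timestamp) rather than by the leading log time.
lines = [
    "2021-06-29T10:00:01 StartDateTime=1624600000 job=b",
    "2021-06-29T10:00:02 StartDateTime=1624500000 job=a",
]

key = re.compile(r"StartDateTime=(\d+)")
ordered = sorted(lines, key=lambda line: int(key.search(line).group(1)))
print(ordered[0])  # the job=a line: the smaller StartDateTime sorts first
```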

&lt;p&gt;I should also point out that BBEdit remembers search patterns so, when I had to analyze another log file for a potentially similar problem the next day, I was able to sort, eyeball, and confirm the issue in seconds.&lt;/p&gt;

&lt;h3&gt;One file, multiple files&lt;/h3&gt;

&lt;p&gt;I'll mention this for completeness, although I am sure any decent editor will handle this. You need to be able to work with multiple files as easily as one. Search and replace needs to work across files. Sorting less so (and I don't think I can sort multiple files with BBEdit in one command) as long as your criteria are remembered. But whipping through multiple files in the UI has to be effortless.&lt;/p&gt;

&lt;p&gt;And while we're on the subject of completeness, look at the other BBEdit-y goodness in just this one menu alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qlccHRTT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/image-2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qlccHRTT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/image-2.jpeg" alt="Debugger Tool Belt: BBEdit (...or whatever your favourite text editor is)" width="524" height="1686"&gt;&lt;/a&gt;Text Editing Goodness&lt;/p&gt;




&lt;p&gt;While this has been about BBEdit, I can say that much of this functionality is also available in Notepad++ on Windows. The interface is a bit "dense" for my liking, but I am sure that if that is your go-to tool, you'll be as fluent in it as I am in BBEdit.&lt;/p&gt;

&lt;p&gt;Whichever tool you use, my advice is to learn all the tricks it has to offer, so that you can maximize your analysis capability beyond what Cmd/Ctrl-F can provide.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Pulling a Thread</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Tue, 22 Jun 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/pulling-a-thread-lo3</link>
      <guid>https://dev.to/tomotvos/pulling-a-thread-lo3</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SGPnirOd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1517490970599-197965fbcef4%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDE1fHxwdWxsaW5nJTIwdGhyZWFkfGVufDB8fHx8MTYyNDE0MjgyMw%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SGPnirOd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1517490970599-197965fbcef4%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDE1fHxwdWxsaW5nJTIwdGhyZWFkfGVufDB8fHx8MTYyNDE0MjgyMw%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Pulling a Thread" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an elite debugger, you are conditioned to expect data that highlights a specific problem that needs to be resolved, or to ask for it if it is not there. But there are times when some data is provided through honest attempts but it is still...lacking. What to do?&lt;/p&gt;

&lt;p&gt;One approach is to throw up a wall and say, "Sorry, no data, no work."&lt;/p&gt;

&lt;p&gt;That is reasonable if, time and time again, you get tossed issues that lack sufficient data. But it is also not very professional.&lt;/p&gt;

&lt;p&gt;Sometimes I let frustration get the better of me and push back in this way but, more often than not, I try and make do with what is there and see if I can uncover something useful. It may, rarely, lead to a complete solution but more often it leads to deeper questions that can be asked, and more useful data gathered.&lt;/p&gt;

&lt;p&gt;This is about that rare case.&lt;/p&gt;

&lt;h3&gt;Problem scenario&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where there is some high-level metric computed by a consumer of your data, and you don't really know what is behind that metric. But those who do are pointing at something and saying:&lt;/p&gt;

&lt;p&gt;"See this number. It is wrong."&lt;/p&gt;

&lt;p&gt;Note that they don't say, "it should be THIS", but only that it is wrong.&lt;/p&gt;

&lt;p&gt;Then they present a whole bunch of other examples of the high level metric being incorrect, while not giving the context for THAT NUMBER that is wrong. Rabbit holes are traversed, and other random bits of "helpful" examples are thrown into the pot before it lands in your lap. How do you make any headway with this, let alone solve it?&lt;/p&gt;

&lt;p&gt;Again, you could say that you need more, specific, data. When was the last time it worked? &lt;a href="https://debugmsg.io/whats-changed/"&gt;What's changed&lt;/a&gt;? Did it ever work?&lt;/p&gt;

&lt;p&gt;But there is that nugget way up in the data, where "this number" was wrong. While you don't know what is wrong about it (yet), you can at least try to replicate that number, because you have enough context to do so, and see where that leads. It is a small thread, but you pull on it a bit.&lt;/p&gt;

&lt;h3&gt;The first tug&lt;/h3&gt;

&lt;p&gt;So you assemble whatever extra data you need to get to that number, and try to recreate it. You account for time zones and other miscellanea. Thankfully you also have database snapshots that drive the computations. And luckily, while THAT NUMBER is usually an aggregation of many rows of data, this sample case turns out to have exactly one row.&lt;/p&gt;

&lt;p&gt;I say "luckily" although experience led me to pick that specific case of the wrong number because I saw that a single row of data was feeding it. There were others that had two, or ten, but the single row case was the ultimate &lt;a href="https://debugmsg.io/reducio/"&gt;reduction&lt;/a&gt; of the problem.&lt;/p&gt;

&lt;p&gt;And, lo and behold, you get that number. Exactly.&lt;/p&gt;

&lt;p&gt;Just as significant, you get that number in the context of &lt;strong&gt;other&lt;/strong&gt; numbers output by the same processing for that single row. And you can now see the intermediate steps that are used to generate that number, and can step through those intermediate results.&lt;/p&gt;

&lt;p&gt;You still don't know why (or even if) THAT NUMBER is wrong, but you now have something to dig into. All because you pulled on a thread.&lt;/p&gt;

&lt;h3&gt;The Great Unravelling&lt;/h3&gt;

&lt;p&gt;Looking at the other numbers, it becomes immediately obvious where THAT NUMBER comes from, as a sum of a few others. So, are those other numbers correct? Hmm.&lt;/p&gt;

&lt;p&gt;Not really expecting much, you realize that there is another way these numbers can be generated: a different process that should yield the same results. For shits and giggles, you run that. And amazingly, the numbers are different. Actually, to be precise, all the numbers are the same except for THAT NUMBER.&lt;/p&gt;

&lt;p&gt;Looking closely at the code behind both computations, you uncover a subtle but now obvious difference. And looking at the code history, you see that nothing has changed in the errant code's processing, but that the other process has in fact been updated, and that change was not propagated to the broken one. (This, by the way, is one very good reason to not fork code in the name of expediency.)&lt;/p&gt;

&lt;p&gt;So now we not only know why the number is different but we have a counter example that shows a (presumably) updated calculation. Going back to the root symptom, it is alleged that THAT NUMBER should be smaller and, indeed, the new calculation is smaller.&lt;/p&gt;

&lt;p&gt;Finally, you dig into the detailed specification of what these numbers are supposed to represent (not easily, it turns out) and see that, indeed, the new calculation is the correct one. Yay.&lt;/p&gt;




&lt;p&gt;This scenario was not cooked up to make a point, but actually happened. And you might be thinking that I got very lucky. In a sense, there was some luck but, frankly, this was a success as much by asking the right questions as it was by luck. I had no idea what reproducing a single number in a sea of data (and there was a LOT of data collected in the nearly two weeks before this landed on me) would lead to, but it was the only thing I could &lt;strong&gt;concretely&lt;/strong&gt; grab onto...and pull.&lt;/p&gt;

&lt;p&gt;And as I said at the outset, it would have been easy to say there is no useful data here, and ship it back with a request to get the "right" data.&lt;/p&gt;

&lt;p&gt;But instead of delaying yet again, I dug in for several hours and now we have a path to resolution.&lt;/p&gt;

&lt;p&gt;Thread pulling FTW!&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Swarming</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Mon, 14 Jun 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/swarming-58e6</link>
      <guid>https://dev.to/tomotvos/swarming-58e6</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PuUayTlP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1573762462482-d8acd37ffa1f%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDZ8fGJlZXN8ZW58MHx8fHwxNjIzNTk1NzUx%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PuUayTlP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1573762462482-d8acd37ffa1f%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDZ8fGJlZXN8ZW58MHx8fHwxNjIzNTk1NzUx%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Swarming" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While these posts have generally been about elevating your debugging abilities to Yoda-like stature, today I would like to riff a bit on swarming.&lt;/p&gt;

&lt;p&gt;For those not familiar with the term, swarming refers to assembling a large group of people on a problem in order to come to some kind of resolution. Ideally, a swarm should have the following basic properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There should be a decent number of individuals to bring lots of "eyes" on the problem.&lt;/li&gt;
&lt;li&gt;There should &lt;strong&gt;not&lt;/strong&gt; be so many humans that a large number of them are sitting idle, twiddling their thumbs.&lt;/li&gt;
&lt;li&gt;There should be a wide representation of skills and SMEs (Subject Matter Experts) in the swarm.&lt;/li&gt;
&lt;li&gt;There needs to be a good (single!) communication channel for the swarm.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Eyes on the prize&lt;/h3&gt;

&lt;p&gt;The point of a swarm is to bring different perspectives on a problem. While it &lt;strong&gt;is&lt;/strong&gt; possible to swarm with as little as two people, generally that won't be as effective as a larger group. That said, I have found that the very act of verbalizing a problem to even a single person not intimately familiar with an issue can be a catalyst to uncovering a solution.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Generally an effective swarm will have a half-dozen or so people.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of course, there is no hard and fast rule here, so it is probably more instructive to grok the basic principle rather than slavishly follow some prescription. The effectiveness of the swarm comes primarily from the different perspectives of the people involved. There will probably be one person that is driving the group (as a master debugger, that may be you). But think of a swarm as a brainstorming exercise, where no idea is "stupid". Get stuff out on the table and look at everything.&lt;/p&gt;

&lt;p&gt;Remember that scene in "Apollo 13" where they are trying to figure out how to not kill the astronauts? Someone dumps a bunch of hardware on the table, and the engineers around that table try to figure out a solution to the air scrubbing problem. That was a swarm.&lt;/p&gt;

&lt;p&gt;On the flip side, a swarm can be too large when a significant number of people are not contributing. This can be a waste of time but, frankly, if a problem is important enough to swarm then maybe that is a cost worth paying. You'll need to decide that for yourself, but when we look at communication later, this might be less of an issue than you may think.&lt;/p&gt;

&lt;h3&gt;Broad, not deep&lt;/h3&gt;

&lt;p&gt;A really good swarm will have people with a wide range of skills, and represent a variety of domains. This has two benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Different perspectives often yield insights that you may not have thought of. Even a "silly" question from someone who doesn't know the specific details of an aspect of the problem (note: there are no silly questions!) can uncover a small thread of something that unravels the problem.&lt;/li&gt;
&lt;li&gt;Different domains can grease the skids where there is some blocker to moving forward. A manager may not necessarily get the deep technical details of an issue, but can facilitate pulling someone else in to, say, solve a permission or access issue.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A SME can shed light on how something is supposed to work, or how it is supposed to look from a customer perspective, while a team dev might know what gets written to a database table. Correlating this in real-time can often illuminate an issue that might otherwise be missed, and that can lead to finding the solution.&lt;/p&gt;

&lt;p&gt;If you are the facilitator of the swarm, this is the time to use the debugging skills you have &lt;a href="https://debugmsg.io/tag/basics/"&gt;developed&lt;/a&gt;, and work through them with the group. Assume nothing, look at logs, reproduce and reduce the problem, and use the collective analytical skills of everyone in the swarm to drive to a solution.&lt;/p&gt;

&lt;h3&gt;An open channel&lt;/h3&gt;

&lt;p&gt;There needs to be a frictionless channel of communication when swarming, and if there is any silver lining to the pandemic, it is that those channels have matured at a significantly accelerated rate. There is really no barrier to having a swarm distributed geographically except, possibly, time zones. While you can certainly swarm in a meeting room where everyone is in the same location and can see the same whiteboard or presentation screen, it is that shared view that matters most, and screen sharing makes it just as achievable remotely.&lt;/p&gt;

&lt;p&gt;To be more specific, effective debugging, either alone or in a swarm, requires you to look at code, logs, databases, UI, etc. Therefore whatever channel you use must ensure that everyone is looking at the same thing at the same time, and if you can all look at the same screen then that can happen remotely just as easily.&lt;/p&gt;

&lt;p&gt;I was in a swarm this week where I was sharing my screen to look at logs and databases, and then someone else (two time zones away) would take control for a few minutes and show what the customer was seeing as we replayed a scenario over and over, while a third would then show &lt;strong&gt;other&lt;/strong&gt; logs to see what was happening in their specific part of the solution. Transitioning was painless thanks to the tool we were using, and everyone could follow along with advice or questions.&lt;/p&gt;

&lt;p&gt;The right tool for the job should also allow for two important side benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The tool should collect all the necessary artifacts of the swarm in &lt;strong&gt;one place&lt;/strong&gt;, so that it can be reviewed later. Or if the swarm needs to be continued with other people, then there is some record of what was done so they can come up to speed. These artifacts could include chat messages, video recordings, screen shots, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note the emphasis on one place. An effective swarming tool should &lt;strong&gt;not&lt;/strong&gt; spread out artifacts to make them harder to find.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;The tool should allow people to drop in and out of the swarm as they are needed. This avoids the thumb-twiddling noted earlier, maximizes the effectiveness of the swarm while respecting people's time, and gives an opportunity for senior management to drop in and out as needed if an issue is highly escalated.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Debugging as a team through swarming can be an extremely effective way of solving deep or hard problems. It can also be a great way to teach others how to debug a problem using the techniques &lt;a href="https://debugmsg.io/#/portal/signup"&gt;you learn here&lt;/a&gt;. Part of mastery is teaching others, so share the wealth! While you are at it, please share this with anyone you think might benefit from it.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Patronus</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 06 Jun 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/patronus-51b9</link>
      <guid>https://dev.to/tomotvos/patronus-51b9</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--isr-6QJA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1543756605-a90da919605a%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDE0fHxlbGt8ZW58MHx8fHwxNjIyOTAwMjIw%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--isr-6QJA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1543756605-a90da919605a%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDE0fHxlbGt8ZW58MHx8fHwxNjIyOTAwMjIw%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Patronus" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ELK stack is a very common way to collect, aggregate, and analyze log data. It is used by large enterprises and small teams alike owing, I think, to its use of open-source software and the ability to self-host or use managed services, as you prefer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://debugmsg.io/debugging-fluency/"&gt;Last week&lt;/a&gt;, I discussed the "L" of the stack or, more precisely, the "F" variant of the similar "EFK stack": using FluentBit to collect logs from disparate systems and do something with them.&lt;/p&gt;

&lt;p&gt;In the ELK/EFK model, the "do something" means ship the logs to Elasticsearch, where they are archived and indexed, and then analyze the data using Kibana. In this post, we'll look a bit more deeply into that.&lt;/p&gt;

&lt;h3&gt;Elasticsearch&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/elasticsearch/"&gt;From Elastic.co&lt;/a&gt;, the company that is the gatekeeper for ELK:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning fast search, fine‑tuned relevancy, and powerful analytics that scale with ease.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is industrial-strength, used by many, many, many companies, and also forms the basis of the AWS offering of the same name (although Elastic goes to pains to &lt;a href="https://www.elastic.co/aws-elasticsearch-service"&gt;differentiate&lt;/a&gt; themselves from that).&lt;/p&gt;

&lt;p&gt;But I think one of the really huge selling points is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Go from prototype to production seamlessly; you talk to Elasticsearch running on a single node the same way you would in a 300-node cluster.&lt;/p&gt;

&lt;p&gt;It scales horizontally to handle kajillions of events per second, while automatically managing how indices and queries are distributed across the cluster for oh-so-smooth operations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As described below, you can take advantage of that to prototype a system and demonstrate the value proposition easily and effectively.&lt;/p&gt;

&lt;h3&gt;Kibana&lt;/h3&gt;

&lt;p&gt;While you can certainly query an Elasticsearch database using REST, that is far from convenient. And so the "K" of the ELK/EFK stack is a critical component of the system. Again &lt;a href="https://www.elastic.co/kibana"&gt;from Elastic.co&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kibana is a free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack. Do anything from tracking query load to understanding the way requests flow through your apps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It provides the following powerful functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intuitive search of data&lt;/li&gt;
&lt;li&gt;dashboards to visualize data in charts, graphs, and counters&lt;/li&gt;
&lt;li&gt;administrative functions to manage the Elasticsearch archive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, Kibana can pull data in from other sources too (like Prometheus, for example) so you can use it as your one-stop shopping for logs &lt;strong&gt;and&lt;/strong&gt; metrics. In fact, you may already be using Kibana and so pointing it to Elasticsearch may be all that you need to do.&lt;/p&gt;

&lt;h3&gt;Quick and Dirty&lt;/h3&gt;

&lt;p&gt;Assuming that you do not have any ELK/EFK infrastructure in place, how do you get started? Docker, of course. It turns out to be &lt;a href="https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-docker.html"&gt;trivially easy&lt;/a&gt; to start up an Elasticsearch and Kibana instance using &lt;code&gt;docker-compose&lt;/code&gt;, and then point FluentBit at it to deliver the data.&lt;/p&gt;

&lt;p&gt;Use the following YAML to get the stack going with a single Elasticsearch node:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.1
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - cluster.initial_master_nodes=es01
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic

  kib01:
    image: docker.elastic.co/kibana/kibana:7.13.1
    container_name: kib01
    ports:
      - 5601:5601
    environment:
      ELASTICSEARCH_URL: http://es01:9200
      ELASTICSEARCH_HOSTS: '["http://es01:9200"]'
    networks:
      - elastic

volumes:
  data01:
    driver: local

networks:
  elastic:
    driver: bridge

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start it up with &lt;code&gt;docker-compose up&lt;/code&gt; and then go to &lt;code&gt;http://localhost:5601/&lt;/code&gt; to confirm that it is running.&lt;/p&gt;

&lt;p&gt;On the FluentBit side, you now merely need to create an output that points to this Elasticsearch instance to populate data. Note that one of the really powerful features of FluentBit is that you can define &lt;strong&gt;multiple&lt;/strong&gt; outputs based on your requirements, so you can, for example, ship all log events to Elasticsearch, but then also ship error events to the console.&lt;/p&gt;

&lt;p&gt;In the following snippet, all log events are pushed both to the console and this new Elasticsearch instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[OUTPUT]
    Name stdout
    Match *
    Format json
    json_date_format iso8601

[OUTPUT]
    Name es
    Match *
    Host 127.0.0.1
    Port 9200
    Index my_new_index

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time you start feeding events to Elasticsearch, it will create a new "index" based on the name you provided above, and then you can use the "Discover" section of Kibana to start seeing your data flowing in.&lt;/p&gt;

&lt;p&gt;Spend time in "Discover" to get a feel for its filtering and sorting capability, and then try creating a new dashboard to chart some of the data (for example, a pie chart showing INFO, WARN, and ERROR messages):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dNc-aeaB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/Screen-Shot-2021-06-05-at-10.45.23-AM-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dNc-aeaB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://debugmsg.io/content/images/2021/06/Screen-Shot-2021-06-05-at-10.45.23-AM-1.jpg" alt="Patronus" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Next steps&lt;/h3&gt;

&lt;p&gt;There is a great deal more to explore as you start feeding data into your stack. The FluentBit side allows you to parse out key data elements from log lines which then become explicitly searchable in Kibana (such as "tenant" in the example above). As you think about how this might go into production, with multiple sources feeding into it, you'll want to parse out things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event timestamp, obviously&lt;/li&gt;
&lt;li&gt;log type (DEBUG, INFO, WARN, ERROR)&lt;/li&gt;
&lt;li&gt;some kind of process identifier&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://debugmsg.io/if-a-tree-falls/"&gt;trace and span&lt;/a&gt; identifiers&lt;/li&gt;
&lt;li&gt;customer/client/source identifier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that the full text is always searchable, but by pulling out key values, they are indexed separately and become powerful filtering values in your visualizations.&lt;/p&gt;

&lt;p&gt;There will be much more on ELK/EFK in the coming weeks, so if you have not already, please &lt;a href="https://debugmsg.io/#/portal/signup"&gt;subscribe&lt;/a&gt; to keep up to date.&lt;/p&gt;




&lt;p&gt;To round things out, there are a couple of housekeeping items. First, I am considering moving the regular publishing date from Sunday morning to Monday or Tuesday. Those that know these things suggest there is better engagement that way, but I am soliciting feedback. Please let me know your thoughts at &lt;a href="mailto:debugmsgio@gmail.com"&gt;debugmsgio@gmail.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Second, I would like to give a shout-out to the developer community &lt;a href="https://dev.to/"&gt;DEV.to&lt;/a&gt;, where I am cross-posting some of these essays. There is a lot of technical goodness at DEV.to, so please check it out.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>The Art of Debugging - Intro</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 30 May 2021 16:05:30 +0000</pubDate>
      <link>https://dev.to/tomotvos/the-art-of-debugging-4p47</link>
      <guid>https://dev.to/tomotvos/the-art-of-debugging-4p47</guid>
      <description>&lt;p&gt;I recently started a Ghost blog, writing about debugging. My basic thesis is that debugging, like programming, is a technical skill that anyone can learn. However debugging well, like programming well, is as much art as it is skill.&lt;/p&gt;

&lt;p&gt;In my long career as a software developer and architect, I have had to debug a lot of things. And it continually amazes me, when working with others on a problem, how some people just get lost trying to get at the root of a problem. What seem like obvious next steps elude them.&lt;/p&gt;

&lt;p&gt;So this blog is about helping people see the forest for the trees. It shares what I have learned over many years, and many problems, hopefully demonstrating the core concepts of effective debugging in whatever language or problem domain you happen to be in.&lt;/p&gt;

&lt;p&gt;If this sounds like something you are interested in, I invite you to subscribe at &lt;a href="https://debugmsg.io/"&gt;https://debugmsg.io/&lt;/a&gt;. And if you are not interested, but think you know someone who might be, please let them know about it.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Debugging Fluency</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 30 May 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/debugging-fluency-2p83</link>
      <guid>https://dev.to/tomotvos/debugging-fluency-2p83</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jMiewYMY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1529236183275-4fdcf2bc987e%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDQ0fHxjb21wdXRlcnxlbnwwfHx8fDE2MjIzMTg5NDE%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jMiewYMY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1529236183275-4fdcf2bc987e%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDQ0fHxjb21wdXRlcnxlbnwwfHx8fDE2MjIzMTg5NDE%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Debugging Fluency" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://debugmsg.io/if-a-tree-falls/"&gt;previous&lt;/a&gt; &lt;a href="https://debugmsg.io/debugging-501/"&gt;posts&lt;/a&gt;, I have stressed the importance of good logs, and of them being the cornerstone of effective debugging. However as I have also pointed out, we are not always in control of how the logging is done when we are debugging systems of software components, where only a part of that system is something you or your team has written.&lt;/p&gt;

&lt;p&gt;In these cases, you would typically want to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Consolidate all the logs from all components into a single place.&lt;/li&gt;
&lt;li&gt;Transform the logs (as best you can) into a common format.&lt;/li&gt;
&lt;li&gt;Enrich the logs (as best you can) with useful metadata to link things up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the logs are all consolidated, you are then empowered to do a deep analysis of flows and interactions, assuming you chose to consolidate the logs in some equally deep tool. Examples of such tools include (but are not limited to):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splunk, &lt;a href="https://www.splunk.com"&gt;https://www.splunk.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;DataDog, &lt;a href="https://www.datadoghq.com"&gt;https://www.datadoghq.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Elasticsearch/Kibana, &lt;a href="https://www.elastic.co"&gt;https://www.elastic.co&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For today, I am going to focus on the last one because it is insanely popular, and because it can be self-hosted due to its use of open source components.&lt;/p&gt;

&lt;h3&gt;ELK vs. EFK&lt;/h3&gt;

&lt;p&gt;The "&lt;a href="https://www.elastic.co/elastic-stack"&gt;ELK stack&lt;/a&gt;" has become one of the standards for deep log file analysis. It is comprised of three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elasticsearch, for storage and indexing&lt;/li&gt;
&lt;li&gt;Logstash, for gathering and transforming&lt;/li&gt;
&lt;li&gt;Kibana, for visualization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In more recent times, the middle piece, Logstash, is often replaced by another open source tool called Fluentd. This is in part because Fluentd has more capabilities in manipulating the log data at ingestion, but equally because FluentBit, the lightweight variant of Fluentd, is super tight and lean, allowing it to run in containerized environments like Kubernetes.&lt;/p&gt;

&lt;p&gt;You can find numerous comparisons between Logstash and Fluentd. &lt;a href="https://logz.io/blog/fluentd-logstash/"&gt;Here&lt;/a&gt; is one, but Google will help you find lots more.&lt;/p&gt;

&lt;p&gt;You will frequently hear the term "EFK stack" instead of "ELK stack" to indicate the use of Fluentd or FluentBit. Funnily enough, Elastic.co still calls it "ELK", I think to keep the cool name, and repurposes the "L" for "ELasticsearch".&lt;/p&gt;

&lt;p&gt;Over the next few weeks, I'll try to dive deeper into Elasticsearch and Kibana but, for now, I want to focus on using FluentBit to do log gathering.&lt;/p&gt;

&lt;h3&gt;FluentBit&lt;/h3&gt;

&lt;p&gt;Consider a server that is running some mission-critical piece of software. It generates a bunch of logging data but, for disk storage reasons, only keeps the logs around for a short period of time. This log rotation is the key reason why, when things go bad, you often have nothing to work with unless you are on the scene when the problem occurs.&lt;/p&gt;

&lt;p&gt;Because it is mission-critical, you cannot bog the server down, so you need something that is lightweight. Because it generates a lot of log information, you also need something that is fast. One approach, of course, is to use some scheduled task that periodically zips up the log files and copies them to a network share. Kludgey but effective.&lt;/p&gt;

&lt;p&gt;The downside to the zip-and-copy approach is that it merely solves the archiving problem. Yes, logs will be available if something goes wrong, but it is better to ship the logs into something that provides on-demand as well as proactive analysis capabilities. So you need something that is collecting the log data in real time and shipping it in a more structured fashion.&lt;/p&gt;

&lt;p&gt;FluentBit hits the mark because it is small and fast (it is written in C), and has a wide variety of "plugins" and configurations to perform log ingestion, parsing, filtering, and shipping.&lt;/p&gt;

&lt;p&gt;This post will not be a detailed tutorial on FluentBit. I found &lt;a href="https://coralogix.com/blog/fluent-bit-guide/"&gt;this&lt;/a&gt; to be useful, as well as the &lt;a href="https://docs.fluentbit.io/manual/"&gt;documentation&lt;/a&gt; of course, although the latter sometimes has awkward English. But I will focus on my experiments with using FluentBit to gather logs from that mission-critical system, to give you a taste of its power.&lt;/p&gt;

&lt;h3&gt;Tailing logs&lt;/h3&gt;

&lt;p&gt;FluentBit has a wide range of &lt;a href="https://docs.fluentbit.io/manual/pipeline/inputs"&gt;input plugins&lt;/a&gt;, but for pulling logs from arbitrary software components, the most effective by far is "&lt;a href="https://docs.fluentbit.io/manual/pipeline/inputs/tail"&gt;tail&lt;/a&gt;". Simply, this plugin watches one or more text files, and every time something gets written to that file, it pulls it into its processing pipeline. You can tell it to watch a specific file, or you can wildcard it to watch multiple files. It can't get more basic than that and, in my tests, it was super effective, even when watching files on a remote server.&lt;/p&gt;

&lt;p&gt;The "tail" plugin has numerous configuration values, but two are particularly useful. The first allows you to specify that the log file(s) being watched are checkpointed in a SQLite database. This allows FluentBit to restart and pick up where it left off for each file it is watching, which can be handy if log data cannot be missed. Without checkpointing, each time FluentBit starts up it begins reading from the current end-of-file.&lt;/p&gt;

&lt;p&gt;I should note that, generally, configuring FluentBit is somewhat painful. Configuration is done in text files that use indentation to group values together (not unlike YAML). However, there were multiple times where it would not accept my indents as tabs, and I had to use runs of spaces to make things work. Blech.&lt;/p&gt;

&lt;h3&gt;Parsing&lt;/h3&gt;

&lt;p&gt;The second useful configuration allows you to associate a specific input source with a "parser". Parsers enable you to transform raw log input into structured data that can then be matched, filtered, or otherwise mutated before shipping to the output destination. In the most general case, the RegEx parser allows arbitrary log lines to be split up to pull out timestamp information (at least) but also, if the log file format is somewhat regular, additional metadata such as &lt;a href="https://debugmsg.io/if-a-tree-falls/"&gt;span and trace IDs&lt;/a&gt;, and other workflow identifiers.&lt;/p&gt;

&lt;p&gt;Here is an example of a parser I devised for my use case:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PARSER]
    Name swxiface
    Format regex
    Regex ^(?&amp;lt;time&amp;gt;[\d\-]*\s[\d:]*),\d+\s*(?&amp;lt;logtype&amp;gt;\w*)\s*(?&amp;lt;process&amp;gt;\S+)\s*(?&amp;lt;msg&amp;gt;.*C:(?&amp;lt;tenant&amp;gt;\d*).*)$
    Time_Key time
    Time_Format %Y-%m-%d %H:%M:%S

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that because my typical log line was quite regular, I could easily group parts of each line into meaningful bits, like "time", "logtype", and "process". These get put into the structured "events" that FluentBit then uses in the rest of the processing pipeline.&lt;/p&gt;
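&lt;p&gt;To see what that grouping does, here is a quick Python sketch of the same idea, using positional groups in place of FluentBit's named ones; the log line is made up, but shaped like the samples shown below:&lt;/p&gt;

```python
import re

# A simplified version of the parser regex above, with positional
# groups in order: time, logtype, process, msg, tenant (nested in msg).
LOG_RE = re.compile(r"^([\d\-]*\s[\d:]*),\d+\s*(\w*)\s*(\S+)\s*(.*C:(\d*).*)$")

# A made-up but representative log line.
line = ("2021-05-28 14:51:00,123 INFO r_Worker-2 "
        "(C:2 U:master S:JobScheduler_Worker-2_237 X:264093) "
        "[c.i.t.i.s.j.k.KettleRunner] Job result: false")

time, logtype, process, msg, tenant = LOG_RE.match(line).groups()
print(logtype, process, tenant)  # INFO r_Worker-2 2
```

&lt;p&gt;FluentBit's regex parser works the same way, except the named groups become keys in the structured event.&lt;/p&gt;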

&lt;p&gt;This structure allows me, for example, to ship only log entries that have a "logtype" of "ERROR". Or, more usefully, I could ship all log information to one output destination, but then &lt;strong&gt;also&lt;/strong&gt; ship "ERROR" log lines to another place. This is what FluentBit refers to as tag-based routing.&lt;/p&gt;
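&lt;p&gt;As a sketch of that routing (the tag names are hypothetical, and both outputs are stdout here purely for illustration), a "rewrite_tag" filter can re-tag "ERROR" events while the originals keep flowing to the main output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[FILTER]
    Name  rewrite_tag
    Match app.log
    Rule  $logtype ^ERROR$ app.errors true

[OUTPUT]
    Name  stdout
    Match app.log

[OUTPUT]
    Name  stdout
    Match app.errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The final "true" in the rule keeps the original event under its old tag, so "ERROR" lines are shipped twice: once with everything else, and once to wherever "app.errors" is routed.&lt;/p&gt;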

&lt;h3&gt;
  
  
  Routing to the console
&lt;/h3&gt;

&lt;p&gt;When developing a FluentBit workflow, one of the most useful outputs is stdout. It lets you test your source, parsing logic, and filtering before taking the next step and shipping the data somewhere. Using the parser given above, here is what the structured events look like (note the tagged elements); I'll end this post here and pick up the Elasticsearch leg next time:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[25] tail.0: [1622218260.000000000, {"logtype"=&amp;gt;"INFO", "process"=&amp;gt;"r_Worker-2", "msg"=&amp;gt;"(C:2 U:master S:JobScheduler_Worker-2_237 X:264093) [c.i.t.i.s.j.k.KettleRunner] Job result: false", "tenant"=&amp;gt;"2"}]
[26] tail.0: [1622218260.000000000, {"logtype"=&amp;gt;"INFO", "process"=&amp;gt;"r_Worker-2", "msg"=&amp;gt;"(C: U: S: X:) [c.i.t.i.s.j.BaseJob] [customer2.Plugin] [elapsed:0.019904448s]"}]
[27] tail.0: [1622218260.000000000, {"logtype"=&amp;gt;"DEBUG", "process"=&amp;gt;"r_Worker-2", "msg"=&amp;gt;"(C: U: S: X:) [c.i.t.i.s.j.s.JobHandler] [customer2.Plugin]: Execute done"}]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output needs some more work, because the timestamp is using Unix-style double values instead of something more human-readable. But then, this part is not supposed to be human-readable. That is the job of the rest of the stack.&lt;/p&gt;
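&lt;p&gt;Decoding those doubles is straightforward if you ever need to eyeball one; a quick Python sketch, using the first timestamp from the output above:&lt;/p&gt;

```python
from datetime import datetime, timezone

# FluentBit's event timestamp is seconds since the Unix epoch (UTC).
ts = 1622218260.0
readable = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(readable)  # 2021-05-28 16:11:00
```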

&lt;p&gt;Stay tuned.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Sacred Cows</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 23 May 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/sacred-cows-518b</link>
      <guid>https://dev.to/tomotvos/sacred-cows-518b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JqwsbsSD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1585920389306-421b3c127e40%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDY1fHxjb3d8ZW58MHx8fHwxNjIxNzI1OTI2%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JqwsbsSD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1585920389306-421b3c127e40%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDY1fHxjb3d8ZW58MHx8fHwxNjIxNzI1OTI2%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Sacred Cows" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In an &lt;a href="https://debugmsg.io/eliminate-the-impossible/"&gt;earlier post&lt;/a&gt;, I discussed the dangers of being wedded to your own ideas. I would like to riff on this because of how often it comes up, in others and even in myself.&lt;/p&gt;

&lt;p&gt;When you hold onto a belief without sufficient evidence, you run the very real risk of missing important data to allow you to debug an issue.&lt;/p&gt;

&lt;p&gt;"I did &lt;strong&gt;this&lt;/strong&gt; , which should make &lt;strong&gt;that&lt;/strong&gt; happen. So now that that has happened, let's look over there."&lt;/p&gt;

&lt;p&gt;Wait, what? When something "should happen", something that you are relying on in the steps that you are debugging, you had damn well better make &lt;strong&gt;sure&lt;/strong&gt; that happened. Validate everything before moving on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Trust no one" – Deep Throat&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Holding onto a belief too long can also cause you to diminish evidence that may be pivotal to uncovering something or, at the very least, distract your efforts. Constantly evaluate and re-evaluate your evidence to see where you might be making assumptions, or holding onto beliefs, that are not justified.&lt;/p&gt;

&lt;p&gt;"My code works this way. That other problem is something else."&lt;/p&gt;

&lt;p&gt;Uh, no, not necessarily.&lt;/p&gt;

&lt;p&gt;Time is on your side here, in the sense that having a very accurate timeline can pinpoint when things go astray. And that time-based data can fly in the face of your sacred cow. You need to let go when the evidence tells you to.&lt;/p&gt;

&lt;p&gt;"Shit really started to hit the fan Tuesday night. What did we do Tuesday night?"&lt;/p&gt;

&lt;p&gt;What sorts of evidence can you rely on? We have covered log files already, because they are (if properly done) an archive of what has happened in the process you are debugging. And that archive is gold.&lt;/p&gt;

&lt;p&gt;Another useful piece of evidence that is often overlooked is a modification timestamp on a database row or file. Knowing when a piece of data has changed can be the turning point for knowing what did, or did not happen.&lt;/p&gt;

&lt;p&gt;This came up recently where some database data we were looking at could be updated in one of two ways. The modification timestamp on the row, correlated with the log files, clearly demonstrated what flow was responsible for the update. And one of the core premises on which we were operating, a mighty cow, had to be put down.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Scout Mindset
&lt;/h3&gt;

&lt;p&gt;In the weird way that these things happen, the issue of sacred cows came up on a podcast this week during an interview with Julia Galef, the author of "&lt;a href="https://www.penguinrandomhouse.com/books/555240/the-scout-mindset-by-julia-galef/"&gt;The Scout Mindset&lt;/a&gt;", a topic she introduced several years ago in her &lt;a href="https://www.ted.com/talks/julia_galef_why_you_think_you_re_right_even_if_you_re_wrong"&gt;TED talk&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;In a nutshell, we are all predisposed to be in one of two modes, or mindsets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the "soldier mindset", where we staunchly defend our deeply held beliefs, sometimes to the point of irrationality;&lt;/li&gt;
&lt;li&gt;the "scout mindset", where we don't care about whether we are right and wrong, and merely want to understand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clearly (to me) debugging is about being a "scout", uncovering data to get at an understanding of a problem. And this post has been about being mindful of your inner "soldier". Note that, depending on the circumstances and beliefs, we can quickly flip between these two mindsets, with no one person being entirely a "soldier" or a "scout".&lt;/p&gt;

&lt;p&gt;But knowing which you are, as truthfully and consistently as you can, will help you stay intellectually honest, and be a powerful debugger in the process.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>What's Changed?</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 16 May 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/what-s-changed-47hm</link>
      <guid>https://dev.to/tomotvos/what-s-changed-47hm</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_gFHXTex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1559489433-3445e4953bdf%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDF8fGNoYW5nZXN8ZW58MHx8fHwxNjIxMTE0OTc3%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_gFHXTex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1559489433-3445e4953bdf%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDF8fGNoYW5nZXN8ZW58MHx8fHwxNjIxMTE0OTc3%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="What's Changed?" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To an observer whose go-to method of fixing things is "turn it off and then back on", software can seem pretty mysterious.&lt;/p&gt;

&lt;p&gt;"Well, it &lt;strong&gt;used&lt;/strong&gt; to work. It just stopped."&lt;/p&gt;

&lt;p&gt;Sorry, but things like that simply do not happen. Software can be complex, but it is 100% deterministic. It doesn't have a mind of its own, is not capricious, cruel, or arbitrary. And so if something used to work but then stopped working, there is a reason for it. Always.&lt;/p&gt;

&lt;p&gt;Your mission as an advanced debugger is to uncover what's changed.&lt;/p&gt;

&lt;p&gt;And again, as with &lt;a href="https://debugmsg.io/eliminate-the-impossible/"&gt;all&lt;/a&gt; of &lt;a href="https://debugmsg.io/reducio/"&gt;these&lt;/a&gt; "advanced" &lt;a href="https://debugmsg.io/if-a-tree-falls/"&gt;tips&lt;/a&gt;, it may seem gobsmackingly obvious, but too many times I have seen people go down rabbit holes without taking a step back and recalibrating their investigation by asking "what's changed".&lt;/p&gt;

&lt;h3&gt;
  
  
  A short inventory of changes
&lt;/h3&gt;

&lt;p&gt;There can of course be many, many things that have changed, but here is a short inventory to get you thinking about some of the ways software can be impacted:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Changes to the environment, such as base OS, or machine upgrades. Did a server patch or upgrade alter something your software depends on (e.g., a TLS version deprecation, which bit us a while back)?&lt;/li&gt;
&lt;li&gt;Changes to network configuration or topology. Arguably this is "environment" too, but deserves a special call-out because of how frequently IP changes or new firewall rules cause problems.&lt;/li&gt;
&lt;li&gt;Changes to key services you depend on. Think about web servers, file servers, and cloud services.&lt;/li&gt;
&lt;li&gt;Changes to your code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that code changes are listed last, not because code changes aren't likely to trigger issues (they usually do), but because you need to eliminate the obvious first before digging into code. Sometimes it is the combination of new code on your part, coupled with some other change, that tips things over from "it used to work" to not working now.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Effective debugging is about being methodical. You are essentially creating a binary search over the entire space of what can go wrong, and eliminating key things at the outset can reduce the problem space drastically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A point in time
&lt;/h3&gt;

&lt;p&gt;If you have (usually quickly) eliminated external issues, and now know that something in your code is broken, how do you go about finding the issue? The next most important piece of data you can obtain is &lt;strong&gt;when&lt;/strong&gt; the problem started happening. Note that this is not necessarily when it was first noticed, but when it actually started. You may have a customer report, or a QA bug logged, but you need to dig deeper, uncover what high-level issue is being observed, and try to see when, in time, it first became apparent. This may be through log files, or by using the same tool or application the customer or QA is using. But assume that when it was noticed, and when it actually started happening, are two different things.&lt;/p&gt;

&lt;p&gt;As a concrete example, we can consider our integration problem from &lt;a href="https://debugmsg.io/debugging-501/"&gt;last week&lt;/a&gt;. An external system is consuming data passed from our system, and the data appears wrong in the external system. Luckily, that external system has a usable interface for looking back in time. We can see the wrong data and, more importantly, we can see when the data went from being right to being wrong.&lt;/p&gt;

&lt;p&gt;So now we can ask the much more precise question: what has changed &lt;strong&gt;at this point in time&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Knowing when something happened is a vital clue to determining what has changed, because there are often cases where multiple things change over a period of time, and a problem is not noticed until after those changes are all in play. Which change caused the problem? By knowing the time as precisely as possible, you can eliminate changes that had no effect, and focus on those that did.&lt;/p&gt;

&lt;p&gt;In our integration example, a code fix was deployed to solve one problem, and suddenly another problem started to appear. However, that problem was masked by a &lt;strong&gt;third&lt;/strong&gt; problem, which seemed more important because the &lt;strong&gt;second&lt;/strong&gt; problem hadn't been reported yet, and so that was fixed next. By the time the second problem was noted, we had two fixes deployed. Which was the cause? By working backwards, seeing what was really being reported, and seeing how the initial fix had an unintended side effect, we were able to eliminate the fix to the third problem as the root issue. And furthermore, by seeing what the actual trigger was, we were well on our way to understanding the root cause of it all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source code control is your friend
&lt;/h3&gt;

&lt;p&gt;While you may know exactly what the issue is based on what version of code was deployed, more often than not all you know is that between &lt;strong&gt;this&lt;/strong&gt; version and &lt;strong&gt;that&lt;/strong&gt;, some functionality went bad. This is where source code control is your (very best) friend.&lt;/p&gt;

&lt;p&gt;After you have &lt;a href="https://debugmsg.io/reducio/"&gt;reduced the problem&lt;/a&gt; to a key functional unit of code, rather than just looking at the code and trying to deduce what is going wrong, use source code control to tell you &lt;strong&gt;exactly&lt;/strong&gt; what has changed. Sometimes that might point at a massive change set, but just as often it will point you at one or two key functions that have changed in some material way. This works between releases, and it also works during development where code is being tested at key milestones. As long as you have the &lt;strong&gt;when&lt;/strong&gt;, that line in the sand that says "before here it was good, after here it was bad", you can inspect the changes made by you or your team one by one and see how they may have contributed to the issue.&lt;/p&gt;

&lt;p&gt;TFS has a nice visual way to do a diff between two arbitrary change sets, and Git also enables you to easily compare two commits on a branch. Be sure you know how to use this functionality in whatever tool you use!&lt;/p&gt;
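&lt;p&gt;With Git, for example, that whole workflow boils down to a handful of commands (the version tags and paths here are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What changed between the good build and the bad one?
git diff v1.4.0 v1.4.1 -- src/billing/

# Which commits landed in that window?
git log v1.4.0..v1.4.1 -- src/billing/

# Binary-search the history for the offending commit
git bisect start
git bisect bad v1.4.1
git bisect good v1.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;"git bisect" in particular is the literal binary search mentioned earlier: mark a known-good and a known-bad point, and it walks you to the first bad commit.&lt;/p&gt;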

&lt;h3&gt;
  
  
  Stand your ground!
&lt;/h3&gt;

&lt;p&gt;The last bit of advice circles back to the opening comments: software doesn't just "stop working" 99.9% of the time. (I am hedging a bit because of lingering bugs like Y2K, but you get my meaning I hope.) So that said, if your code has not changed and yet something has stopped working, approach it logically and firmly insist that there must be something else that &lt;strong&gt;has&lt;/strong&gt; changed. Many times, someone will insist nothing else has changed and yet, on closer inspection "oh yeah, there was that one update last week, but it couldn't be that, could it?"&lt;/p&gt;

&lt;p&gt;It could.&lt;/p&gt;




&lt;p&gt;Over the last several weeks, I have tried to drill into the core concepts, the "power moves" of effective debugging. These principally involve asking the right questions and focusing on the things that matter the most, instead of working at random or being pulled into unproductive rabbit holes.&lt;/p&gt;

&lt;p&gt;Moving forward, I intend to expand on these core concepts, sometimes with real-life examples from my day-to-day work, or through deep dives into technology that I think you should know.&lt;/p&gt;

&lt;p&gt;Feedback on this, or any other topic mentioned here, is very welcome. Please register to have access to my inbox!&lt;/p&gt;




&lt;h3&gt;
  
  
  On My Radar
&lt;/h3&gt;

&lt;p&gt;Here are some links to tech that I am actively investigating right now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logging is half the battle for understanding what your application or service is doing; effectively monitoring metrics is a powerful tool for early warning of operational problems. Prometheus is that tool.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://prometheus.io"&gt;Prometheus - Monitoring system &amp;amp; time series database&lt;/a&gt;: an open-source monitoring system with a dimensional data model, flexible query language, efficient time series database, and modern alerting approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As mentioned &lt;a href="https://debugmsg.io/debugging-501/"&gt;last week&lt;/a&gt;, being able to hook into the logging of applications you do not directly control the source for can be vital to capturing diagnostic information. This is a key component of that.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://www.fluentd.org"&gt;Fluentd | Open Source Data Collector&lt;/a&gt;: the unified logging layer from the Fluentd project.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>Debugging 501</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 09 May 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/debugging-501-5e3f</link>
      <guid>https://dev.to/tomotvos/debugging-501-5e3f</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p_WoSv1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1561089489-f13d5e730d72%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDd8fG1hc3RlciUyMGNsYXNzfGVufDB8fHx8MTYyMDUwNTU4MQ%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p_WoSv1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1561089489-f13d5e730d72%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDd8fG1hc3RlciUyMGNsYXNzfGVufDB8fHx8MTYyMDUwNTU4MQ%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="Debugging 501" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the past few posts, I have tried to outline some of the key "power moves" that separate top debuggers from others. This post will be the first of the "Mastery" series, where we will either do a deep dive into a specific aspect of the power moves, or a deep dive into a real-life scenario to show how they may be applied. Today, we will go over a particularly nasty issue that I have been working on this past week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem scenario
&lt;/h3&gt;

&lt;p&gt;Without giving away too many details, the scenario is as follows. A very large customer has been having an issue with some data that comes through an integration pipeline between our system and an external system. The issue is that sometimes the numbers are plain wrong. Now, this pipeline has a backup strategy, where temporary inaccuracies are supposed to be patched up nightly, and yet the answers are still wrong. Why?&lt;/p&gt;

&lt;p&gt;The crux of this issue is to identify why the backup strategy is not doing what it is supposed to. The infrequent, temporary glitches in the numbers are a known issue that has deep implications for fixing them, so the "plan B" of a backup is both sensible, and acceptable to the customer. That it does not work is simply &lt;em&gt;unacceptable&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Looking at the big picture, we can immediately identify a number of failure scenarios, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The backup strategy needs to be explicitly enabled, so it is possible that it is simply not turned on.&lt;/li&gt;
&lt;li&gt;The backup strategy is on, but is failing somehow.&lt;/li&gt;
&lt;li&gt;The backup strategy is on, but is &lt;strong&gt;also&lt;/strong&gt; getting the wrong answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Show me the logs
&lt;/h3&gt;

&lt;p&gt;Since we know, unfortunately, that the problem can be reproduced by the customer, the first question to ask is "show me the logs". And here is where we hit our first roadblock. There are several separate systems at work here, each with its own logging, and not all of them are under our control. Yes, we can turn up the logging even on the external system, but that needs to be done manually. What is worse, though, is that the logs are aged out (ridiculously) rapidly, meaning that we do not have any logs covering the time period where the backup should be running.&lt;/p&gt;

&lt;p&gt;So step one is to turn on deeper logging on that external system, and archive the logs so we can catch the time frame that we need. Curiously, despite the obviousness of this step, it was not until the problem reached my attention that the request to see the logs was made.&lt;/p&gt;

&lt;p&gt;Unfortunately there were three wrinkles with that plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The logging won't catch the issue for 24h.&lt;/li&gt;
&lt;li&gt;The request to archive logs on a production system needs to go through proper channels.&lt;/li&gt;
&lt;li&gt;Despite going through channels, the log archiving was still not enabled in time, so we now won't have logs for 24h + 24h.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rather than waste time waiting for logs that may, or may not, provide the answer we are looking for, we &lt;strong&gt;can&lt;/strong&gt; dig into logs that we have more control over on "our side", logs that are thankfully already archived and searchable. The goal of this would be to reduce the problem surface area, eliminating some of the failure modes.&lt;/p&gt;

&lt;p&gt;And that is where we get our first clue as to what might be amiss. There is no record in our logs of the external system reaching out to do the "plan B" backup. Double-checking multiple days' worth of logs gives the same answer: this customer does not have a "plan B" running successfully, or at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Eliminate the impossible
&lt;/h3&gt;

&lt;p&gt;Now, we have been told that yes, the "plan B" is on for this customer. But we &lt;a href="https://debugmsg.io/eliminate-the-impossible"&gt;question everything&lt;/a&gt;. How can it be on when the logs clearly show it is not? Enter another wrinkle: the configuration of these external systems is kept pretty tightly under wraps. It takes the right kind of access to see what is there, and so that introduces more delays while access is obtained.&lt;/p&gt;

&lt;p&gt;Eventually, though, access is obtained and on very close inspection, the configuration appears to be correct: plan B should be running. Hmm. We really, really need those logs! So we double-check that archiving is, indeed, working now and all we need to do is wait.&lt;/p&gt;

&lt;p&gt;But again, in order to not waste time, there are still some facts that can be cleared up. The next most important fact to uncover is the answer to the question: what if the "plan B" is running (even though the logs say otherwise) and we are getting the wrong answer? So we set up a test where &lt;a href="https://debugmsg.io/reducio"&gt;we simulate the exact same interface&lt;/a&gt; to pull the backup data from our system, and compare the results with the reported errant data. It is tedious work to compare data items line for line, but the results are unambiguous: the backup &lt;strong&gt;should&lt;/strong&gt; repair the bad data if it is called.&lt;/p&gt;

&lt;p&gt;Again, we are left to conclude that the backup is not running because: (a) there is no record of it running, and (b) if it were running, the data would be fixed. We really, really need those logs!&lt;/p&gt;


&lt;h3&gt;
  
  
  Resolution
&lt;/h3&gt;

&lt;p&gt;Finally, the second 24h ticks over and we can now see what is going on through the archived logs. The backup &lt;strong&gt;was&lt;/strong&gt; initiated by the external system, but the request throws a timeout exception. (As an aside, the logs for this particular system are very good, and there is a clear distinction of overlapping jobs through unique identifiers.) So if the job actually ran, why did it not appear in the other logs?&lt;/p&gt;

&lt;p&gt;Looking back at the logs on our system, we still don't see that request coming in, but we &lt;strong&gt;do&lt;/strong&gt; see an error 4 minutes later about a socket disconnecting. The request timeout on the external system was only 1 minute, so this socket exception must be a timeout on &lt;strong&gt;our&lt;/strong&gt; side trying to write data after the external system gave up and shut the request down. Frustratingly, there is absolutely no identifier in the log as to the source being handled at the time of that later exception, but the timing is more than a little bit coincidental. A &lt;a href="https://debugmsg.io/if-a-tree-falls"&gt;trace/span ID&lt;/a&gt; would have been really helpful here.&lt;/p&gt;

&lt;p&gt;But at this point, we have a smoking gun that, frustratingly, we would have had two days earlier if logs had been correctly handled on the external system. Two key takeaways from this exercise are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need to have a formal log archiving process on the external system so that we always have searchable logs on hand, without having a gazillion log files on the file system.&lt;/li&gt;
&lt;li&gt;We need to ensure that our system logs context with exceptions so we know what it was working on when it barfed.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The problem has yet to be solved, and I'll be sure to update here when it is. But at this point the back has been broken on it. We know what the root cause is, and so we can try and attack it from several different angles. The technical details of the solution are not important here, but hopefully it has been instructive to see the principles I have been writing about applied to this case.&lt;/p&gt;

&lt;p&gt;Specifically, we saw how vitally important complete logs are, and how if the logs lack contextual data, then we can miss important facts. We also saw how being able to simulate what the external system was doing allowed us to eliminate one of the possible failure modes. It didn't help &lt;strong&gt;solve&lt;/strong&gt; the problem, but it reduced the number of possible causes. In retrospect, those simulated calls also did take a very long time, and so will be an important tool in validating possible optimization of the backup processing. And finally, by insisting on seeing the specific configuration rather than simply accepting an important fact, we were able to have confidence that the logs (when they eventually arrived) would tell the whole truth.&lt;/p&gt;

&lt;p&gt;The most important moral, however, is this. Log files are like backups. When you need them, you &lt;strong&gt;really&lt;/strong&gt; need them, and so it is important to ensure they are capturing the data you need them to be capturing. A log file that is deleted so quickly that the window of time it is relevant for is impossibly small is, frankly, a waste of disk space. If you can influence log file retention for systems you may be asked to debug, exercise that influence and make it right.&lt;/p&gt;

</description>
      <category>debugging</category>
    </item>
    <item>
      <title>If a Tree Falls</title>
      <dc:creator>Tom Otvos</dc:creator>
      <pubDate>Sun, 02 May 2021 12:59:00 +0000</pubDate>
      <link>https://dev.to/tomotvos/if-a-tree-falls-3pgm</link>
      <guid>https://dev.to/tomotvos/if-a-tree-falls-3pgm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"If a tree falls in a forest and no one is around to hear it, does it make a sound?" – Unknown&lt;/p&gt;

&lt;p&gt;"If you can’t measure it, you can’t improve it." – Peter Drucker&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sYG1JiD---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1535669647177-3df1a06cee9b%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDl8fGxvZ2dpbmd8ZW58MHx8fHwxNjE5OTI1NDc2%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sYG1JiD---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.unsplash.com/photo-1535669647177-3df1a06cee9b%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DMnwxMTc3M3wwfDF8c2VhcmNofDl8fGxvZ2dpbmd8ZW58MHx8fHwxNjE5OTI1NDc2%26ixlib%3Drb-1.2.1%26q%3D80%26w%3D2000" alt="If a Tree Falls" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The major theme for this blog thus far has been "Duh, that's pretty obvious". And continuing in that vein, today we are going to talk about logging, which is probably the single most important (and obvious) tool in your belt when debugging difficult problems. Of course, when you are doing code development, and debugging functionality along the way and through unit tests, log files play a secondary role to your debugger of choice. But when your code gets deployed and something goes wrong, the log file is the first critical diagnostic aid you should be asking for, outside of the bug report itself.&lt;/p&gt;

&lt;p&gt;But for a log file to be truly helpful in diagnosing issues, there are some key things you need to be aware of as a developer to maximize its value. These key things, which are true regardless of language or execution runtime, include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating useful log output &lt;strong&gt;always&lt;/strong&gt;, regardless of logging level.&lt;/li&gt;
&lt;li&gt;Creating variable log levels to allow you (or support techs) to "turn up the volume" on what gets emitted in a log file.&lt;/li&gt;
&lt;li&gt;Creating &lt;em&gt;trace&lt;/em&gt; and &lt;em&gt;span&lt;/em&gt; output in the log file, at all logging levels.&lt;/li&gt;
&lt;li&gt;Assuming that log files will be huge, and optimizing for searching.&lt;/li&gt;
&lt;li&gt;Outputting what you think you are going to need to debug a problem!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's dive into each of these to understand their impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log something useful, always
&lt;/h3&gt;

&lt;p&gt;This first point is so self-evident, it is almost embarrassing to include it. But here it is. When your code is running in production, it is very unlikely to be running in a debug mode. That means, generally, you are not going to be generating a lot of output in your log files. But that does not mean your output should amount to nothing more than:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2021-05-01] Starting application...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty useless. Even in production, you want meaningful output to show that the program or service is running, and that if something goes awry you have context.  Do you have multiple threads? Show them starting up and doing stuff. Are you accepting REST calls? Log them. Do you have some scheduled task that kicks off every now and then? Log when it starts, and when it ends.&lt;/p&gt;

&lt;p&gt;If there is a runtime exception, you should be logging that exception with sufficient detail to indicate the cause of the fault, and with sufficient supporting context (like a stack trace) to indicate precisely where the problem occurred. Exceptions are, by definition, exceptional and you should not be worried about filling your log files with exception details. Well, ok, if you &lt;em&gt;are&lt;/em&gt; filling your logs with exception details, you should be worried, but not about log file size.&lt;/p&gt;
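&lt;p&gt;As a minimal sketch of the above, using Python's standard &lt;em&gt;logging&lt;/em&gt; module (the failing function here is just an illustration), a single call captures both the fault and its stack trace:&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")

def risky_operation():
    # Hypothetical stand-in for whatever actually fails in production.
    return 1 / 0

try:
    risky_operation()
except ZeroDivisionError:
    # log.exception() logs at ERROR level and automatically appends
    # the full stack trace, pinpointing exactly where the fault occurred.
    log.exception("risky_operation failed")
```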

&lt;p&gt;What else is particularly useless about that output example above? No time. Always log the time, to milliseconds ideally, in human-readable form. No Unix time please and, if you are writing code for the Internet, always use UTC. Yes, it takes a bit of getting used to but if you need to correlate activity between different time zones, having a common time format is a huge time saver.&lt;/p&gt;
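&lt;p&gt;Here is one way to get human-readable, millisecond-precision UTC timestamps out of Python's standard &lt;em&gt;logging&lt;/em&gt; module (the exact layout is just a sketch):&lt;/p&gt;

```python
import logging
import time

# Configure a formatter that emits ISO-8601-style UTC timestamps
# with millisecond precision, e.g. 2021-05-01T12:00:00.123Z
formatter = logging.Formatter(
    fmt="%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime  # force UTC rather than local time

handler = logging.StreamHandler()
handler.setFormatter(formatter)

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Starting application...")
```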

&lt;p&gt;Remember that the log file will be the first piece of data you receive about an issue, so think about triage: what do I need to know right away to &lt;a href="https://debugmsg.io/reducio/"&gt;narrow down the scope&lt;/a&gt; of the issue?&lt;/p&gt;

&lt;h3&gt;
  
  
  Pump up the volume
&lt;/h3&gt;

&lt;p&gt;Regardless of the logging you always emit, you should always have a way to tune the output, both in terms of the amount of data you put into the log file, and the scope of the functionality being logged. Most (if not all) commonly-available logging libraries support variable logging levels and scope, so this is mostly directed at devs who, for some reason, write their own loggers.&lt;/p&gt;

&lt;p&gt;But this is also pointed at the human debugger: once you have a basic idea of the scope of the issue being debugged, and you have a &lt;a href="https://debugmsg.io/reducio/"&gt;reproducible case&lt;/a&gt;, you need to be able to get more information to allow the problem to be further reduced.&lt;/p&gt;
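&lt;p&gt;With Python's standard &lt;em&gt;logging&lt;/em&gt; module, for example, "turning up the volume" on just one subsystem is a one-liner (the logger names below are hypothetical):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.WARNING)  # quiet by default

# Hypothetical per-subsystem loggers; real code would typically
# use logging.getLogger(__name__) in each module.
net_log = logging.getLogger("app.network")
db_log = logging.getLogger("app.database")

# Turn up the volume on just the suspect subsystem, without
# drowning the log in output from everything else.
net_log.setLevel(logging.DEBUG)

net_log.debug("handshake payload: %r", b"\x01\x02")  # now emitted
db_log.debug("query plan...")                        # still suppressed
```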

&lt;p&gt;As a final note on this, I should point out that when your software is interacting with other software that you &lt;em&gt;do not&lt;/em&gt; own, being able to increase the logging in those components can be invaluable to finding the issue. I was working on a problem (mentioned last week) that involved creating a proxy scenario once I had a rough idea of where the problem was. This was a NodeJS thing, and I was interfacing with a standard NodeJS package for SockJS. The key to solving that problem was not only to reduce it, but also to hook into the SockJS package's logging, crank it way up, and eventually get enough diagnostics from the customer machine to isolate the problem. Frankly, I didn't know a whole lot about the internals of that package, but I was able to figure out how its logging worked and coax it into giving up what I needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Following the thread
&lt;/h3&gt;

&lt;p&gt;The concepts of &lt;em&gt;trace&lt;/em&gt; and &lt;em&gt;span&lt;/em&gt; are fundamental to effectively debugging software systems where the processing is &lt;a href="https://lightstep.com/blog/opentelemetry-101-what-is-tracing/"&gt;distributed across multiple components&lt;/a&gt;. The concepts are also useful, albeit less so, in monolithic, multi-threaded systems. In basic terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;em&gt;trace&lt;/em&gt; is a sequence of logged steps that represent the entire flow of an operation;&lt;/li&gt;
&lt;li&gt;a &lt;em&gt;span&lt;/em&gt; is a subset of a trace that is handled by a single component.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A trace and a span are each typically identified by a unique value (the trace ID or span ID), and that value is passed along the workflow so that different components can emit the same identifier in their logs. Imagine a workflow where you have two separate services handling part of the work. A request for the work comes in, and the first service assigns the &lt;em&gt;trace ID&lt;/em&gt; for that work item. It also creates a &lt;em&gt;span ID&lt;/em&gt; while it does its particular thing. It then passes the work item over to a second service that generates its own &lt;em&gt;span ID&lt;/em&gt; but, critically, uses the same &lt;em&gt;trace ID&lt;/em&gt; as the initial service. In this way, these two bits of work can be connected (or correlated) through the log files.&lt;/p&gt;

&lt;p&gt;Standards such as OpenTelemetry try to make it easy to pass these values around. But the concept is very fundamental and you don't need a library to do it. As long as there is a single value that can be shared across different parts of a distributed system, you have that critical link that ties everything together. And obviously, that value must be emitted in the log files.&lt;/p&gt;

&lt;p&gt;Even within a component that executes work in a multi-threaded fashion, identifying each processing thread is critical because, by definition, multi-threaded components will be doing multiple things at the same time. Log output will be interleaved between the various threads as they execute concurrently, and having a unique identifier for each thread is essential to teasing them apart.&lt;/p&gt;

&lt;p&gt;And please note that having the values, and making them &lt;em&gt;obvious&lt;/em&gt; in the log file are two different things, so make them &lt;strong&gt;obvious&lt;/strong&gt; and searchable, which leads me to...&lt;/p&gt;

&lt;h3&gt;
  
  
  Log files need to be searched
&lt;/h3&gt;

&lt;p&gt;Non-trivial debugging will almost invariably involve working through a lot of log files. These files may be huge, so you cannot assume that you can just open them in Notepad or BBEdit and be on your way. Sometimes grep is your only choice, or you may need to write a small tool to pull data out. Over the years, I have done it all.&lt;/p&gt;

&lt;p&gt;But one thing stands out: log files must be text. It is the &lt;em&gt;lingua franca&lt;/em&gt; of computer systems and, even if log files are ingested into bigger systems for correlation and searching (e.g., ELK), a text file will always be consumable whereas proprietary "compressed" log formats are just a pain in the ass. For years, I had to work with an inherited logging system that used proprietary file formats, necessitating a custom log reader to look at them. Needless to say, as files got bigger, the reader kept choking on the data, and it was a painful exercise to debug across multiple files.&lt;/p&gt;

&lt;p&gt;Fast forward to my recent experience, where I can quickly pull multiple files into BBEdit, filter rows based on traces and spans, consolidate them into a single window that I can then sort by date and time (because they are all UTC, of course), and do really powerful analysis and debugging. Smart, simple logging empowers deep introspection and effective debugging.&lt;/p&gt;

&lt;p&gt;But as noted above, sometimes grep is your only choice when you have to search across multiple gigabyte-sized files. To facilitate effective pattern matching, the log files need to have the key data elements (date, time, trace, span, log message type, etc.) highly structured and easily parsed. Aim to have a standard logging format that is used across all the software you or your company produces, and you'll have developers who can effectively diagnose issues, even on components they did not write.&lt;/p&gt;
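&lt;p&gt;To illustrate, here is a sketch (in Python, with an invented field layout) of how a consistently structured line stays trivially parseable:&lt;/p&gt;

```python
import re

# Hypothetical structured layout: date time trace=... span=... [LEVEL] message
LINE = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) trace=(?P<trace>\S+) span=(?P<span>\S+) "
    r"\[(?P<level>\w+)\] (?P<message>.*)"
)

sample = "2021-05-01 12:00:00.123Z trace=ab12 span=cd34 [ERROR] upstream timeout"
match = LINE.match(sample)
if match:
    print(match.group("trace"), match.group("level"))  # fields pop right out
```

&lt;p&gt;Because every field sits in a fixed position, the same discipline makes command-line searches precise, e.g. &lt;em&gt;grep 'trace=ab12' *.log&lt;/em&gt;.&lt;/p&gt;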

&lt;h3&gt;
  
  
  Think like a debugger
&lt;/h3&gt;

&lt;p&gt;When you are writing code and emitting what you hope are useful log entries, think like you are going to need to debug this exact piece of code six months or a year from now. Are you putting out what you need to make it &lt;strong&gt;easy&lt;/strong&gt; to infer the context? Sure, trace and span identifiers are technically all you need to link everything up, but trace and span identifiers are for machines to use to correlate logs. What about you, the human called on to debug this piece of code?&lt;/p&gt;

&lt;p&gt;Make it easy on yourself and add that extra bit of "stuff" to help figure things out. I am currently working in the SaaS world, where requests to services come from literally everywhere. So I need to identify the "who" for a request as much as I need to correlate the log entries tied to that request. How many log files do I need to pore through to find the "who" associated with a particular span? Would it help if I included the "who", redundantly perhaps, in every major step in the processing path?&lt;/p&gt;

&lt;p&gt;Yeah, it would help.&lt;/p&gt;

&lt;p&gt;There are challenges, of course, to emitting too much identifying data. There are even laws preventing personally identifying information from being logged in some contexts. But those cases are far less common, and merely being aware should steer you clear and still allow you to log effectively.&lt;/p&gt;

&lt;p&gt;When your software is in final validation, look closely at the logs. Do they tell the full story, or are there assumptions or missing data? While it may not be part of the acceptance criteria for a component, treat the log output as your personal acceptance gate. You, or a co-worker, will be relying on it someday.&lt;/p&gt;




&lt;p&gt;Logging is a very important topic, and later posts will dive deeper into some more specifics, such as &lt;a href="https://opentelemetry.io"&gt;OpenTelemetry&lt;/a&gt;. In the meantime, here are some other technologies that might be of interest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DataDog, &lt;a href="https://www.datadoghq.com"&gt;https://www.datadoghq.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Splunk, &lt;a href="https://www.splunk.com"&gt;https://www.splunk.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ELK or ElasticStack, &lt;a href="https://www.elastic.co/elastic-stack"&gt;https://www.elastic.co/elastic-stack&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>debugging</category>
    </item>
  </channel>
</rss>
