Hello and welcome back to the DevOps series. This week I will talk about best practices for troubleshooting problems, specifically problems with servers and the operating systems that the deployed software runs on. This information will be most beneficial to members of Operations teams. Given that we have already covered core concepts that apply to development, it is appropriate to include a section that system administrators, in particular, can benefit from.
To start, you have a significantly better chance of identifying a problem's root cause if you know how the frameworks and supporting software on the server work.
The simple fact is that when people don't understand what their infrastructure is doing, they are more inclined to blame it for spontaneous problems whose causes are still unknown.
Let's use a simple example. If you are troubleshooting why network requests to a particular URL are failing, someone who doesn't know about firewalls is more likely to blame them for that failure. However, this does not identify:
- Whether it's a malformed firewall rule preventing traffic from going through, or
- Whether the host is unreachable, or
- Whether the host is up but the webserver is down, or
- Whether both the system and the web server are up, but the program that's supposed to handle the request terminated unexpectedly.
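The layered checks above can be sketched in code. This is a minimal illustration, not a complete diagnostic tool: the host and port are placeholders for whatever endpoint is failing, and the messages are assumptions about the most likely cause at each layer.

```python
import socket

def diagnose(host, port=443, timeout=3):
    """Walk down the stack to narrow a failing request to one layer.

    host/port are hypothetical placeholders; substitute the failing
    endpoint you are actually troubleshooting.
    """
    # Layer 1: can we even resolve the name?
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "DNS resolution failed -- check the hostname or the resolver"

    # Layer 2: can we open a TCP connection?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "host is up but nothing is listening -- is the web server down?"
    except (socket.timeout, OSError):
        # Silence usually means the host is unreachable or a firewall
        # rule is dropping the traffic.
        return "connection timed out -- unreachable host or a firewall rule?"

    # Layer 3: TCP works, so look at the application itself.
    return "TCP connect succeeded -- inspect the application (HTTP status, logs)"
```

Each return narrows the search to one layer, which is exactly the point: the answer is never just "the firewall" until the lower layers have been ruled out.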
As you can see, a single failure can have many potential causes, so it is imperative to have at least a basic understanding of how each technology in the stack functions in order to narrow down the problem.
If you are in a setting where failures must be resolved as quickly as possible, then you probably cannot afford to wait on slow-running diagnostics tests. Let them run in the background if you can, and start applying the faster diagnostics tests first. If some slow tests can't be automated and have many steps that must be run manually, defer them until you have finished running the faster tests.
A good set of tests to apply first is the ones that have successfully unmasked problems before. The same situation may be repeating itself, so running previously successful tests first resolves those kinds of issues more quickly.
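One simple way to implement this prioritization is to keep a count of how often each check has identified a root cause in the past and sort by it. The check names and counts below are made up for illustration.

```python
def order_checks(checks, hit_counts):
    """Sort diagnostic checks so past winners run first.

    checks     -- list of check names (illustrative)
    hit_counts -- map of check name -> times it found a root cause
    """
    # Checks with no history default to 0 and run last; sorted() is
    # stable, so checks with equal counts keep their original order.
    return sorted(checks, key=lambda name: hit_counts.get(name, 0), reverse=True)
```

A team could persist the hit counts alongside its troubleshooting journal and bump a count each time a check pinpoints the cause.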
If the same problem keeps reappearing, however, that may be the product of a deeper underlying issue, which you should work on resolving instead, provided the cost of fixing it is worth the effort it will save. For example, it is probably not worth patching a cluster of an ancient legacy system; better to replace the entire cluster with something newer.
Searching the internet for solutions moves further back in line when you already have an arsenal of diagnostics tests to try first (especially if your company has a corporate firewall that blocks most sites). But don't discard this method, because the system problem you're trying to resolve may turn out to be a unique one that nobody on the team has encountered before. In times like those, you need to rely on pages on the internet that list solutions to that problem.
If you do successfully find a solution, you can add any new diagnostics tests discovered into your cache of trials for the team to use in the future.
A journal of the problem’s symptoms, the diagnostics attempted and their results, and the solution that resolved the issue, along with each entry’s time and date, will prove indispensable. It helps team members who have to diagnose the problem several months later and may have forgotten the details, as well as new team members who haven't previously diagnosed this problem.
It will also help inform other staff who need to be in the know, such as managers, as well as users and customers, of any problems that have occurred and how long they took to resolve. Logs play a pivotal role in services’ status pages.
You must be able to tell colleagues and your boss the status of the problem resolution, such as which solutions have been applied and with what results. A log journal helps immensely here.
If you have multiple team members trying different diagnostics, everybody needs to know which diagnostics worked and which didn't so they can adjust the remaining set accordingly. Email and IRC are two particularly useful media for announcing statuses to everyone.
One-on-one conversations work well when somebody needs more specific symptoms and results for their own diagnostics.
Generally, group conferences are not well suited for status updates because they hold back everyone on the team, including members who could have used the time to try other diagnostics. They are helpful, though, after the failure has been resolved, for discussing what steps need to be taken to avoid a similar failure in the future.
Even if you are diagnosing problems by yourself in an individual setting, such as a personal project, where you don't report to someone, it's still essential you possess the communication skills necessary to explain the problem to affected users.
System failures require a detailed inspection to find their root cause. Having background knowledge of the affected components, attempting simple diagnostics before complicated ones, and searching the internet will all reduce the time to recovery, while keeping a log of the diagnosis serves as a reference for identical problems that may happen in the future. Finally, good communication allows teammates to get the information they need to diagnose the issue quicker, and lets users know about any failures that may affect them.