As we all know, some bugs are easier to find and fix than others. Most bugs can be found easily by reasoning about the code after observing unexpected behaviour. Some bugs are trickier, but a debugger and/or carefully placed log messages will illuminate them. But some bugs are heisenbugs: bugs that change behaviour when you try to analyse them. Sometimes these bugs may even seem to disappear when you modify the code to add logging messages!
I have diagnosed some fairly tricky bugs in my time, including deadlocks, lock contention, livelocks, and race conditions. I have found two officially verified bugs in third-party software: one was an XOrg bug and the other was a .NET WCF Framework bug.
So I like to think of myself as a relatively good debugger. Once upon a time, I used to think that debugging was more art than science. However, I now apply more rigour to the process and I thought I would document it for future reference.
I think 80% of the work in debugging a heisenbug revolves around figuring out how to reproduce it predictably. If you can document an exact set of steps that reproduces a bug, you have basically won the game in my opinion. It will only be a relatively short time before a reproducible bug is found and fixed.
Reproducing a heisenbug
1. Start a technical journal
This is where you will ask yourself questions and track your time. It is useful to stop yourself from running around in circles when dealing with very difficult bugs. It is also useful post-op to work out the total cost of fixing the bug.
2. Gather diagnostics
I collect logs, events, and other diagnostic data for any software systems related to the bug and bundle them up in one place.
3. Document a timeline
I try not to make assumptions about what actually happened until I at least build a timeline of events that led up to the bug based on the logging and diagnostic data I collected.
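A timeline like this can be assembled by merging the collected logs on their timestamps. The following is a minimal Python sketch, assuming every log line starts with an ISO 8601 timestamp followed by a space; the source names and sample lines are made up for illustration.

```python
from datetime import datetime


def build_timeline(sources):
    """Merge log lines from several sources into one list ordered by time.

    `sources` maps a source name to a list of lines. Each line is assumed to
    start with an ISO 8601 timestamp, then a space, then the message.
    """
    events = []
    for name, lines in sources.items():
        for line in lines:
            stamp, _, message = line.partition(" ")
            events.append((datetime.fromisoformat(stamp), name, message))
    # Sort on timestamp, then source name, so ties are ordered deterministically.
    events.sort(key=lambda event: (event[0], event[1]))
    return events


if __name__ == "__main__":
    # Hypothetical sample data standing in for real client and server logs.
    sample = {
        "client": [
            "2024-01-01T10:00:02 request sent",
            "2024-01-01T10:00:09 timeout",
        ],
        "server": [
            "2024-01-01T10:00:03 request received",
            "2024-01-01T10:00:04 lock wait",
        ],
    }
    for when, source, message in build_timeline(sample):
        print(when.isoformat(), source, message)
```

Bear in mind that clocks on different machines are rarely perfectly synchronised, so the order of events that sit very close together across sources should be treated with suspicion.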
4. Try to reproduce the bug in the same environment
When it comes to software architectures using multiple threads, locks, and multiple processes, I tend to think more in terms of the probability of the bug occurring. I find I may have to initially repeat the exact same steps in the exact same environment many times to see the bug once. So after I have documented a timeline of events, I usually try the same operation that led to the bug at least 10 times in an environment as similar as possible to the reported environment. Where I have had to investigate critical bugs, I have even had to develop code to automate the buggy operation and detect the presence of the bug because the bug initially occurred only once every ten thousand operations.
If I still can't reproduce the bug, it's time to take a step back and look at the bigger picture.
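A minimal sketch of such an automation harness in Python, for illustration only: `operation` and `is_buggy` are hypothetical callables that you would write for the system under investigation, and the flaky stand-in at the bottom merely simulates a rare failure.

```python
import random


def reproduce(operation, is_buggy, attempts=10_000):
    """Run `operation` repeatedly and record which attempts show the bug.

    `operation` performs the suspect steps and returns a result;
    `is_buggy` inspects that result and returns True when the bug is present.
    Both are hypothetical callables supplied by the investigator.
    """
    failures = []
    for attempt in range(attempts):
        result = operation()
        if is_buggy(result):
            failures.append(attempt)
    return failures


if __name__ == "__main__":
    # Stand-in for the real operation: fails roughly once in ten thousand runs.
    rng = random.Random(42)

    def flaky_operation():
        return rng.randrange(10_000)

    hits = reproduce(flaky_operation, lambda value: value == 0, attempts=100_000)
    print(f"{len(hits)} failures in 100000 attempts")
```

Recording the attempt numbers rather than just a count makes it easier to line failures up against the logs afterwards.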
5. Draw a boundary between the code that could be involved and code that is probably not involved
If the bug doesn't surface, I try to take time to read through the code that may be relevant to the bug. What I aim to do at this stage is to group the code into "likely to be related to the bug" and "unlikely to be related to the bug".
6. Think about other conditions that could affect the buggy operation
Against the background of the code I now have within my sights, I think about other conditions that could contribute to the emergence of the bug. There are usually many variables that can play a role in a bug but, as an example, some of these include:
- What were the other users and software systems doing at the time?
- What were the CPU, memory, and disk loads like on the client and server?
- What was the network performance at the time?
Sometimes investigating these other variables can be fruitless but at the very least it usually triggers some thoughts about what might be happening. Even if the process eliminates some variables then that will also ultimately help.
7. Increase load
Next I either use a load tool (to artificially put pressure on the memory, CPU, and/or disk) or try to load the software systems up with comparable load to when the bug was reported (if I haven't already done so). If I suspect a network glitch was involved, I use a tool to slow and/or drop network packets between software systems.
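If no dedicated load tool is to hand, CPU pressure can be approximated with a few lines of Python. This is only a rough sketch covering CPU load (not memory, disk, or network), and the function names are my own invention rather than any particular tool's interface.

```python
import multiprocessing
import time


def burn_cpu(seconds):
    """Keep one core busy for about `seconds` and return the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def apply_cpu_load(seconds, workers=None):
    """Run `burn_cpu` in one process per core (or in `workers` processes)."""
    workers = workers or multiprocessing.cpu_count()
    with multiprocessing.Pool(workers) as pool:
        return pool.map(burn_cpu, [seconds] * workers)


if __name__ == "__main__":
    # Run the suspect operation from another terminal while this is active.
    counts = apply_cpu_load(seconds=5)
    print(f"loaded {len(counts)} cores for about 5 seconds")
```

Separate processes are used rather than threads because Python threads share one interpreter lock and so would not saturate more than one core.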
8. Reduce load
Sometimes the problem doesn't surface under high load but does under low load so next I will try to reduce the load.
9. Decision time
If I have done the previous steps and the bug still hasn't surfaced, I need to make a decision: do I continue to put time into reproducing it or do I switch to the "monitor for re-occurrence" phase?
9a. Continue looking for bug
At this stage I will talk to other senior developers about the problem to see if that triggers some ideas. If you are a junior dev you will probably do this straight away, but if you are a senior dev you will probably leave it until you have had a good look around first.
9b. Monitor for re-occurrence
If there is a lack of diagnostic and log information, monitoring for re-occurrence is a good option. However, this step should involve some code changes. I usually schedule, implement, and test diagnostic data collection that will help isolate the problem. This can introduce a performance risk, which needs to be traded off against the severity of the bug.
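One way to limit that performance risk is to keep detailed records in memory and only write them out when something goes wrong. The following Python sketch shows the idea with a small custom logging handler; the class name, the logger name, and the capacity are all illustrative assumptions, not part of any existing system.

```python
import collections
import logging


class RingBufferHandler(logging.Handler):
    """Keep the last `capacity` records in memory; write them out on an error.

    Detailed records cost little until something goes wrong, at which point
    the recent history is forwarded to `target` (any other logging handler).
    """

    def __init__(self, capacity, target, trigger_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.buffer = collections.deque(maxlen=capacity)
        self.target = target
        self.trigger_level = trigger_level

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.trigger_level:
            # Dump the buffered history, oldest first, then start afresh.
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()


if __name__ == "__main__":
    log = logging.getLogger("suspect_module")  # hypothetical logger name
    log.setLevel(logging.DEBUG)
    log.addHandler(RingBufferHandler(capacity=100, target=logging.StreamHandler()))
    log.debug("acquired lock A")
    log.debug("waiting for lock B")
    log.error("operation timed out")  # triggers the dump of all three records
```

Because nothing is written until an error-level record arrives, the extra logging is less likely to change the timing of the code enough to hide the bug, although it can never be ruled out entirely.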
Finding the cause of a heisenbug
Once the steps to reproduce a bug are repeatable to at least a 1 in 3 chance, I know I am on the home straight. All I need to do is progressively shrink that code boundary down to such an extent that I can either use carefully placed logging messages or a debugger to work out precisely what is going wrong.
Fixing a heisenbug
Once you know what is going on, the ideal next step is to write a test case that proves the existence of the bug, fix the bug, then show that the test case now passes. This is not always possible (for example, in the case of some race conditions), so in these cases some refinement of the manual testing procedures may need to occur.
The bug, however, should also be analysed in detail because it is like a sentinel indicating the presence of diseased code. Sometimes the disease is minor and is relatively low risk to fix. Sometimes the disease extends beyond a unit of code and sometimes it indicates an architectural flaw. So how I fix the bug really depends on the extent of the disease.
Testing a heisenbug fix
Based on the probability of the bug occurring, I usually confirm the same test passes a number of times. For example, if the bug had a 1 in 3 chance of occurring, then I (or preferably someone else) will run the same test 20 times to verify the bug is now fixed (automating the process if possible). Any regression tests associated with the operation are run. Regression tests for any other operations affected by the change are also run. I think the code reviewer should do some unstructured testing around the buggy operation if time permits or the bug is critical.
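The arithmetic behind a run count like that can be sketched as follows, assuming each run triggers the bug independently and with the same probability (real heisenbugs may not be so well behaved). With a 1 in 3 chance per run, 20 consecutive passes would happen by luck only about 0.03% of the time if the bug were still present. The function names are my own.

```python
import math


def chance_of_false_pass(bug_probability, runs):
    """Probability that `runs` independent runs all pass while the bug remains."""
    return (1.0 - bug_probability) ** runs


def runs_needed(bug_probability, confidence=0.99):
    """Smallest number of runs that would catch a still-present bug
    with probability at least `confidence`.

    Assumes each run triggers the bug independently with `bug_probability`.
    """
    if not 0.0 < bug_probability < 1.0:
        raise ValueError("bug_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - bug_probability))


if __name__ == "__main__":
    print(f"false pass after 20 runs at 1 in 3: {chance_of_false_pass(1 / 3, 20):.6f}")
    print(f"runs for 99% at 1 in 3: {runs_needed(1 / 3)}")
    print(f"runs for 99% at 1 in 10000: {runs_needed(1 / 10_000)}")
```

The same calculation shows why rare bugs need automation: the lower the per-run probability, the more clean runs are needed before a pass means anything.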
By starting with a system that matches as closely as possible the system on which the heisenbug was reported, and methodically narrowing the scope of code that can be in play when the bug occurs, I can gradually home in on the cause of the bug. If the bug submerges, I can easily step back (because I have documented my progress in a journal) and come at it from a different angle. When it comes to fixing the bug, I can use the probability of it occurring to verify that it has been fixed.