DEV Community

Discussion on: What's the longest you've ever spent debugging a single bug?

Mark Gawler

Over the years, time scales have shrunk for both release cycles and debugging. I started programming on hard real-time systems, using assembly code. Back then we would normally achieve one (occasionally two) releases a year. The system was expected to be in service for a minimum of 10 years.

I can think of one intermittent fault on an interface between two systems which took me nearly a decade to find. This wasn't continuous effort, but I had at least three attempts at resolving it. By the time I started looking at the bug, the system was already a legacy system, and a replacement contracted through our competitor was on its way. I was a junior engineer known for having an aptitude for low-level coding, so I was put to work. Not having access to one of the two systems, I could only review the code and write a report.

Two or three years later our customer dug up my report and agreed we could have access to both systems. The catch was that I only had a single day on site, at the opposite end of the country. The nature of intermittent faults is that they will not occur while you are debugging. True to form, the system ran perfectly for the whole day and no useful information was gained.

As luck would have it, our competitor failed to deliver on its promises, and our legacy system got a life extension and was rehosted on new hardware. I led the software effort and it went into service, at which point I left the project. Once in service, the original intermittent fault came back with a vengeance, and our customer was not happy. I got seconded back to help fix the issue. We enhanced our simulator to emulate the other system and started debugging. Eventually I found the issue, which we traced to the use of the wrong entry point in an error recovery routine in the real-time OS. The programmer, some 22 years earlier, had typed a 5 instead of a 3. The junior engineer who modified the simulator for me was younger than the bug! Having fixed the bug, I was reminded of the report I had written nearly 10 years before, which correctly pointed to the exact error routine at fault.